x <- c(97.88, 107.91, 88.26, 115.21, 87.38)
\[ s^2 = \frac{\Sigma(x_i - \bar{x})^2}{n - 1} \tag{3.1}\]
\[ s = \sqrt{s^2} = \sqrt{\frac{\Sigma(x_i - \bar{x})^2}{n - 1}} \tag{3.2}\]
We used the population variance above because its denominator, N, is easier to demonstrate: the variance is simply the average area of all the squares. However, we will almost always want to estimate the sample variance, which changes the denominator to n - 1 (technically called the degrees of freedom, which will be described in more detail in later chapters). In fact, the var function in R always returns the sample variance (and, by extension, the sd function returns the sample standard deviation). In practice, reporting the sample variance when working with a population can be considered a slightly conservative estimate, since dividing by n - 1 rather than N gives a slightly larger value. However, as n increases, the sample and population variances converge (see Appendix A for more details about convergence, limits, and core calculus concepts helpful for learning statistics).
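To make the two denominators concrete, the following sketch computes the variance of the vector x from above by hand and compares the result with R's built-in var and sd functions. The names ss, s2, and sigma2 are illustrative and do not come from the original text.

```r
x <- c(97.88, 107.91, 88.26, 115.21, 87.38)
n <- length(x)

# Sum of squared deviations from the mean (the total area of the squares)
ss <- sum((x - mean(x))^2)

s2 <- ss / (n - 1)   # sample variance, Equation 3.1
sigma2 <- ss / n     # population variance, denominator N

# var() and sd() use the n - 1 denominator
all.equal(s2, var(x))
all.equal(sqrt(s2), sd(x))

# Rescaling var() by (n - 1) / n recovers the population variance
all.equal(sigma2, var(x) * (n - 1) / n)
```

Each all.equal call should return TRUE, which confirms that var reports the sample variance and that the population variance is only a rescaling of it.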
Let’s first define two functions to calculate the sample and population variances.
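The function definitions do not appear in this text. A minimal sketch consistent with Equation 3.1 might look like the following. The names sample_var and population_var match the calls used later in this section, but the bodies are an assumption, and they expect a numeric vector with no missing values.

```r
# Sample variance: sum of squared deviations divided by n - 1
sample_var <- function(x) {
    sum((x - mean(x))^2) / (length(x) - 1)
}

# Population variance: sum of squared deviations divided by N
population_var <- function(x) {
    sum((x - mean(x))^2) / length(x)
}
```

Under this sketch, sample_var(x) agrees with var(x), while population_var(x) is smaller by a factor of (n - 1) / n.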
We will now calculate the sample and population variances of the first n values, for n from 2 to 500, of a vector randomly generated from the normal distribution with a mean of 0 and a standard deviation of 1.
max_n <- 500
x <- rnorm(max_n)
variances <- data.frame(n = seq(2, max_n, 1),
                        sample_var = numeric(max_n - 1),
                        population_var = numeric(max_n - 1))
for(i in seq_len(nrow(variances))) {
    n <- variances[i,]$n
    variances[i,]$sample_var <- sample_var(x[1:n])
    variances[i,]$population_var <- population_var(x[1:n])
}
variances <- variances |>
    dplyr::mutate(Difference = abs(population_var - sample_var))
Figure @ref(fig:fig-variance-convergence) depicts the two variances along with their difference for sample sizes ranging from 2 to 500. The blue line representing the difference is already very close to zero by n = 200.
Examining the last five rows (i.e., n = 496 to 500) shows that the difference between the two variances is about 0.002, well below 0.01.
      n sample_var population_var  Difference
495 496   1.013745       1.011701 0.002043841
496 497   1.012765       1.010727 0.002037757
497 498   1.015033       1.012995 0.002038220
498 499   1.014507       1.012474 0.002033080
499 500   1.012474       1.010449 0.002024949
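The size of these differences follows directly from the two denominators. As a short derivation, assuming the population variance here is computed around the sample mean with denominator n (written \(\sigma^2\) below):

\[ s^2 - \sigma^2 = \frac{\Sigma(x_i - \bar{x})^2}{n - 1} - \frac{\Sigma(x_i - \bar{x})^2}{n} = \frac{\Sigma(x_i - \bar{x})^2}{n(n - 1)} = \frac{s^2}{n} \]

The difference is therefore the sample variance divided by n. In the last row, 1.012474 / 500 is approximately 0.002025, which matches the Difference column, and the gap shrinks toward zero as n grows.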
This Shiny application can be run locally using the VisualStats::variance_shiny() function.