x <- c(97.88, 107.91, 88.26, 115.21, 87.38)
\[ s^2 = \frac{\Sigma(x_i - \bar{x})^2}{n - 1} \tag{3.1}\]
\[ s = \sqrt{s^2} = \sqrt{\frac{\Sigma(x_i - \bar{x})^2}{n - 1}} \tag{3.2}\]
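As a quick check of Equations 3.1 and 3.2, the following sketch computes the sample variance and standard deviation by hand for the five values assigned to x earlier and compares them with R's built-in functions. The intermediate names (x_bar, s2, s) are chosen here for illustration only.

```r
# The five example values used earlier in this section
x <- c(97.88, 107.91, 88.26, 115.21, 87.38)

n <- length(x)       # number of observations
x_bar <- mean(x)     # sample mean

# Equation 3.1: sum of squared deviations divided by n - 1
s2 <- sum((x - x_bar)^2) / (n - 1)

# Equation 3.2: the standard deviation is the square root of the variance
s <- sqrt(s2)

# Both comparisons should be TRUE because var() and sd() use n - 1
isTRUE(all.equal(s2, var(x)))
isTRUE(all.equal(s, sd(x)))
```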
We used the population variance above because its denominator, N, is easier to demonstrate: the variance is simply the average area of all the squares. However, we will almost always want to estimate the sample variance, which changes the denominator to n - 1 (technically called the degrees of freedom, described in more detail in later chapters). In fact, the var() function in R always returns the sample variance (and, by extension, the sd() function returns the sample standard deviation). In practice, reporting the sample variance when working with a population can be considered a slightly conservative estimate, since for the same data it is never smaller than the population variance. As n increases, the sample and population variances converge (see Appendix A for more details about convergence, limits, and core calculus concepts helpful for learning statistics).
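To see why the two estimates converge, note that for the same data they share a numerator and differ only in the denominator. Writing \(s_N^2\) for the version that divides by n (a label assumed here for illustration, not notation taken from the text):

\[ s_N^2 = \frac{\Sigma(x_i - \bar{x})^2}{n} = \frac{n - 1}{n}\, s^2, \qquad s^2 - s_N^2 = \frac{s^2}{n} \]

The gap between the two is therefore the sample variance divided by n, which shrinks toward zero as n grows.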
Let’s first define two functions to calculate the sample and population variances.
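The function definitions themselves are not shown here, so the following is a minimal sketch of what they might look like. Only the names sample_var and population_var come from the code that follows; the bodies are an assumption based on the formulas above.

```r
# Sample variance: sum of squared deviations divided by n - 1
# (assumed implementation; should agree with R's built-in var())
sample_var <- function(x) {
  sum((x - mean(x))^2) / (length(x) - 1)
}

# Population variance: sum of squared deviations divided by N
# (assumed implementation)
population_var <- function(x) {
  sum((x - mean(x))^2) / length(x)
}
```

For example, sample_var(c(1, 2, 3, 4)) should match var(c(1, 2, 3, 4)), while population_var() on the same vector returns a slightly smaller value.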
We will now calculate the sample and population variance from a vector randomly generated from the normal distribution with mean of 0 and standard deviation of 1.
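max_n <- 500
x <- rnorm(max_n)  # random draws from a normal distribution (mean 0, sd 1)
# One row per sample size, from n = 2 up to max_n
variances <- data.frame(n = seq(2, max_n, 1),
                        sample_var = numeric(max_n - 1),
                        population_var = numeric(max_n - 1))
# Compute both variances using only the first n values of x
for(i in seq_len(nrow(variances))) {
  n <- variances[i,]$n
  variances[i,]$sample_var <- sample_var(x[1:n])
  variances[i,]$population_var <- population_var(x[1:n])
}
# Absolute difference between the two estimates at each sample size
variances <- variances |>
  dplyr::mutate(Difference = abs(population_var - sample_var))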
Figure @ref(fig:fig-variance-convergence) depicts the two variances along with their difference for sample sizes ranging from 2 to 500. The blue line representing the difference is already very close to zero by n = 200.
Examining the last five rows (i.e., n = 496 to 500) shows that the difference between the two variances is about 0.002, well below 0.01.
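The code that draws the figure is not shown, so here is a rough base R sketch of how a similar plot could be produced. The seed, the colors other than blue for the difference, the axis labels, and the use of var() in place of the helper functions are all assumptions made for illustration.

```r
set.seed(2112)  # hypothetical seed, only so the sketch is reproducible
max_n <- 500
x <- rnorm(max_n)
n <- 2:max_n

# Sample variance of the first k values, for each k from 2 to max_n
sample_v <- sapply(n, function(k) var(x[1:k]))
# Population variance rescales the sample variance by (n - 1) / n
population_v <- sample_v * (n - 1) / n
difference <- abs(population_v - sample_v)

# Draw the two variances and their difference on one set of axes
plot(n, sample_v, type = "l", col = "darkgreen",
     ylim = range(0, sample_v, population_v),
     xlab = "Sample size (n)", ylab = "Variance")
lines(n, population_v, col = "orange")
lines(n, difference, col = "blue")
legend("topright",
       legend = c("Sample variance", "Population variance", "Difference"),
       col = c("darkgreen", "orange", "blue"), lty = 1)
```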
      n sample_var population_var  Difference
495 496  0.9842646      0.9822802 0.001984404
496 497  0.9836034      0.9816243 0.001979081
497 498  0.9827055      0.9807322 0.001973304
498 499  0.9811142      0.9791481 0.001966161
499 500  0.9792802      0.9773217 0.001958560
This Shiny application can be run locally using the VisualStats::variance_shiny() function.