4  Correlation

Correlation is a measure of the relationship between two variables. The correlation can range from -1 indicating a “perfect” negative relationship to 1 indicating a “perfect” positive relationship. A correlation of 0 indicates no relationship. Figure 4.1 depicts scatter plots with various correlations.

Show the code
plots <- list()
for(rho in c(-1, -0.8, -0.4, 0, 0.4, 0.8, 1)) {
    df <- mvtnorm::rmvnorm(n = 30,
                           mean = c(0, 0),
                           sigma = matrix(c(1^2, rho * (1 * 1),
                                         rho * (1 * 1), 1^2), 2, 2)) |>
        as.data.frame()
    
    plots[[length(plots) + 1]] <- ggplot(df, aes(x = V1, y = V2)) + 
        geom_point(size = 0.5) +
        theme_minimal() +
        theme(axis.text = element_blank(), panel.grid = element_blank()) +
        xlab(rho) + ylab('') +
        coord_equal() +
        xlim(c(-3, 3)) + ylim(c(-3, 3))
}
plots$nrow <- 1
do.call(plot_grid, plots)
Figure 4.1: Scatterplots representing correlations from -1 to 1.

For a population, the correlation is defined as the ratio of the covariance to the product of the standard deviations, and is typically denoted using the Greek letter rho (\(\rho\)), is defined as:

\[ \rho = \frac{cov(X,Y)}{\sigma_{X} \sigma_{Y}} \tag{4.1}\]

The standard deviation (\(\sigma\)) is equal to the square root of the variance (\(\sigma = \sqrt{\frac{\Sigma(x_i - \bar{x})^2}{n - 1}}\)) which is covered in detail in Section 3.1. What is new here is the covariance. Like variance, we are interested in deviations from the mean except now in two dimensions. The formula for the covariance is:

\[ cov_{xy} = \frac{\Sigma(x_i - \bar{x})(y_i - \bar{y})}{n - 1} \tag{4.2}\]

Let’s break down Equation 4.2 visually. The following R code generates a data frame with two variables, x and y, with means of 20 and 40 and standard deviatiosn of 2 and 3, respectively. The population correlation is 0.8 (note that the sample will likely have a slightly different correlation).

mean_x <- 20
mean_y <- 40
sd_x <- 2
sd_y <- 3
n <- 30
rho <- 0.8
set.seed(2112)
df <- mvtnorm::rmvnorm(
    n = n,
    mean = c(mean_x, mean_y),
    sigma = matrix(c(sd_x^2, rho * (sd_x * sd_y),
                     rho * (sd_x * sd_y), sd_y^2), 2, 2)) |>
    as.data.frame() |>
    dplyr::rename(x = V1, y = V2) |>
    dplyr::mutate(x_deviation = x - mean(x),
                  y_deviation = y - mean(y),
                  cross_product = x_deviation * y_deviation)

The numerator of the covariance equation included the product of the deviation from the mean in the x direction (i.e. \(x_i - \bar{x}\)) and the y direction (i.e. \(y_i - \bar{y}\)). Figure 4.2 depicts the scatter plot with all 30 observations, the mean of x and y as dashed lines, and the deviations for one observation represented by blue arrows.

Show the code
deviation_to_plot <- 23
regression_vis(df,
                plot_x_mean = TRUE,
                plot_y_mean = TRUE,
                plot_positive_cross_products = FALSE,
                plot_negative_cross_products = FALSE,
                plot_x_deviations = deviation_to_plot,
                plot_y_deviations = deviation_to_plot) +
    annotate('text', x = 22.5, y = 46.2,
             label = bquote("x - bar(x)"), parse = TRUE, vjust = 1.5, size = 8) +
    annotate('text', x = 25.0, y = 44,
             label = bquote("y - bar(y)"), parse = TRUE, vjust = 2, size = 8, angle = -90)
Figure 4.2: Scatter plot showing the deviations for x and y.

Hence, for each observation (i) in our data frame the numerator is summing the area of a rectangle (see Figure 4.3 for the one observation in yellow). When we calculate variances we square the deviation which as a result ensures that all the values being summed are positive. However, for covariance this is not the case. The numbers in Figure 4.3 correspond to the four quadrants of the plot. For points that fall in quadrants 1 and 3, the cross products (i.e. \((x_i - \bar{x})(y_i - \bar{y}\)) will be positive. For points that fall in quadrants 2 and 4 however, the cross products will be negative.

Show the code
regression_vis(df,
                plot_x_mean = TRUE,
                plot_y_mean = TRUE,
                plot_positive_cross_products = FALSE,
                plot_negative_cross_products = FALSE,
                plot_cross_product = deviation_to_plot,
                plot_x_deviations = deviation_to_plot,
                plot_y_deviations = deviation_to_plot,
                cross_product_alpha = 0.5) +
    geom_text(label = '1', x = 22.5, y = 44, color = 'blue', size = 16) +
    geom_text(label = '2', x = 17.5, y = 44, color = 'blue', size = 16) +
    geom_text(label = '3', x = 17.5, y = 36, color = 'blue', size = 16) +
    geom_text(label = '4', x = 22.5, y = 36, color = 'blue', size = 16)
Figure 4.3: Scatter plot showing the deviations for x and y with the cross product represented as the area of the yellow rectangle.

Figure 4.4 depicts all the cross products but shades the cross products that are positive in blue and cross products that are negative in red. As a result we can interpret the covariance as the ratio of the area of cross products for points in quadrants 1 and 3 to the area of the cross products for points in quadrants 2 and 4.

Show the code
regression_vis(df,
                plot_x_mean = TRUE,
                plot_y_mean = TRUE,
                plot_positive_cross_products = TRUE,
                plot_negative_cross_products = TRUE)
Figure 4.4: Scatter plot with all cross products. Positive and negative cross products are represented by blue and red rectangles, respectively.

We can see this more clearly if we plot a histogram of cross products (see Figure 4.5).

Show the code
ggplot(df, aes(x = cross_product, fill = cross_product > 0)) +
    geom_histogram(bins = 15, alpha = 0.75) +
    scale_fill_manual('Cross product > 0', values = c('lightblue', 'darkred')) +
    xlab('Cross Product') + 
    theme_vs()
Figure 4.5: Histogram of cross products.

The last part of Equation 4.2 is the denominator where we average across all of the observations. If we wish to calculate the covariance for a population we divide my n, but for sample covariance we divide by n - 1. Substuting Equation 4.2 into Equation 4.1 we get the following:

\[{ r }_{ xy }=\frac { \frac { \sum _{ i=1 }^{ n }{ \left( { X }_{ i }-\overline { X } \right) \left( { Y }_{ i }-\overline { Y } \right) } }{ n-1 } }{ { s }_{ x }{ s }_{ y } } \tag{4.3}\]

The Shiny application allows you to play with all of the features that go into calculating correlation. Clicking on individual points will display the cross product.

This Shiny application can be run locally using the VisualStats::correlation_shiny() function.