
Correlation

Correlation is a statistical measure that describes how related two random variables are. For computer scientists, this usually means how related two vectors or time series are.

Correlation Coefficients

Most often the correlation coefficient is denoted $\rho$ (rho) or $r$ and is a value between -1 and 1, i.e. $-1 \leq r \leq 1$. The simplest form of correlation is linear correlation, i.e. a linear relationship between the two variables. The correlation coefficient can then be interpreted in the following way:

  • $r = 1$ means that the two variables are perfectly positively correlated, i.e. if one variable increases, the other variable increases as well.
  • $r = -1$ means that the two variables are perfectly negatively correlated, i.e. if one variable increases, the other variable decreases.
  • $r = 0$ means that there is no linear correlation between the two variables, i.e. knowing that one variable increases or decreases tells us nothing (linearly) about the other.

Therefore, the absolute value of the correlation coefficient can be interpreted as the strength of the correlation.

The standard interpretation of the correlation strength is:

  • $0.8 \leq |r| \leq 1$ is a strong correlation.
  • $0.5 \leq |r| < 0.8$ is a moderate correlation.
  • $0.1 \leq |r| < 0.5$ is a weak correlation.
  • $0 \leq |r| < 0.1$ is no correlation.
Visualizations of linear correlations, positive and negative.
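
As a minimal sketch of how this interpretation might be applied in code (assuming NumPy is available; the helper `correlation_strength` is hypothetical), the snippet below computes $r$ with `np.corrcoef` and maps $|r|$ to the strength labels above:

```python
import numpy as np

def correlation_strength(x, y):
    """Compute Pearson's r and label |r| using the thresholds above."""
    r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
    a = abs(r)
    if a >= 0.8:
        label = "strong"
    elif a >= 0.5:
        label = "moderate"
    elif a >= 0.1:
        label = "weak"
    else:
        label = "none"
    return r, label

# Example: two vectors that increase together
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
print(correlation_strength(x, y))  # r close to 1 -> "strong"
```
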
Example

A simple example of a linear correlation is the relationship between air temperature and the number of ice creams sold. If it is hot, more ice creams are sold; if it is cold, fewer ice creams are sold. This is a linear correlation, and the correlation coefficient is close to 1 because the two variables increase and decrease together.

Non-linear correlations are also possible, but they are more difficult to interpret.

Visualizations of linear and non-linear correlations. The top-left value is the linear correlation coefficient, the top-right value is the non-linear correlation coefficient.
Warning

The correlation coefficient only measures whether there is a correlation between the two variables; it does not say anything about cause and effect. For example, if we have a positive correlation between the number of ice creams sold and the number of people who drown, it does not mean that eating ice cream causes people to drown. It is simply that both variables increase during the summer months.

Outliers

The correlation coefficient is, just like regression, sensitive to outliers. If there are outliers in the data, the correlation coefficient will be affected by them, so it is important to check for outliers before calculating it.

An example of an outlier affecting the correlation coefficient.
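
A small demonstration of this sensitivity, as a sketch assuming NumPy (the data here is synthetic): a single extreme point is enough to drag $r$ well away from 1.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(0, 1, size=x.size)  # clean linear relationship

r_clean = np.corrcoef(x, y)[0, 1]

# Inject a single extreme outlier and recompute
x_out = np.append(x, 10.0)
y_out = np.append(y, -100.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier: {r_clean:.3f}")    # close to 1
print(f"r with one outlier: {r_outlier:.3f}") # noticeably lower
```
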

Pearson's Correlation Coefficient

Pearson's correlation coefficient is the most common correlation coefficient and measures the linear correlation between two variables. If we have two variables $X$ and $Y$, each with $n$ values, then Pearson's correlation coefficient is defined as:

$$
\begin{align*}
r_{X,Y} &= \frac{n \cdot \sum_{i=1}^{n}{x_i y_i} - \sum_{i=1}^{n}{x_i} \cdot \sum_{i=1}^{n}{y_i}}{\sqrt{n \cdot \sum_{i=1}^{n}{x_i^2} - \left(\sum_{i=1}^{n}{x_i}\right)^2} \cdot \sqrt{n \cdot \sum_{i=1}^{n}{y_i^2} - \left(\sum_{i=1}^{n}{y_i}\right)^2}} \\
&= \frac{\sum_{i=1}^{n}{(x_i - \bar{x}) \cdot (y_i - \bar{y})}}{\sqrt{\sum_{i=1}^{n}{(x_i - \bar{x})^2}} \cdot \sqrt{\sum_{i=1}^{n}{(y_i - \bar{y})^2}}}
\end{align*}
$$

This formula can look a bit scary, but it is actually quite simple. Let us start with a few quick reminders. The mean value of a variable $X$ is defined as:

$$
E(X) = \mu_X = \bar{x} = \frac{1}{n} \cdot \sum_{i=1}^{n}{x_i}
$$

The variance of a variable $X$ is defined as:

$$
\begin{align*}
Var(X) &= E(X^2) - E(X)^2 \\
&= \frac{1}{n} \cdot \sum_{i=1}^{n}{x_i^2} - \left(\frac{1}{n} \cdot \sum_{i=1}^{n}{x_i}\right)^2
\end{align*}
$$

The standard deviation of a variable $X$ is defined as:

$$
\sigma(X) = \sqrt{Var(X)}
$$

And lastly, the covariance between two variables $X$ and $Y$ is defined as:

$$
\begin{align*}
\text{cov}(X,Y) &= E((X - \mu_X) \cdot (Y - \mu_Y)) \\
&= \frac{1}{n} \cdot \sum_{i=1}^{n}{(x_i - \bar{x}) \cdot (y_i - \bar{y})}
\end{align*}
$$
Todo

This can be linked to all the other articles about statistics.
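
Translating these definitions one-to-one into code, a minimal sketch assuming NumPy (these are the population versions, dividing by $n$, exactly as defined above):

```python
import numpy as np

def mean(x):
    """E(X) = (1/n) * sum(x_i)"""
    return np.sum(x) / len(x)

def var(x):
    """Var(X) = E(X^2) - E(X)^2 (population version, divides by n)"""
    return mean(x**2) - mean(x)**2

def std(x):
    """sigma(X) = sqrt(Var(X))"""
    return np.sqrt(var(x))

def cov(x, y):
    """cov(X, Y) = E((X - mu_X) * (Y - mu_Y))"""
    return mean((x - mean(x)) * (y - mean(y)))
```

Note that NumPy's own `np.var` and `np.std` divide by $n$ by default, while `np.cov` divides by $n - 1$ unless `bias=True` is passed, which is why the helpers above compute everything from scratch.
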

Now we can see that the numerator of Pearson's correlation coefficient is the covariance between $X$ and $Y$ times $n$, and the denominator is the product of the standard deviations of $X$ and $Y$ times $n$. The factors of $n$ cancel, so Pearson's correlation coefficient is simply the covariance between $X$ and $Y$ divided by the product of their standard deviations. We can therefore rewrite the formula as:

$$
r_{X,Y} = \frac{\text{cov}(X,Y)}{\sigma_X \cdot \sigma_Y}
$$
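
Putting the pieces together, a minimal sketch (assuming NumPy) of this final form, checked against the built-in `np.corrcoef`:

```python
import numpy as np

def pearson_r(x, y):
    """r = cov(X, Y) / (sigma_X * sigma_Y), population (1/n) versions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c = np.mean((x - x.mean()) * (y - y.mean()))  # cov(X, Y)
    return c / (x.std() * y.std())                # np.std divides by n by default

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
print(pearson_r(x, y))           # ~0.999
print(np.corrcoef(x, y)[0, 1])   # built-in, same result
```
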
Example

Let's say we have the hypothesis that students with a higher GPA also have a higher SAT score. We can then check if there is a correlation between the two variables. We have the following data:

| Student | GPA ($X$) | SAT ($Y$) |
| --- | --- | --- |
| 1 | 3.4 | 595 |
| 2 | 3.2 | 520 |
| 3 | 3.9 | 715 |
| 4 | 2.3 | 405 |
| 5 | 3.9 | 680 |
| 6 | 2.5 | 490 |
| 7 | 3.5 | 565 |

If we plot the data, we get the following plot, where we can see a clear correlation:

By extending the table slightly, we can calculate Pearson's correlation coefficient pretty quickly. Note that the values in the $x_i y_i$ column are rounded to the nearest integer.

| Student | GPA ($X$) | SAT ($Y$) | $x_i y_i$ | $x_i^2$ | $y_i^2$ |
| --- | --- | --- | --- | --- | --- |
| 1 | 3.4 | 595 | 2023 | 11.56 | 354025 |
| 2 | 3.2 | 520 | 1664 | 10.24 | 270400 |
| 3 | 3.9 | 715 | 2789 | 15.21 | 511225 |
| 4 | 2.3 | 405 | 932 | 5.29 | 164025 |
| 5 | 3.9 | 680 | 2652 | 15.21 | 462400 |
| 6 | 2.5 | 490 | 1225 | 6.25 | 240100 |
| 7 | 3.5 | 565 | 1978 | 12.25 | 319225 |
| Sum | 22.7 | 3970 | 13262 | 76.01 | 2322400 |

Now we can calculate Pearson's correlation coefficient:

$$
\begin{align*}
r_{X,Y} &= \frac{7 \cdot 13262 - 22.7 \cdot 3970}{\sqrt{7 \cdot 76.01 - 22.7^2} \cdot \sqrt{7 \cdot 2322400 - 3970^2}} \\
&= \frac{2715}{2884.65} \approx 0.94
\end{align*}
$$
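
As a sanity check, a short sketch (assuming NumPy) that reproduces the table sums and the final coefficient:

```python
import numpy as np

gpa = np.array([3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5])
sat = np.array([595, 520, 715, 405, 680, 490, 565], dtype=float)
n = len(gpa)

# The column sums from the table above
print(gpa.sum(), sat.sum())            # 22.7, 3970.0
print((gpa * sat).sum())               # 13261.5 (the table rounds to 13262)
print((gpa**2).sum(), (sat**2).sum())  # 76.01, 2322400.0

# Raw-sum form of Pearson's formula
num = n * (gpa * sat).sum() - gpa.sum() * sat.sum()
den = np.sqrt(n * (gpa**2).sum() - gpa.sum()**2) * \
      np.sqrt(n * (sat**2).sum() - sat.sum()**2)
print(num / den)                       # ~0.94
print(np.corrcoef(gpa, sat)[0, 1])     # agrees
```
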

Spearman's Rank Correlation Coefficient

Spearman's rank correlation coefficient measures the monotonic relationship between two variables. As a reminder, a monotonic function is a function that is either entirely non-decreasing or entirely non-increasing. The word rank comes into play because we first rank the values of each variable, with the lowest value getting rank 1 and the highest value getting rank $n$. We then have $R(X)$ and $R(Y)$, the ranked variables.

Spearman's rank correlation coefficient is then calculated just like Pearson's correlation coefficient, only between the two ranked variables. It is often denoted $r_s$:

$$
r_s = \frac{\text{cov}(R(X),R(Y))}{\sigma_{R(X)} \cdot \sigma_{R(Y)}}
$$

If all the ranks are distinct (which is the case exactly when the data contains no duplicate values), the formula can be simplified to:

$$
r_s = 1 - \frac{6 \cdot \sum_{i=1}^{n}{(R(X_i) - R(Y_i))^2}}{n \cdot (n^2 - 1)} = 1 - \frac{6 \cdot \sum_{i=1}^{n}{d_i^2}}{n \cdot (n^2 - 1)}
$$

where $d_i = R(X_i) - R(Y_i)$ is the difference between the two ranks of each observation.
Visualizations of Spearman's rank correlation coefficient compared to Pearson's correlation coefficient.
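
A minimal sketch (assuming NumPy and SciPy are available) that computes $r_s$ both ways, via ranking followed by Pearson and via the simplified $d_i$ formula, and compares against `scipy.stats.spearmanr`. The data is monotonic but non-linear, so $r_s = 1$ while Pearson's $r$ stays below 1:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

# Monotonic but clearly non-linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x**3

# Definition: Pearson's r between the ranked variables
rx, ry = rankdata(x), rankdata(y)
r_s = np.corrcoef(rx, ry)[0, 1]

# Simplified formula, valid here because all ranks are distinct
d = rx - ry
n = len(x)
r_s_simple = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(r_s, r_s_simple)             # both 1.0
print(spearmanr(x, y)[0])          # SciPy agrees: 1.0
print(np.corrcoef(x, y)[0, 1])     # Pearson's r is lower (~0.94)
```
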

Correlation Matrix

Isn't this also used in image classification to see how close classes are?

Coefficient of determination - $r^2$

Cross Correlation