Correlation
Correlation is a statistical measure that describes how strongly two random variables are related. For computer scientists, this usually means how related two vectors or time series are.
Correlation Coefficients
Most often the correlation coefficient is denoted as $\rho$ (rho) or $r$ and is a value between -1 and 1, i.e. $\rho \in [-1, 1]$. The simplest form of correlation is a linear correlation, which is a linear relationship between the two variables. The correlation coefficient can then be interpreted in the following way:
- $\rho = 1$ means that the two variables are perfectly positively correlated, i.e. if one variable increases, the other variable increases as well.
- $\rho = -1$ means that the two variables are perfectly negatively correlated, i.e. if one variable increases, the other variable decreases.
- $\rho = 0$ means that the two variables are not linearly correlated at all, i.e. if one variable increases or decreases, the other variable shows no consistent linear trend.
Therefore, the absolute value of the correlation coefficient can be interpreted as the strength of the correlation.
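To make the three cases concrete, the following minimal Python sketch computes the coefficient for made-up data. The helper name `pearson` and the sample values are assumptions chosen for illustration; the helper uses the standard linear (Pearson) form of the coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Linear correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical sample values, not taken from any real data set.
xs = [-2, -1, 0, 1, 2]
rising = [2 * x + 1 for x in xs]    # increases whenever x increases
falling = [-3 * x + 4 for x in xs]  # decreases whenever x increases
parabola = [x * x for x in xs]      # depends on x, but not linearly

print("rising:  ", pearson(xs, rising))
print("falling: ", pearson(xs, falling))
print("parabola:", pearson(xs, parabola))
```

The third case is worth a second look: `parabola` is fully determined by `xs`, yet the coefficient comes out as 0, because the coefficient only captures the linear trend.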
The standard interpretation of the correlation strength is:
- is a strong correlation.
- is a moderate correlation.
- is a weak correlation.
- is no correlation.
A simple example of a linear correlation is the air temperature and the number of ice creams sold. If it is hot, more ice creams are sold; if it is cold, fewer ice creams are sold. This is a linear correlation, and the correlation coefficient is close to 1 because the two variables increase and decrease together.
Non-linear correlations are also possible, but they are more difficult to interpret.
The correlation coefficient only measures whether there is a correlation between the two variables; it does not say anything about cause and effect. For example, if we have a positive correlation between the number of ice creams sold and the number of people who drown, it does not mean that eating ice cream causes people to drown. It is simply that both variables increase during the summer months.
Outliers
Just like regression, the correlation coefficient is sensitive to outliers. Even a single extreme value can change the coefficient substantially, so it is important to check for outliers before calculating the correlation coefficient.
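As a rough illustration of this sensitivity, the sketch below compares the coefficient of a perfectly linear data set with the coefficient of the same data plus one extreme point. The data, the outlier at `(9, -20)` and the helper name `pearson` are all made up for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Linear correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up data that lies exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2 * x for x in xs]

# The same data plus one hypothetical outlier.
xs_outlier = xs + [9]
ys_outlier = ys + [-20]

clean = pearson(xs, ys)
with_outlier = pearson(xs_outlier, ys_outlier)
print(f"without outlier: {clean:.3f}")
print(f"with outlier:    {with_outlier:.3f}")
```

In this constructed case a single point is enough to turn a perfect positive correlation into a slightly negative one.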
Pearson's Correlation Coefficient
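Pearson's correlation coefficient is the most common correlation coefficient and measures the linear correlation between two variables. If we have the two variables $X$ and $Y$, which each have $n$ values, then Pearson's correlation coefficient is defined as:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
This formula can look a bit scary, but it is actually quite simple. Let us start with a few quick reminders. The mean value of a variable $X$ is defined as:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The variance of a variable $X$ is defined as:
$$\sigma_X^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
The standard deviation of a variable $X$ is defined as:
$$\sigma_X = \sqrt{\sigma_X^2}$$
And lastly the covariance between two variables $X$ and $Y$ is defined as:
$$\operatorname{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
The sample versions of the variance and covariance divide by $n - 1$ instead of $n$; as long as the same choice is used for both, the correlation coefficient is unaffected.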
This can be linked to all the other articles about statistics.
Now we can see that the numerator of Pearson's correlation coefficient is the covariance between $X$ and $Y$ times $n$, and the denominator is the product of the standard deviations of $X$ and $Y$ times $n$. The factor $n$ cancels, so Pearson's correlation coefficient is the covariance between $X$ and $Y$ divided by the product of the standard deviations of $X$ and $Y$. We can therefore rewrite the formula as:
$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
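The building blocks above translate almost directly into code. The following is a minimal sketch, assuming the population forms (dividing by $n$) of variance and covariance; the function names and the sample data are made up for illustration.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def variance(values):
    """Population variance (divides by n, an assumption of this sketch)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def std_dev(values):
    """Standard deviation as the square root of the variance."""
    return sqrt(variance(values))


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


# Hypothetical sample data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(f"cov = {covariance(xs, ys):.3f}")
print(f"r   = {pearson(xs, ys):.3f}")
```

Splitting the calculation this way mirrors the rewritten formula: the correlation coefficient is just the covariance rescaled by both standard deviations, which is what forces it into the range from -1 to 1.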
Let's say we have the hypothesis that students with a higher GPA also have a higher SAT score. We can then check if there is a correlation between the two variables. We have the following data:
Student | GPA ($x$) | SAT ($y$) |
---|---|---|
1 | 3.4 | 595 |
2 | 3.2 | 520 |
3 | 3.9 | 715 |
4 | 2.3 | 405 |
5 | 3.9 | 680 |
6 | 2.5 | 490 |
7 | 3.5 | 565 |
If we plot the data we get the following plot, where we can see a clear correlation:
By extending the table slightly we can calculate the Pearson's correlation coefficient pretty quickly.
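Student | GPA ($x$) | SAT ($y$) | $xy$ | $x^2$ | $y^2$ |
---|---|---|---|---|---|
1 | 3.4 | 595 | 2023 | 11.56 | 354025 |
2 | 3.2 | 520 | 1664 | 10.24 | 270400 |
3 | 3.9 | 715 | 2788.5 | 15.21 | 511225 |
4 | 2.3 | 405 | 931.5 | 5.29 | 164025 |
5 | 3.9 | 680 | 2652 | 15.21 | 462400 |
6 | 2.5 | 490 | 1225 | 6.25 | 240100 |
7 | 3.5 | 565 | 1977.5 | 12.25 | 319225 |
Sum | 22.7 | 3970 | 13261.5 | 76.01 | 2321400 |
Now we can calculate Pearson's correlation coefficient. Expanding the sums in the definition gives an equivalent form that only needs the column sums from the table, with $n = 7$:
$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2} \sqrt{n \sum y^2 - \left(\sum y\right)^2}} = \frac{7 \cdot 13261.5 - 22.7 \cdot 3970}{\sqrt{7 \cdot 76.01 - 22.7^2} \sqrt{7 \cdot 2321400 - 3970^2}} = \frac{2711.5}{\sqrt{16.78} \sqrt{488900}} \approx 0.947$$
The value is close to 1, which matches the clear correlation seen in the plot.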
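The table calculation can be double-checked with a few lines of Python. This is a sketch that uses the GPA and SAT values from the table together with the sum-based form of Pearson's formula; the variable names are chosen here for illustration. The result comes out at roughly 0.95.

```python
from math import sqrt

# GPA and SAT values from the table above.
gpa = [3.4, 3.2, 3.9, 2.3, 3.9, 2.5, 3.5]
sat = [595, 520, 715, 405, 680, 490, 565]

n = len(gpa)
sum_x = sum(gpa)
sum_y = sum(sat)
sum_xy = sum(x * y for x, y in zip(gpa, sat))
sum_x2 = sum(x * x for x in gpa)
sum_y2 = sum(y * y for y in sat)

# Sum-based form: only the column sums are needed.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
r = numerator / denominator

print(f"sum xy = {sum_xy:.1f}, sum x^2 = {sum_x2:.2f}, sum y^2 = {sum_y2}")
print(f"r = {r:.3f}")
```

Recomputing the column sums in code is a cheap way to catch arithmetic slips in a hand-built table.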
Spearman's Rank Correlation Coefficient
Spearman's rank correlation coefficient measures the monotonic relationship between two variables. As a reminder, a monotonic function is a function that is either never decreasing or never increasing. The word rank comes into play because we first rank the values of each variable, with the lowest value getting the rank 1 and the highest value getting the rank $n$. We then have $R(X)$ and $R(Y)$, which are the ranked variables.
Spearman's rank correlation coefficient is then calculated just like Pearson's correlation coefficient, but between the two ranked variables. It is often denoted as $r_s$.
If all the ranks are unique (which is the case when neither variable contains duplicate values), then the formula can be simplified to:
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations.
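The ranking step and the simplified formula can be sketched as follows. This is a minimal illustration that assumes there are no ties in the data; the function names and sample values are made up, and the Pearson helper is included only to show that both routes agree.

```python
from math import sqrt


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def pearson(xs, ys):
    """Linear correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def spearman_simplified(xs, ys):
    """Simplified formula based on squared rank differences (no ties)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical data without duplicate values.
xs = [10, 20, 30, 40, 50]
cubes = [1, 8, 27, 64, 125]  # monotonic in xs, but not linear
shuffled = [3, 1, 4, 2, 5]   # only partly follows the order of xs

print("Pearson, cubes:     ", pearson(xs, cubes))
print("Spearman, cubes:    ", spearman_simplified(xs, cubes))
print("Spearman, shuffled: ", spearman_simplified(xs, shuffled))
print("Pearson on ranks:   ", pearson(ranks(xs), ranks(shuffled)))
```

For `cubes` the relationship is monotonic but not linear, so Spearman's coefficient reaches 1 while Pearson's stays below 1. For `shuffled`, the simplified formula and Pearson's coefficient on the ranks give the same value, as expected when there are no ties.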
Correlation Matrix
Isn't this also used in image classification to see how close classes are?