The point of this tutorial is to give you some pause and caution about the idea of correlation. Specifically, you will focus on how non-linearity, influential points, and inappropriate grouping can make a correlation misleading.
Correlation is a statistical measure, like mean or standard deviation, but it doesn't tell the entire story. You have to actually look at the data. You have to graph it, plot it, in order to fully understand the relationship.
Here are four data sets.
All four of these data sets have a standard deviation of x of 3.32, and they all have a mean of x of 9. Their means of y are 7.50 and their standard deviations of y are 2.03, every single one of them. Their correlations are all 0.816, meaning they are all, allegedly, equally strongly linear.
But if we look at the four graphs, only one of them is linear in the way that the summary statistics suggest.
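The summary statistics quoted above match the well-known Anscombe's quartet. Assuming that is the data behind the four graphs (the tutorial does not name it), a short Python sketch can check that the numbers really do agree across all four sets. The `pearson` helper is an illustrative name, not something from the tutorial.

```python
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Assumption: the four data sets are Anscombe's quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II, III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# For each set: mean of x, sample SD of x, mean of y, sample SD of y, correlation.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(xs), stdev(xs), mean(ys), stdev(ys), pearson(xs, ys))
    print(name, [round(v, 3) for v in summary[name]])
```

Under that assumption, every row should come out close to 9, 3.32, 7.50, 2.03, and 0.816, even though the scatterplots look nothing alike. The numbers alone cannot tell the four data sets apart.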
One of the big ideas about correlation is that it can be affected strongly by non-linearity or influential points.
You should not simply trust that a strong correlation means the x and the y are very strongly linearly related. You have to actually look at the data points on the scatterplot to see if they are forming a line, like that first one was, or forming a curve, or if they have influential points.
Another thing about correlation that can be misleading is that it can also be affected by what we call inappropriate grouping.
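To see how much a single influential point can move the correlation, here is a minimal Python sketch. The numbers are made up for illustration and do not come from the graphs in this tutorial, and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: five points with only a weak relationship.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 3, 2, 4]
r_cluster = pearson(xs, ys)

# The same five points plus one hypothetical point far away from the rest.
r_with_point = pearson(xs + [20], ys + [20])

print(f"without the far point: r = {r_cluster:.3f}")
print(f"with the far point:    r = {r_with_point:.3f}")
```

With these invented values, the five clustered points give a weak correlation, and adding the one distant point pushes it close to 1. Only the scatterplot shows that a single point is responsible.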
Inappropriate Grouping
Combining together subgroups that should not be combined, resulting in a weakened, or even reversed, association.
Look at these workers at a particular factory.
You have the age of the worker on the x-axis, and salary in thousands of dollars on the y-axis. You would assume that the younger folks would make less than the older folks. Apparently on this scatter plot, that's not really the case.
It appears there's a weak negative association. It appears that the older you are, the less you make, which doesn't really make a whole lot of sense. Typically, longevity is rewarded with higher salaries.
There might be a lurking variable behind this, where, if you look at it closely, you can see that there are two groups.
One of the groups has a college degree. These might all be the younger people in the factory. They might have ascended to higher positions, maybe a foreman rather than someone working on the assembly line. Maybe the older folks don't have a college degree, and maybe they have lower-paying jobs than the younger folks.
So, you might have something like this.
Both of these groups have a strong positive association. Within each group, the longer you work there, or the older you are at any rate, the higher your salary. However, when the data were viewed as a whole, as they were before, the association appeared to be negative.
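The reversal described here can be reproduced with a small Python sketch. The ages and salaries are hypothetical, chosen only to mimic the pattern in the factory example, and `pearson` is an illustrative helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical workers: age in years, salary in thousands of dollars.
degree_age = [25, 30, 35, 40]        # younger workers with a college degree
degree_salary = [50, 56, 59, 65]
no_degree_age = [45, 50, 55, 60]     # older workers without a degree
no_degree_salary = [40, 46, 49, 55]

# Correlation within each subgroup.
r_degree = pearson(degree_age, degree_salary)
r_no_degree = pearson(no_degree_age, no_degree_salary)

# Correlation when the two subgroups are lumped together.
r_combined = pearson(degree_age + no_degree_age,
                     degree_salary + no_degree_salary)

print(f"degree only:    r = {r_degree:.3f}")
print(f"no degree only: r = {r_no_degree:.3f}")
print(f"combined:       r = {r_combined:.3f}")
```

With these made-up numbers, each group on its own shows a strong positive correlation, while the combined data shows a weak negative one.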
Correlation is a useful measure. However, like any statistical measurement, it doesn't tell the entire story. You have to graph your data, because correlation can be affected by influential points, non-linearity, and inappropriate grouping.
Inappropriate grouping is when combining subgroups gives you a weakened, or even reversed, association compared with looking at each subgroup separately. In the previous example, when the data were not separated into groups, it appeared that there was a negative association. When the data were separated into groups, you found that there was a positive association within each one. That was an example of inappropriately combining two groups: workers with college degrees and workers without.
Good luck.
Source: This work adapted from Sophia Author Jonathan Osters.