Source: Image of table and graph created by Jonathan Osters
The point of this tutorial is to give you some pause and some cautions about the idea of correlation. Now, correlation is a statistical measure, like mean or standard deviation, but it doesn't tell the entire story. You actually have to graph the data, plot it, in order to fully understand the relationship.
Let me give you an example with these four data sets. All four of them have a mean of x of 9 and a standard deviation of x of 3.32. Their means of y are 7.50 and their standard deviations of y are 2.03, every single one of them. And their correlations are all 0.816, meaning they're all, allegedly, equally strongly linear.
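To check those numbers yourself, here is a minimal sketch. It assumes the four data sets are Anscombe's quartet, which is not named in the tutorial but matches the summary statistics quoted above. The helper names `pearson_r` and `summarize` are made up for this illustration.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Assumption: the four data sets are Anscombe's quartet.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared x for sets I, II, III
datasets = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (mean x, sample SD x, mean y, sample SD y, correlation)."""
    return mean(xs), stdev(xs), mean(ys), stdev(ys), pearson_r(xs, ys)

for name, (xs, ys) in datasets.items():
    mx, sx, my, sy, r = summarize(xs, ys)
    print(f"{name}: mean x={mx:.2f}  sd x={sx:.2f}  "
          f"mean y={my:.2f}  sd y={sy:.2f}  r={r:.4f}")
```

The four rows of output should agree with each other to about two decimal places, even though the four scatter plots look nothing alike.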
But if we look at the four graphs, only one of them is linear in the way that the summary statistics suggest. This one's clearly not linear. It's more of a parabola. These two have influential points. This one's very linear, but it's not linear in the way that the numbers suggest it would be.
One of the big ideas about correlation is that it can be affected strongly by non-linearity or influential points. So we can't just trust that the correlation gives us a strong number and say, oh well, then the x and the y are very strongly linearly related. We have to actually look at the data points on the scatter plot to see if they are, in fact, forming a line like that first one was, or forming a curve, or if they have influential points.
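Here is a rough sketch of how much a single influential point can move the correlation. It assumes the data are the third set of Anscombe's quartet, which the tutorial does not name, and it picks out the influential point simply as the one with the largest y, a shortcut that happens to work for this data set rather than a general method.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Assumption: Anscombe's quartet, data set III.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Shortcut for this data set only: treat the point with the largest y
# as the influential point.
outlier = max(range(len(y)), key=lambda i: y[i])

x_trim = [v for i, v in enumerate(x) if i != outlier]
y_trim = [v for i, v in enumerate(y) if i != outlier]

r_all = pearson_r(x, y)             # correlation with every point included
r_trim = pearson_r(x_trim, y_trim)  # correlation with the one point removed

print(f"all points:          r = {r_all:.4f}")
print(f"without ({x[outlier]}, {y[outlier]}): r = {r_trim:.4f}")
```

With that one point dropped, the remaining ten points fall almost exactly on a line, so the correlation jumps from about 0.816 to very nearly 1. The number alone would never tell you that; the scatter plot does.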
Another thing that can make correlation misleading is what we call inappropriate grouping. So suppose that we have these workers at a particular factory. You have the age of the worker on the x-axis, and salary in thousands of dollars on the y-axis. Now, you would assume that the younger folks would make less than the older folks. But apparently on this scatter plot, that's not really the case. If you look at it, it appears there's a weak negative association. It appears that the older you are, and presumably the longer you've worked there, the less you make. Which, in any job, doesn't really make a whole lot of sense. Typically, longevity is rewarded with higher salaries.
What we can see is there might be a lurking variable behind this. If we look at it closely, we can see that there are, in fact, two groups. One group might have a college degree, and these might all be the younger people in the factory. They might have ascended to higher positions, maybe a foreman rather than someone working on the assembly line. Whereas maybe these older folks don't have a college degree, and maybe have lower paying jobs than the younger folks.
So, you might have something like this. What we can see is that both of these two groups have a strong positive association. The longer you work there, or the older you are, at any rate, the more your salary goes up. However, when viewed as a whole, like we did before, it appeared that the association was negative.
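A small sketch of that reversal, using made-up numbers. The ages and salaries below are hypothetical and are not the values from the scatter plot in the tutorial; they are only chosen so that each group trends upward while the combined data trends downward.

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical (age, salary in thousands of dollars) pairs, invented
# for illustration only.
college = [(25, 50), (28, 57), (31, 61), (34, 69), (37, 74)]    # younger, degree
no_degree = [(45, 30), (48, 37), (51, 41), (54, 49), (57, 54)]  # older, no degree

def group_r(pairs):
    """Correlation between age and salary for a list of (age, salary) pairs."""
    ages, salaries = zip(*pairs)
    return pearson_r(ages, salaries)

r_college = group_r(college)        # within the degree group
r_no_degree = group_r(no_degree)    # within the no-degree group
r_all = group_r(college + no_degree)  # both groups lumped together

print(f"college degree only: r = {r_college:.3f}")
print(f"no degree only:      r = {r_no_degree:.3f}")
print(f"everyone combined:   r = {r_all:.3f}")
```

Within each group the correlation is strongly positive, but the combined correlation comes out negative, because the degree variable shifts one whole group up and to the left of the other.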
And so to recap, correlation is a useful measure. I'm not trying to discount its effectiveness. However, like any statistical measurement, it doesn't tell the entire story. You have to graph your data, because correlation can be affected by influential points, non-linearity, and inappropriate grouping.
Inappropriate grouping is when you get a weakened, or even a reversed, association because you lumped together subgroups that should have been kept separate. So in the previous example, when we looked at all the workers as one group, it appeared that there was a negative association. Whereas when we separated the workers by degree, we found that there was a positive association within each group. And that was an example of inappropriately combining the two data sets, the workers with college degrees and the workers without.
So we talked about inappropriate grouping. We also talked about the things that can affect correlation. So always, always graph your data. Good luck, and we'll see you next time.
Inappropriate Grouping: Combining subgroups that should not be combined, resulting in a weakened, or even reversed, association.