Source: Images of graphs created by Ryan Backman
Hi. This tutorial covers cautions about correlation. So let's take a look at a situation here and see if we can figure out where we might need to be a little cautious. All right, so 30 people of age 18 were selected for a study. Their heights were measured.
On a closed driving course, they were asked to drive a car as fast as they wanted on an open stretch of a road. Their top speed was recorded. The following scatter plot shows the data. So if we take a look at the data that was collected here, notice, on the x-axis, we have height. On the y-axis, we have fastest speed driven. So height is measured in inches. Fastest speed driven is measured in miles per hour.
So what we can see here is that, first of all, we have an r value, a correlation coefficient, of about 0.72. So if we interpret that value, that would say that there's a fairly strong positive correlation between height and fastest speed driven. And we can see on the scatter plot that it does have this positive association, and that it does appear to be fairly strong.
We might be tempted to conclude that the taller the person is, the faster they're willing to drive. But as the name of this tutorial suggests, we want to make sure that we're a little bit cautious when we're interpreting this correlation. And here's why. This data really consisted of two clusters. So we have a cluster of points here and we have a cluster of points here.
And the way that those points are clustered is basically, these are the males, these are the females. So now, what I've done is reproduced the individual scatter plots of males and females down here on their own graphs. I've also calculated r values for each of them. So if we look at just the females, the females had an r value of 0.075, which is a very weak positive association.
So we really have very, very little association between height and fastest speed driven here. Now, if we look at the males, the males also had very little association between height and fastest speed driven. So in this case, r was 0.066. So basically, what we're looking at here is two things. Males are generally taller than females at that specific age. And males at age 18 are more likely to drive fast, since an 18-year-old male generally is more of a risk taker than a female at that age. So we can see that the males were both taller and willing to drive a lot faster.
So when we grouped these two populations together, it appeared that there was a positive association because of the differences between the two groups: one group was both taller and faster than the other. But if we look at them separately, we can see that there's really no association between height and fastest speed driven within each population. So in that situation, we had some inappropriate grouping.
And inappropriate grouping is the act of combining two subgroups of bivariate data so that the combined data has a much different correlation than the separate data sets. And that's what we saw. We had very little correlation within each of the two data sets when they were separate, but when they were together, there was a pretty high correlation.
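To see how this can happen with numbers, here is a minimal sketch in Python using NumPy. The heights and speeds are made up for illustration and are not the data from the study; the group centers (64 and 70 inches, 70 and 95 miles per hour) and the helper name `pearson_r` are assumptions. The values are chosen so that each group on its own has essentially zero correlation, while the combined data has a high one.

```python
import numpy as np

# Hypothetical numbers for illustration only; NOT the data from the study.
# Within each group, the speed offsets are arranged so that they have
# zero covariance with the height offsets.
height_offsets = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
speed_offsets = np.array([1.0, -2.0, 2.0, -2.0, 1.0])

# Assumed group centers: one group is both shorter and slower on average.
female_height = 64.0 + height_offsets   # inches
female_speed = 70.0 + speed_offsets     # miles per hour
male_height = 70.0 + height_offsets
male_speed = 95.0 + speed_offsets


def pearson_r(x, y):
    """Return the correlation coefficient r between x and y."""
    return float(np.corrcoef(x, y)[0, 1])


r_female = pearson_r(female_height, female_speed)
r_male = pearson_r(male_height, male_speed)

# Combine the two subgroups into one data set, as in the first scatter plot.
all_height = np.concatenate([female_height, male_height])
all_speed = np.concatenate([female_speed, male_speed])
r_combined = pearson_r(all_height, all_speed)

print(f"females only: r = {r_female:.3f}")
print(f"males only:   r = {r_male:.3f}")
print(f"combined:     r = {r_combined:.3f}")
```

The combined r comes out high here only because the two clusters sit in different places on the plot, not because height and speed move together inside either cluster.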
So caution should be used when interpreting correlation. Sometimes r does not provide a complete summary. It's important to look at the scatter plot of the data in addition to calculating r. So instead of just looking at that r value and saying that there's a strong positive association, you might use what you know about the situation, and notice those two distinct clusters on the scatter plot. That mistaken conclusion of a strong correlation could have been avoided with a little research ahead of time.
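If you know which subgroup each point belongs to, one way to do that check alongside the scatter plot is to calculate r for the combined data and for each subgroup separately. This is only a sketch: the function name `r_by_group` and the small data set are hypothetical, and it assumes every subgroup has at least two points with some spread in both variables.

```python
import numpy as np


def r_by_group(x, y, labels):
    """Return (combined r, {group label: r within that group})."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)

    combined = float(np.corrcoef(x, y)[0, 1])
    per_group = {}
    for group in np.unique(labels):
        mask = labels == group  # select only the points in this subgroup
        per_group[str(group)] = float(np.corrcoef(x[mask], y[mask])[0, 1])
    return combined, per_group


# Made-up data with two clusters, labeled "a" and "b".
x = [1, 2, 3, 11, 12, 13]
y = [2, 1, 2, 9, 8, 9]
labels = ["a", "a", "a", "b", "b", "b"]

combined_r, group_r = r_by_group(x, y, labels)
print("combined r:", round(combined_r, 3))
for name, value in group_r.items():
    print(f"group {name} r:", round(value, 3))
```

A combined r that is much larger than every within-group r is a sign that the grouping, rather than the relationship between the two variables, may be producing the correlation.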
So that is one way that inappropriate grouping can give you results that look strong but really aren't meaningful. So that has been the tutorial on cautions about correlation. Thanks for watching.