Cautions about Correlation
Common Core: S.ID.9

Author: Ryan Backman

Identify non-linearity, influential point, or inappropriate grouping in a data set.

Source: Images of graphs created by Ryan Backman

Video Transcription

Hi. This tutorial covers cautions about correlation. So let's take a look at a situation here and see if we can figure out where we might need to be a little cautious. All right, so 30 people of age 18 were selected for a study. Their heights were measured.

On a closed driving course, they were asked to drive a car as fast as they wanted on an open stretch of a road. Their top speed was recorded. The following scatter plot shows the data. So if we take a look at the data that was collected here, notice, on the x-axis, we have height. On the y-axis, we have fastest speed driven. So height is measured in inches. Fastest speed driven is measured in miles per hour.

So what we can see here is that, first of all, we have an r value-- a correlation coefficient-- of about 0.72. So if we interpret that value, that would say that there's a fairly strong correlation between height and fastest speed driven. And we can see that it does have this positive association, and we can see that it does appear to be fairly strong.

We might be tempted to conclude that the taller the person is, the faster they're willing to drive. And as the name of this tutorial suggests, we want to make sure that we're a little bit cautious when we're interpreting this correlation. And here's why. What this data really consisted of was two clusters. So we have a cluster of points here and we have a cluster of points here.

And the way that those points are clustered is basically, these are the males, these are the females. So now, what I've done is reproduced the individual scatter plots of males and females down here on their own graphs. I've also calculated r values for each of them. So if we look at just the females, the females had an r value of 0.075, which is a very weak positive association.

So we really have very, very little association between height and fastest speed driven here. Now, if we look at the males, the males also had very little association between height and fastest speed driven. So in this case, r was 0.066. So basically, what's going on here is that males are generally a lot taller than females at that specific age, and males at age 18 are also more likely to drive fast-- an 18-year-old male generally is more of a risk taker than a female at that age-- so we can see that males were willing to drive a lot faster.

So when we grouped these two populations together, it appeared that there was a positive association because of the differences between the two groups: the males were, on average, both taller and faster. But if we look at them separately, we can see that there's really no association between height and fastest speed driven within each population. So in that situation, we had some inappropriate grouping.
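The grouping effect described above can be reproduced with a small sketch. The heights and speeds below are made up for illustration and are not the tutorial's measurements; `pearson_r` is a hypothetical helper that computes r from its usual definition.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data (inches, miles per hour), invented for this sketch.
# Within each group, speed has almost no relationship to height.
female_heights = [61, 62, 63, 64, 65, 66, 67]
female_speeds = [72, 78, 68, 76, 70, 82, 72]
# The male group is shifted: taller on average and faster on average.
male_heights = [68, 69, 70, 71, 72, 73, 74]
male_speeds = [97, 103, 93, 101, 95, 107, 97]

r_female = pearson_r(female_heights, female_speeds)
r_male = pearson_r(male_heights, male_speeds)
# Pooling the two groups into one data set, as in the first scatter plot.
r_pooled = pearson_r(female_heights + male_heights,
                     female_speeds + male_speeds)

print(f"females only: r = {r_female:.2f}")
print(f"males only:   r = {r_male:.2f}")
print(f"pooled:       r = {r_pooled:.2f}")
```

With these invented numbers, each group on its own gives an r of roughly 0.16, while the pooled data gives roughly 0.84. The pooled value is driven almost entirely by the gap between the two group averages, not by any relationship inside either group.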

And inappropriate grouping is the act of combining two subgroups of bivariate data so that the combined data has a much different correlation than the separate data sets. And that's what we saw. We had very little correlation within each data set when they were separate, but when they were together, there was a pretty high correlation.
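One way to see why this can happen is the law of total covariance, a standard identity that the tutorial itself does not state. The symbols are labels introduced here for illustration: X is height, Y is fastest speed driven, and G is the subgroup (male or female).

```latex
\operatorname{Cov}(X, Y)
  = \underbrace{\mathbb{E}\big[\operatorname{Cov}(X, Y \mid G)\big]}_{\text{within groups}}
  + \underbrace{\operatorname{Cov}\big(\mathbb{E}[X \mid G],\ \mathbb{E}[Y \mid G]\big)}_{\text{between groups}}
```

In the driving example the within-group term is close to zero, since height and speed are nearly unrelated among the males alone and among the females alone. The between-group term is large and positive, because the group with the higher average height also has the higher average speed. The combined r reflects mostly that second term.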

So caution should be used when interpreting correlation. Sometimes r does not provide a complete summary. It's important to look at the scatter plot of the data in addition to calculating r. So instead of just looking at that r value and saying that there's a strong positive association, you could use what you know about the situation, and maybe notice those two distinct clusters, to avoid drawing that conclusion. Maybe that mistake could have been avoided if you were able to do a little research ahead of time.

So that is one way that inappropriate grouping can produce results that really aren't meaningful. So that has been the tutorial on cautions about correlation. Thanks for watching.

Terms to Know
Inappropriate Grouping

Combining subgroups that should not be combined, resulting in a misleading association that may be inflated, weakened, or even reversed.