Source: All graphs created by Dan Laub; Image of car crash, PD, http://bit.ly/1Ph4jQx; Image of storm, PD, http://bit.ly/1kdt5Wo; Image of plant, PD, http://bit.ly/1OHffEb; Image of weightlifter, PD, http://bit.ly/1OHfkb2; Image of bucket, PD, http://bit.ly/1NLLqlm; Image of burger, PD, http://bit.ly/1Yxbfvk; Image of woman, PD, http://bit.ly/1I9lLH4; Image of money bag, PD, http://bit.ly/1RSaTQ3
Hi. Dan Laub here, and in this lesson we want to discuss the cautions to keep in mind when it comes to correlation and causation. Before we do so, let's talk about the objectives for this lesson. The first objective is to know the limitations of the correlation coefficient. The second is to understand the difference between variables that are correlated and those that are causally related. And so let's get started.
Recall from previous lessons that the correlation coefficient is a measurement that describes how strongly two variables are related, in terms of how changes in one variable are reflected in changes in the other. However, a correlation between two variables does not mean that a change in one variable causes changes in the other. Two variables are causally related if the change in one of them is responsible for the change in the other.
But correlation does not imply causation, and the correlation coefficient indicates whether two quantities are associated, but not necessarily that they are causally linked. So let's look at a brief example here. The scatter plot you see in front of you illustrates the relationship between how much rainfall a particular location receives on one day, and the number of car accidents that occur on the same day.
Now, can we tell if there's a correlation that exists here? Well, there is. The way we can tell is by looking at the scatter plot here, and you can notice that, while the data points are scattered quite a bit, there is a positive association, meaning the more it rains, the higher the number of car accidents. In a case like this, the correlation coefficient is equal to 0.684, which tells us that there's a positive association between the two.
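To make the idea of reading a correlation coefficient off of data a little more concrete, here is a minimal sketch of how that number could be computed by hand. The rainfall and accident values below are made up purely for illustration; they are not the data behind the scatter plot in this lesson, so the result will not be 0.684. The helper name `pearson_r` is likewise just an assumption for this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unscaled covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unscaled variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily observations, NOT the lesson's actual data.
rainfall_inches = [0.0, 0.1, 0.3, 0.5, 0.8, 1.0, 1.2, 1.5]
car_accidents = [12, 15, 11, 18, 16, 22, 19, 25]

r = pearson_r(rainfall_inches, car_accidents)
print(f"r = {r:.3f}")  # a positive value means a positive association
```

A value above zero here only says that rainier days tended to have more accidents in this made-up sample. Nothing in the calculation says why.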
However, can we necessarily say that rain causes car accidents? Well, not necessarily. Even though it may contribute to them, we don't necessarily know that there is a causal relationship here. It is important to recognize whether a relationship between variables is merely correlational or actually causal. Being able to know the difference helps us draw accurate conclusions about a data set.
And so another example we could look at would be the case of using fertilizer, and how quickly plants grow. So in the case of fertilizing plants, we would expect the use of fertilizer would influence the growth rate. So the variables are causally dependent. If the data were shown in a scatter plot, there would be a clear trend illustrating a positive association between the use of fertilizer, and how fast the plants grow.
Another example might be the frequency of weight lifting, and the amount of muscle mass that a particular person adds. Once again, we would expect there to be some causality that would take place there, meaning, well, how do you build muscle mass? Lifting weights is a great example in terms of how you would do that. And so we would expect a positive correlation, and if we were to look at a scatter plot, we would pretty much see a clear trend illustrating the positive association between the two.
Another example: say we are in a hospital, and we are interested in tracking the presence of germs on a particular surface, say an operating room table. Well, obviously the more often it is cleaned, the smaller the presence of germs we're likely going to see, and we would expect, once again, there'd be a strong correlation between how often it was cleaned and how many germs were identified on that particular surface.
And we'd probably expect there to be a causal relationship there as well. One example that might not necessarily imply causality would be how often someone exercises and the amount of fast food that they consume. An analysis of how often a person exercises and how much money they spend on fast food would find a clear connection between the two quantities.
However, the quantities are not causally related, meaning that a change in one cannot be directly linked to the other one. If the data were shown in a scatter plot, there would be a clear trend, which would indicate that the variables are correlated. But in a case like this, it is likely that a third variable is responsible for the observed correlation.
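One way to see how a third variable can produce a correlation on its own is with a small simulation. The sketch below is only an illustration under stated assumptions: the variable names (`health_concern`, `exercise_hours`, `fast_food_spend`), the numbers, and the helper functions are all hypothetical and not part of this lesson's data. By construction, exercise and fast food spending never influence each other; both depend only on the hidden third variable plus random noise.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the sketch is repeatable
n = 500

# Hypothetical hidden third variable (assumed, for illustration only).
health_concern = [random.gauss(0, 1) for _ in range(n)]

# Each quantity depends ONLY on the third variable plus noise;
# neither one appears in the other's formula.
exercise_hours = [5 + 2 * h + random.gauss(0, 1) for h in health_concern]
fast_food_spend = [40 - 10 * h + random.gauss(0, 5) for h in health_concern]

# Correlation as it would look if the third variable were ignored.
r_observed = pearson_r(exercise_hours, fast_food_spend)

# Correlation after the third variable's influence is removed from both.
r_adjusted = pearson_r(residuals(exercise_hours, health_concern),
                       residuals(fast_food_spend, health_concern))

print(f"ignoring the third variable:      r = {r_observed:.2f}")
print(f"accounting for the third variable: r = {r_adjusted:.2f}")
```

In this setup the first number comes out strongly negative while the second sits close to zero, which is what we would expect when the association is carried entirely by the third variable.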
This third variable, or in this case, in all likelihood, the level of one's concern for their health, would be considered an extraneous variable, because it was not accounted for in the initial analysis. Let's take a look at an example between two variables that are probably not going to be causally related. And what I'm looking at here is one's shoe size and their IQ score.
So one would think that those with larger feet probably aren't going to have higher IQs, nor would those with smaller feet have higher IQs. However, in this particular instance of 40 random observations, hypothetically we show here a very strong negative correlation.
And as you can see in the scatterplot, there's a downward sloping line, and the points are clustered relatively close together. As you see with the correlation coefficient here, it's close to negative 1, which tells us there's a strong negative correlation. Now, correlation is one thing, causality is something else. And in a case like this we would hardly expect there to be a strong correlation between one's shoe size and their IQ score.
And what we might have here would be an extraneous variable, where there would be a third variable that's going to be responsible for such a correlation. And in a case like this, maybe it's simply just somebody's innate intelligence, the intelligence they're born with. Or it might be the testing conditions under which they took the IQ test. And that would be a good illustration of an extraneous variable that would cause one's IQ score to be affected, not necessarily due to their shoe size.
As another example, let's look at something in which we figured there would be some possible causation going on. In a situation like this, we are looking at an individual's age and their income. Now, we would expect that the older somebody is, the more money they're likely going to be earning. Is it necessarily their age that allows that to happen, or are there other factors at play here?
And so it's very possible that there are some other extraneous variables involved in this correlation, and we might see that something like education, work experience, or the amount of effort one puts into their job determines how much they're going to earn in terms of income. And so just because one is older doesn't necessarily mean they have additional education.
Now, they are, in all likelihood, going to have a higher level of work experience, which might be the extraneous variable in a case like this. And so you see the scatterplot here in front of you. If we pick some hypothetical people, and we look at their incomes and at how old they are in years, we'll see here a very, very straight looking line that moves upward.
And what that tells us is that we have a positive correlation that exists between somebody's age and the income they earn in terms of thousands of dollars per year. And in this case, notice how strong that correlation coefficient actually is: 0.958. Now do remember, the closer we get to the value of 1, the stronger the positive correlation is. And as we would expect, these two values, age and income, are likely going to be highly correlated.
And in a case like this, it's very possible that the causation is not due to one's age. It might be simply due to the fact that they have more work experience than somebody else does. It's important to note here that we don't want to confuse causality and correlation. While variables may be correlated with one another, that does not necessarily indicate that one variable is causing the other one to change.
Confusing the two could lead to wrong conclusions or poor predictions about the variables that we're actually looking at. So let's go back to the objectives in this lesson, just to make sure we covered them. The first was to know the limitations of the correlation coefficient, which we did cover. There's only so much the correlation coefficient can tell us.
It can tell us that there's a positive or negative association between variables, and it can tell us how strong it is, relatively speaking. But we don't necessarily know from looking at the correlation coefficient if there's any causal relationship between variables.
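One concrete way to see this limitation: the correlation coefficient comes out the same no matter which variable we treat as the first one, so it cannot point from a cause to an effect. The short sketch below checks that on a handful of made-up age and income values. These numbers are hypothetical and are not the ones plotted in this lesson, and `pearson_r` is just a helper name assumed for the sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical people, NOT the lesson's data.
age_years = [22, 27, 31, 36, 42, 48, 55, 61]
income_thousands = [28, 35, 41, 44, 58, 60, 72, 75]

r_age_income = pearson_r(age_years, income_thousands)
r_income_age = pearson_r(income_thousands, age_years)

# Swapping the roles of the two variables leaves r unchanged.
print(math.isclose(r_age_income, r_income_age))
```

Because the number is identical in both directions, it tells us the sign and the strength of the association and nothing about which variable, if either, is driving the other.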
And we understand the difference between variables that are correlated and those that are causally related. If they're causally related, we would expect there to be a relatively strong association, but we can't necessarily tell that by looking at the correlation coefficient alone. So again, my name is Dan Laub, and hopefully you got value from this lesson.