Source: Headache, public domain http://www.publicdomainpictures.net/view-image.php?image=15614 Calculator, public domain http://commons.wikimedia.org/wiki/File:Graphing_calculator.JPG All other Images and Graphs created by Author
This tutorial is going to teach you about correlation.
When we first described scatter plots, we talked about their form, their direction, and their strength. Form assessed the linearity. Direction said whether the association was positive or negative: do the data points tend to move up as you move to the right, or do they tend to move down? And then strength: how well do the points follow that form?
When the form is linear, we can use a number called correlation. It measures the strength and the direction of a linear relationship. The direction will be easy to spot. It will be a positive number if there's a positive association, and a negative number if there's a negative association. The numerical quantity will measure the strength.
In terms of direction, if the explanatory and response variables rise together, that's called a positive correlation. And if one falls as the other increases, that's called a negative correlation.
Here are some important facts about what we call the correlation coefficient. Its symbol is r. r is unitless. And it's a number that's always between negative 1 and positive 1. Its numerical size, relative to 1 or negative 1, indicates the strength of the linear association. Numbers that are close to negative 1 or positive 1 indicate a strong association between the two variables, with exactly 1 indicating a perfect positive linear association and exactly negative 1 indicating a perfect negative linear association. Numbers near zero, on the other hand, represent almost no linear relationship.
You can use this chart to help you understand. Values between 0.8 and 1 are considered a strong correlation; between 0.5 and 0.8, a moderate correlation; and between 0 and 0.5, a very weak correlation. The same ranges apply, with the signs flipped, between negative 1 and 0.
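The ranges in the chart can be written as a small lookup. This is only a sketch: the function name `describe_strength` is made up for illustration, and the chart does not say which label the boundary values 0.5 and 0.8 belong to, so putting them in the higher category is an assumption.

```python
def describe_strength(r):
    """Label the strength of a correlation using the chart's ranges.

    Assumption: the boundary values 0.5 and 0.8 go in the higher category.
    """
    if not -1 <= r <= 1:
        raise ValueError("a correlation must be between -1 and 1")
    size = abs(r)  # strength depends only on the size; the sign gives direction
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderate"
    return "very weak"


print(describe_strength(-0.99))  # strong
print(describe_strength(0.7))    # moderate
print(describe_strength(0.1))    # very weak
```

Taking the absolute value first is what makes the same ranges work on the negative side.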
Correlation is essentially the average of the products of the z-scores for the x's and the y's. The z-score for an x is that value of x minus the mean of x, divided by the standard deviation of x, and the same goes for y. One caveat: when we combine the products, we divide not by n but by n minus 1, so it's not exactly an average of these products, but it's almost the same. We will go through a fairly small example of doing this by hand.
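Here is that recipe as a short Python sketch. The function name and the five data points are made up for illustration; they are not the airfare data from the tutorial. It assumes the sample standard deviation (the n minus 1 version), which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev


def correlation_from_z_scores(xs, ys):
    """Sum the products of paired z-scores, then divide by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    products = [zx * zy for zx, zy in zip(z_x, z_y)]
    return sum(products) / (n - 1)


# Hypothetical data, chosen only to keep the arithmetic small.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_from_z_scores(xs, ys)
print(round(r, 3))  # 0.775
```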
So, let's take a look. These are destinations that we could fly to from Minneapolis-Saint Paul. And these are the distances from Minneapolis, and the airfare to fly to each of these places. The first thing we're going to do is convert the x's to z-scores. We're going to take the miles, which we believe to be the explanatory variable: the farther away a destination is, the higher you'd expect the airfare to be.
So, we'll take these miles and convert them into z-scores. We need the mean and the standard deviation to do that. So we did 460 minus the mean divided by 619, and we got negative 0.864. We did the same thing for Los Angeles at 1870, and all of the other ones. We'll do the same thing with the y values, the airfare values. So 379 minus 304 divided by 91 gives us 0.825. And we did the same thing with all the rest of them.
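The airfare z-score can be checked with a couple of lines. This is a sketch using only the numbers quoted above (379, a mean of 304, and a standard deviation of 91); the helper name `z_score` is made up for illustration.

```python
def z_score(value, mean, sd):
    """How many standard deviations a value sits above or below the mean."""
    return (value - mean) / sd


z_airfare = z_score(379, 304, 91)
print(round(z_airfare, 3))  # 0.824
```

With these rounded inputs the result is 0.824 rather than the 0.825 quoted above; the small gap is presumably because the mean and standard deviation were rounded before being read out.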
The next thing we're going to do is multiply the corresponding z-scores. Multiplying negative 0.864 times 0.825 gives us this number, and so on down the list. Next we add the products up. The sum here ends up being positive 2.11. And finally we divide by the number of observations minus 1, in this case four. Dividing by four yields r, a correlation of 0.527. So, as we saw before, it's a positive but fairly weak association, and we can see that from the scatter plot.
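The last step is a single division. This sketch uses the rounded sum of 2.11 quoted above and five observations; the variable names are made up for illustration.

```python
sum_of_products = 2.11  # rounded sum of the z-score products, as quoted
n = 5                   # number of destinations in the data set

r = sum_of_products / (n - 1)
print(f"{r:.4f}")  # about 0.5275
```

Because the sum was already rounded to 2.11, this lands on about 0.5275, which agrees with the reported 0.527 up to rounding.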
Now if you think that this is a ridiculously cumbersome process to go through, it can be, especially if the data set is larger than the five items that we just had in the previous data set. But take heart. Correlation is almost always found using technology. So if it frustrates you, reach for your calculator and there's definitely going to be a way to figure out the correlation from there.
We can also use a spreadsheet such as Excel. If we have this information listed, all we have to do is type in the command =CORREL(, short for correlation. Then we select all the things we believe to be the x's, type a comma, and select all of the things we believe to be the y's. Close the parentheses and hit Enter. And sure enough it gives us the 0.527 that we got before. This is a faster way to do it, especially with larger data sets.
Taking a look, here are some examples of scatter plots with different correlation coefficients. This is a very nearly linear data set. And so its correlation is negative 0.99. It's a negative association, very strong.
This you can see is a negative association, but it's not terribly strong.
This is a correlation of zero. The points form a shapeless cloud. There's no linear association between the x's and the y's.
Here's a positive 0.7, which means it's a moderate to fairly strong association. We can see the upward trend. If we put an oval around the points, it would be longer than it is wide.
Here's a positive 0.9. Now, you can see there's a big difference between how tightly the points cluster at 0.99 and at 0.9, but this is still a very clear, linear association.
And then this is a very weak, positive association.
Now just a caution. You're going to hear the word correlation thrown around a lot in everyday speech. And very common errors are often made, something like "there's a strong correlation between Type 2 diabetes, physical inactivity, and obesity." Well, it's possible that yes, they're related, but you can't use the word correlation. Correlation only describes the linear relationship between two quantitative variables. Type 2 diabetes is categorical: either you have it, or you don't. Physical inactivity could be quantitative, but it's not obviously quantitative. And obesity, that's certainly categorical.
And then here. IQ is quantitative but religious affiliation is categorical. You can't calculate correlation between the two.
And so to recap, correlation measures the strength and direction of a linear relationship between two quantitative variables on a scatter plot. Strong associations have correlation coefficients near positive 1 or negative 1. Weak associations have correlation coefficients near zero. And almost always, you can find it using technology: your calculator, some kind of Internet applet, a spreadsheet, or any of a lot of other ways. And because of the way that r is calculated, where you're multiplying z-scores, it doesn't matter which variable we call the explanatory and which we call the response.
Good luck and we'll see you next time.
Correlation: The strength and direction of a linear association between two quantitative variables.
Correlation Coefficient: The numerical value between -1 and +1 that measures the correlation between two quantitative variables.