This tutorial is going to teach you about correlation.
When first describing scatter plots, you learned about their form, their direction, and their strength.
Form assesses linearity. Direction says whether the association is positive or negative: do the data points tend to move up as you move to the right, or do they tend to move down? Strength shows how well the points follow that form.
When the form is linear, we can use a number called correlation. It measures the strength and the direction of a linear relationship. The direction will be easy to spot. It will be a positive number if there's a positive association, and a negative number if there's a negative association. The numerical quantity will measure the strength.
In terms of direction, if the explanatory and response variables rise together, that's called a positive correlation. And if one falls as the other increases, that's called a negative correlation.
Some important facts about what we call the correlation coefficient: It is denoted r and is unit-less. It's a number between negative 1 and positive 1. Its numerical size, relative to 1 or negative 1, indicates the strength of the linear association. Numbers that are close to negative 1 or positive 1 indicate a strong association between the two variables, with exactly 1 indicating a perfect positive linear association, and exactly negative 1 indicating a perfect negative linear association. Numbers near zero represent almost no linear relationship.
You can use the chart below to help you understand. Numbers between 0.8 and 1 are considered to show a strong correlation; between 0.5 and 0.8, a moderate correlation; and between 0 and 0.5, a very weak correlation. The same ranges apply, with a negative direction, between negative 1 and 0.
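The cutoffs above can be written as a small lookup. This is only a sketch of the rule of thumb from this tutorial: the function name describe_correlation is made up for illustration, and how to treat values that land exactly on 0.5 or 0.8 is an assumption, since the text does not say.

```python
def describe_correlation(r):
    """Label r using this tutorial's rule-of-thumb cutoffs (0.5 and 0.8)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")

    size = abs(r)  # strength depends only on the size, not the sign
    if size >= 0.8:      # assumption: a value of exactly 0.8 counts as strong
        strength = "strong"
    elif size >= 0.5:    # assumption: a value of exactly 0.5 counts as moderate
        strength = "moderate"
    else:
        strength = "very weak"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction


print(describe_correlation(0.9))
print(describe_correlation(-0.99))
print(describe_correlation(0.3))
```

The sign is handled separately from the size, which mirrors the idea that the same ranges apply on the negative side.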
Correlation is essentially the average of the products of the z-scores for the x's and the y's. Each z-score is a value of x minus the mean of x, divided by the standard deviation of x. And the same thing for y.
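Here is a minimal sketch of that definition in Python. The numbers in xs and ys are invented purely for illustration (they are not the airfare data), and the "average" divides by the number of observations minus 1, using the sample standard deviation for the z-scores.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert each value to (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Sum the products of paired z-scores, then divide by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1)


# Hypothetical data, chosen only to exercise the function.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(r)
```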
These are destinations that we could go to from the city of Minneapolis-Saint Paul with the distances away from Minneapolis and the airfare to fly to any of these places.
The first thing we're going to do is convert the x's to z-scores. Miles is taken to be the explanatory variable, the one believed to cause airfare to rise. Take these miles and convert them into z-scores.
You need the mean and the standard deviation to do that. Take 460 minus the mean, divided by the standard deviation of 619, which gives you negative 0.864. Do the same thing for Los Angeles at 1870, and all of the other ones.
Next, do the same thing with the y values, the airfare values. 379 minus the mean of 304, divided by the standard deviation of 91, gives us 0.825. Then do the same thing with all the rest of them.
Next, multiply the corresponding z-scores, negative 0.864 times 0.825, all down the rows, then add them up.
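The airfare z-score can be checked with a couple of lines. The function name z_score is just for illustration. Note that 304 and 91 are the rounded mean and standard deviation quoted above, so the result may differ from 0.825 in the third decimal place.

```python
def z_score(value, mean, sd):
    """How many standard deviations a value sits above (+) or below (-) the mean."""
    return (value - mean) / sd


# Figures quoted in the text: airfare 379, mean 304, standard deviation 91.
z_airfare = z_score(379, 304, 91)
print(z_airfare)
```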
The sum here ends up being positive 2.11.
Finally, you're going to divide by the number of observations minus 1, in this case, four. So 5-1.
Dividing by four yields r, a correlation of 0.527. It's a positive but fairly weak association, and the scatter plot shows the same thing.
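The last step is a single division, sketched below. It assumes five destinations, which is what a divisor of four implies, and it uses the rounded sum of 2.11, so the result agrees with 0.527 only up to rounding.

```python
sum_of_products = 2.11  # rounded sum of the z-score products, from the text
n = 5                   # assumption: five destinations, so n - 1 = 4

r = sum_of_products / (n - 1)
print(r)
```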
This is a cumbersome process to go through, and the correlation coefficient is almost always found using technology.
In Excel, if we have this information listed, all you have to do is type =CORREL( , short for correlation, select all the values believed to be the x's, type a comma, and select all the values believed to be the y's. Close the parentheses and hit Enter.
Sure enough it gives you the 0.527 that you got before.
Scatter plots with different correlation coefficients.
This is a very nearly linear data set, so its correlation is negative 0.99. It's a negative association, and a very strong one.
This you can see is a negative association, but it's not terribly strong.
This is a correlation of zero. It has a very cloudy association. There's no linear association between the x's and the y's.
Here's a positive 0.7, which means it's a moderate to fairly strong association. We can see the upward association. If we put an oval around it, it would be longer than it is wide.
Here's a positive 0.9. You can see there's a noticeable difference between a 0.99 and a 0.9, but this is still a very clear, linear association.
And then this is a very weak, positive association.
Just a caution: you're going to hear the word correlation thrown around a lot in everyday speech, and it is often misused. A common error is a statement such as "there's a strong correlation between Type 2 diabetes, physical inactivity, and obesity." Though it's possible that they are related, you can't use the word correlation here. Correlation only describes the linear relationship between two quantitative variables. Type 2 diabetes is categorical: either you have it or you don't. Physical inactivity could be quantitative, but it's not obviously quantitative. Obesity, as a yes-or-no condition, is categorical. Likewise, IQ is quantitative but religious affiliation is categorical, so you can't calculate a correlation between the two.
Correlation measures the strength and direction of a linear relationship between two quantitative variables on a scatter plot. Strong associations have correlation coefficients near positive 1 or negative 1. Weak associations have correlation coefficients near zero. Almost always, you find it using technology: your calculator, an Internet applet, a spreadsheet, or another tool. Because of the way that r is calculated, by multiplying z-scores, it doesn't matter which variable we call the explanatory and which we call the response.
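That last point, that swapping the two variables leaves r unchanged, can be checked with a short sketch. The miles and fares below are made-up numbers for illustration only, and the helper names are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert each value to (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Sum the products of paired z-scores, then divide by n - 1."""
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return sum(products) / (len(xs) - 1)


# Hypothetical values, not the data from the tutorial.
miles = [100, 200, 300, 400, 500]
fares = [120, 150, 140, 180, 210]

r_miles_first = correlation(miles, fares)
r_fares_first = correlation(fares, miles)
print(r_miles_first, r_fares_first)
```

Each product is zx times zy, and multiplication does not care about order, so both calls give the same r.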
Good luck!
Source: This work adapted from Sophia Author Jonathan Osters.
Correlation: The strength and direction of a linear association between two quantitative variables.
Correlation Coefficient: The numerical value between -1 and +1 that measures the correlation between two quantitative variables.