Correlation

Source: Correlation Examples; Public Domain: http://en.wikipedia.org/w/index.php?title=File:Correlation_examples2.svg&page=1

This is your tutorial on correlation. Correlation is a way to describe the tendency of the data-- so how closely two variables are associated. And there are three major types of correlation. You can have positive correlation, when both variables are moving in the same direction. They're both increasing or both decreasing.

You can have negative correlation, and that's when one variable is increasing while the other variable is decreasing-- or the reverse-- when one variable's decreasing, the other variable's increasing. Or you could have zero correlation, where the variables are randomly or non-linearly associated.

So with correlation, we have this thing called the correlation coefficient. It's represented with a lowercase r. The correlation coefficient is a quantitative way to describe the association of the two variables. So it puts a number on how closely the variables are associated. And it's a measure of the strength of only the linear association, and that's something that's important to keep in mind.

Even if your r value tells you that the two variables are not associated, it just means they're not linearly associated. Now, I might not always remember to say linearly, and some people don't always remember it, but it's a good thing to keep in mind because they could have a curved association or they could be making some other sort of shape.

We'll see some examples in a minute. But the important thing to remember about r is it ranges from negative 1 to 1. So if it has a strong association, it's going to be close to negative 1 or 1. Something that has no association-- no linear association-- is going to have an r of 0. If it's weak, it's going to be pretty close to 0.

So that could be like negative 0.2 or positive 0.1. And then moderate is kind of in-between-- so a value like 0.6 or negative 0.65. Those r's are going to be pretty moderate. It's not quite strong, but it's not weak either. This shows several scatter plots for different values of correlation.

One important thing to remember is that r is unitless, so these numbers that we see-- 1, 0.4, negative 0.4-- there's no unit for any of those. In this bottom row here, all of these scatter plots have a correlation of 0. Now, we're only talking about linear correlation here, so here we can see that the scatter plot does have some sort of relationship, but it's not making a line. It's making more of a curve.

Similarly here, you can see the circle. And again, it doesn't have a linear relationship, so it does not have any linear correlation, but there's still some sort of relationship going on with the data. Now, another scatter plot that shows no linear association is this horizontal line right here. Strictly speaking, r is undefined for a perfectly flat line, because y never varies, so there's no linear association to measure. And then similarly, this cluster of dots also has a correlation of 0.
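The curved case described above can be checked numerically. This is a minimal sketch, not part of the original tutorial: the data points are hypothetical (a perfect parabola, y = x squared), and `pearson_r` is an illustrative helper name that implements the standard sum-based formula for r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from raw sums (illustrative helper)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data: y is completely determined by x, but along a curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r_curve = pearson_r(xs, ys)
print(r_curve)
```

Here y depends entirely on x, yet the numerator of the formula is zero because the rising half of the curve cancels the falling half. So an r of 0 rules out a linear pattern, not every pattern.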

Now, when we look at this middle row, we see a lot of straight lines. With these very tight straight lines, they either have a correlation of 1, a strong positive correlation, or a correlation of negative 1, a strong negative correlation. Now, here, this one is the exception because it's horizontal. Its slope is 0 and y never changes, so it shows no linear association.

But with a correlation of 1 or negative 1, whether the slope is positive or negative, you can see that the points make a straight line. Now, this top row shows some of the variation in between. So we, again, see the positive 1, the straight line with a positive slope, and the negative 1, the straight line with a negative slope, with the 0, a smattering of points with no linear relationship, in between. Now, this positive 0.4 and negative 0.4 are both showing a fairly weak correlation.

We can kind of start to see a trend, but again, it's pretty weak. There's a lot of scatter, and the pattern isn't very clear. Then with the moderate ones, the negative 0.8 and the positive 0.8, the linear relationship is becoming a bit more defined, but there are still some points that are a little further away from the trend line than others.

And then the 1 and the negative 1 we would classify as strong. And then the 0 in the middle is [INAUDIBLE] same with these 0's down here. So this has introduced you to the variety of different kinds of correlations that we can see in scatter plots.

Now, r is typically calculated with a statistical program like Excel, or with a calculator. However, you can do it by hand. There are a bunch of formulas for r that are written slightly differently, but they all mean the same thing. So it could look like this or it could look like this. I'm going to use this formula when I do my calculations, but other tutorials on other websites might walk you through another way.
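The formulas shown on screen are not captured in the transcript. As an assumption, two standard and algebraically equivalent ways of writing r are sketched below: a form based on deviations from the means, and a raw-sum form that uses the same pieces the calculation in this tutorial relies on (the sums of x, y, xy, x squared, and y squared, plus n).

```latex
% Form based on deviations from the means \bar{x} and \bar{y}
r = \frac{\sum (x - \bar{x})(y - \bar{y})}
         {\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}

% Raw-sum (computational) form, with n the number of (x, y) pairs
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[\, n\sum x^2 - \left(\sum x\right)^2 \right]
                \left[\, n\sum y^2 - \left(\sum y\right)^2 \right]}}
```

The second form is usually the more convenient one for hand calculation, because it needs only column totals and not the means.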

So first we have our problem, or our data set. We have a set of subjects, and their age-- which is the x-- and the glucose level-- which is the y. In order to do our correlation calculations, we need a couple of pieces. This sigma here means sum, so we need the sum of x times y. We need the sum of x, the sum of y, the sum of x squared, and the sum of y squared.

So we're just adding those pieces together. In order to keep my work straight, I like to make a chart. So I like to have x times y as a column, and x squared as a column, and y squared as a column. This is going to help me to keep my information together. Now, I'm going to fill those in because I've already calculated them out.

So I'm going to pause the video while I do that, so you should be calculating those on your own. So for here, you're doing 43 times 99-- goes in there. x squared is 43 times 43. y squared is 99 times 99. So here I've completed my chart with the x times y, x squared, and the y squared.

So right here, you can double check all of your work to make sure you got the same answers, but for this bottom row, I did 57 times 87 to get 4,959, I did x squared 57 times 57 to get 3,249, and then 87 times 87 to get 7,569. Now, I also added a column-- or a row down the bottom for sigma, the sums, because that's what I actually need. That's the most important part.
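The per-row arithmetic for the chart can be sketched as follows. Only the two rows actually quoted in the transcript, (43, 99) and (57, 87), are used; the other rows of the data set are not reproduced here, and the dictionary keys are illustrative names, not the tutorial's own labels.

```python
# Only the two (age, glucose) rows quoted in the transcript.
rows = [(43, 99), (57, 87)]

# Build one chart row per subject: x*y, x squared, and y squared.
chart = [
    {"x": x, "y": y, "xy": x * y, "x2": x ** 2, "y2": y ** 2}
    for x, y in rows
]

for row in chart:
    print(row)
```

Summing each column over all five subjects then gives the bottom row of sums that goes into the formula.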

So in order to get this 188, I added up all of the x values. 405 is adding up all of the y's, 15,706 is adding up all of the x times y's, 7,928 is adding up all the x squareds, and 33,461 is adding up all the y squareds. So this row down the bottom here is the most important piece. It's the one that we're going to be inputting into our formula.

So now, first, what I did is I just matched each of the pieces of the formula and replaced them with the numbers we found. The only other thing is this n-- that's the number-- so how many pairs of data were in our set. So we had five. Then I took all of this and entered it into my calculator.
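The substitution can be written out as below. This assumes the formula on screen is the standard raw-sum form of r (the transcript does not capture it), using the sums quoted above with n = 5.

```latex
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[\, n\sum x^2 - \left(\sum x\right)^2 \right]
                \left[\, n\sum y^2 - \left(\sum y\right)^2 \right]}}
  = \frac{5(15{,}706) - (188)(405)}
         {\sqrt{\left[\, 5(7{,}928) - 188^2 \right]
                \left[\, 5(33{,}461) - 405^2 \right]}}

  = \frac{78{,}530 - 76{,}140}
         {\sqrt{(39{,}640 - 35{,}344)(167{,}305 - 164{,}025)}}
  = \frac{2{,}390}{\sqrt{4{,}296 \times 3{,}280}}
```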

I made sure to go carefully because it's really easy to make a mistake on these problems, because there are so many different pieces, parentheses, and exponents. That's why we typically use computers. This simplified down to 2,390/3,753.8.

When I divided those, I got 0.6367. So our r value, our correlation coefficient, is positive 0.6367. An r value can be negative, so if you get a negative answer on some other data set, that's OK. In this case, it's not.
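Since the tutorial notes that this calculation is usually left to a computer, here is a minimal sketch that redoes it from the five sums quoted above. It assumes the standard raw-sum formula for r, and the variable names are illustrative.

```python
import math

# Column totals quoted in the tutorial, with n = 5 subjects.
n = 5
sum_x, sum_y = 188, 405
sum_xy, sum_x2, sum_y2 = 15_706, 7_928, 33_461

# Raw-sum formula for the correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator

print(numerator)
print(round(denominator, 1))
print(round(r, 4))
```

The numerator comes out to 2,390, the square root in the denominator to roughly 3,753.8, and the quotient rounds to 0.6367.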

And this is telling us that our correlation is moderate. It's not super strong. It's not perfectly 1. It's not no correlation. It's not 0. But that data set for the subjects' ages and glucose levels has a moderate association. This has been your tutorial on correlation.