This tutorial is going to teach you about the coefficient of determination. It is also called R squared, because it is the square of the correlation coefficient.
The correlation coefficient gives a general measure of the strength and direction of a linear relationship. R squared gives you something much more specific: it tells you the percent of the variation in the Y direction that can be explained by the linear relationship with the X variable. That can be a little confusing at first, so an example helps.
Coefficient of Determination (r²)
A value that explains the percent of variation in the response variable that can be explained by a linear association with the explanatory variable. It is the square of the correlation coefficient.
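One common way to write this definition as a formula, assuming the line is a least-squares line with an intercept. The symbols are introduced here for illustration and are not from the original tutorial: y_i are the observed response values, \hat{y}_i are the values predicted by the line, and \bar{y} is the mean of the response values.

```latex
r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

The fraction is the share of the variation that the line leaves unexplained, so r² is the share that it does explain.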
Look at this example. The graph here is a dot plot.
It's drawn along the y-axis, but it's a dot plot of seafood prices in 1980. This is going to be your Y variable, but on its own it's not very well contextualized. You would still wonder: why are sea scallops so expensive? What would cause that price to be so high, while other prices are so low?
What you can do is add a second variable to understand why the 1980 price of sea scallops was so high while some of the other prices were so low. Look at the 1980 prices together with the 1970 prices to explain why some of them are high, low, or in the middle.
The low prices were low in 1970, and the high prices were high in 1970. Looking at the 1980 prices divorced from their previous context doesn't really help to explain why certain prices are high or low. Looking at them with the full context of the earlier prices helps to explain why this point is high up and why these points are low. It's high up because the 1980 prices are strongly linearly associated with the 1970 prices: seafood that was expensive in 1970 tended to be expensive in 1980 as well.
The value of R squared in this particular example is 0.935.
Because R squared is 0.935, 93.5% of the variation in 1980 prices can be explained by a linear relationship with 1970 prices.
You might be wondering what happened to the other 6.5% of the variation. How is that explained? The reason R squared is not 100% is that these points don't all lie perfectly on a line. If they did, all of the variation in the 1980 prices would be explained by the 1970 prices. But they don't lie exactly on a line.
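To see where a figure like 93.5% comes from, here is a minimal sketch in Python using NumPy. The price values below are hypothetical, made up for illustration, and are not the tutorial's seafood data, so the result will not be 0.935. The sketch computes R squared two ways, once by squaring the correlation coefficient and once as the fraction of variation explained by a least-squares line, and the two agree.

```python
import numpy as np

# Hypothetical prices for illustration only (NOT the tutorial's seafood data).
prices_1970 = np.array([10.0, 20.0, 30.0, 40.0, 50.0])   # explanatory variable (X)
prices_1980 = np.array([25.0, 38.0, 70.0, 85.0, 120.0])  # response variable (Y)

# Way 1: square the correlation coefficient.
r = np.corrcoef(prices_1970, prices_1980)[0, 1]
r_squared = r ** 2

# Way 2: fit a least-squares line and measure how much of the
# variation in Y the line accounts for.
slope, intercept = np.polyfit(prices_1970, prices_1980, 1)
predicted = slope * prices_1970 + intercept
unexplained = np.sum((prices_1980 - predicted) ** 2)          # variation left over around the line
total = np.sum((prices_1980 - prices_1980.mean()) ** 2)       # total variation in Y
explained_fraction = 1 - unexplained / total

print(f"r = {r:.4f}")
print(f"r squared = {r_squared:.4f}")
print(f"fraction of variation explained = {explained_fraction:.4f}")
```

Multiplying the last value by 100 gives the percent of variation in the Y variable explained by the linear relationship with the X variable.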
There are some points that fall conspicuously a little bit below what you would imagine the line to look like. The remaining 6.5% of variation has to be explained by something else. Maybe some species of fish were overfished, and that raised prices. Or people's tastes changed and the demand for a particular fish fell, and that lowered the price.
Ultimately, R squared is never negative (it falls between 0 and 1), and it does help to measure the strength of the linear association. But it measures something very specific, and it doesn't indicate the direction. It can only indicate the strength.
For instance, both of these two scatterplots have the same R squared of 0.81, although clearly the one on the left has a positive association and the other one has a negative association.
If only R squared is given, you have to take the square root to obtain the size of the correlation. But you also have to look at the association, positive or negative, to determine the sign. So the scatterplot with the positive association has a correlation coefficient of 0.9, and the one with the negative association has a correlation coefficient of negative 0.9.
The coefficient of determination allows you to understand the percent of variation in the vertical direction that can be explained by the linear association between the two variables. If you solve for R from R squared, you need to not only take the square root but also look at the scatterplot to determine the sign, because R squared can't be negative but R can.
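The two-step recovery of the correlation coefficient can be sketched in Python. The function name and the "positive"/"negative" argument are hypothetical choices made for this illustration. The direction has to be supplied from the scatterplot, because R squared alone does not contain it.

```python
import math

def correlation_from_r_squared(r_squared: float, direction: str) -> float:
    """Recover r from R squared, using the direction read off the scatterplot.

    `direction` must be "positive" or "negative" (a hypothetical convention
    chosen for this sketch).
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R squared must be between 0 and 1")
    if direction not in ("positive", "negative"):
        raise ValueError('direction must be "positive" or "negative"')

    magnitude = math.sqrt(r_squared)  # strength only, always non-negative
    return magnitude if direction == "positive" else -magnitude

# Both scatterplots have R squared = 0.81 but opposite directions.
r_up = correlation_from_r_squared(0.81, "positive")
r_down = correlation_from_r_squared(0.81, "negative")
print(round(r_up, 4), round(r_down, 4))
```

Squaring either result gives back 0.81, which is why the sign cannot be recovered from R squared by itself.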
Source: This work adapted from Sophia Author Jonathan Osters.