This tutorial is going to teach you about residuals, which arise when you fit a line to data points. Specifically, you will focus on residuals and residual plots.
When you create a best fit line, typically it doesn't pass through all the points.
The only way it would pass through all the points is if the correlation were exactly 1 or -1, which means that all the points lie exactly on a line. Most of the time, they don't lie exactly on a line. In that case, most of the points are going to have some difference between what the line predicts, which is in brown, and their actual values, which are the grey dots.
The line gives us predictions for what the payroll will be.
This is the payroll, in millions of dollars, for a football team against the number of thousands of dollars that the quarterback of that team makes. So 1,000 thousand dollars is $1 million.
These are the predicted payrolls for a team whose quarterback makes a certain amount of money.
The predicted payroll, payroll hat, is equal to $18.8 million plus 3 times the quarterback's salary, with the salary expressed in millions of dollars.
Because these are predictions, they'll be off a little bit from the actual values, even if only by a little.
Even this point above doesn't lie exactly on the line; if it did, it would have a residual value of zero. A residual is the amount by which a prediction is off from the actual amount.
Residual
The difference between the actual value of the response variable for a particular data point and its predicted value from the regression line.
Consider the Dallas Cowboys, circled in the scatterplot.
They pay their quarterback $1.75 million, and they pay the overall team well above what the line would predict for a team that pays its quarterback that amount of money. We can look at this visually by figuring out what the predicted value should be for $1.75 million.
Payroll hat, the predicted payroll, should be $24.05 million for the team. However, when you look at the Dallas Cowboys' actual payroll, it is about $28 million.
That's over $4 million more than the line would have predicted their payroll to be. This vertical distance between the $28.394 million that's actually being paid versus the $24.05 million that's being predicted is called the residual between those two values.
The difference is called the residual and is calculated by taking the actual response value, in this case the actual team payroll, minus the predicted response value, the predicted team payroll. It's y actual minus y predicted, or y minus y hat.
In this particular problem, the residual for the Dallas Cowboys ends up being $4.344 million. This is a positive number.
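The Dallas Cowboys calculation can be checked with a few lines of Python. This is only a sketch: the helper names `predicted_payroll` and `residual` are made up for illustration, and it assumes both amounts are entered in millions of dollars, which is what makes the tutorial's numbers work out.

```python
# Regression line from the tutorial: payroll hat = 18.8 + 3 * quarterback salary.
# Assumption: both amounts are expressed in millions of dollars.
INTERCEPT = 18.8
SLOPE = 3.0


def predicted_payroll(qb_salary_millions):
    """Return the payroll the line predicts for a given quarterback salary."""
    return INTERCEPT + SLOPE * qb_salary_millions


def residual(actual, predicted):
    """Residual = actual response minus predicted response."""
    return actual - predicted


cowboys_qb = 1.75        # quarterback salary, in millions
cowboys_actual = 28.394  # actual team payroll, in millions

cowboys_predicted = predicted_payroll(cowboys_qb)
cowboys_residual = residual(cowboys_actual, cowboys_predicted)

print(round(cowboys_predicted, 3), round(cowboys_residual, 3))
# prints: 24.05 4.344
```

The residual is positive because the actual payroll is higher than the predicted payroll.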
Every point has a residual value. If the actual response is above the predicted response, which visually means the point lies above the line, then the residual is positive. Conversely, if the point falls below the line on the scatter plot, that means the actual response is lower than the predicted response, and the residual will be negative.
If by some freak chance the point falls exactly on the line, the residual value is zero. Since every point has a residual value, instead of graphing quarterback pay against team payroll, you can graph quarterback pay against the residual amount.
The residuals will be in millions of dollars. This graph, where you see how far off the predictions are, is called a residual plot. A residual plot is really useful because it can help you evaluate whether or not a line is actually a useful predictor for your data. A good linear model will have residuals scattered randomly above and below the zero line, no curved pattern in the residuals, and equal variability throughout the entire residual plot.
Residual Plot
A scatter plot that plots residuals vs. the explanatory variable, as opposed to the response variable vs. the explanatory variable. It can be used to assess the fit of a line.
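To see where the points of a residual plot come from, here is a minimal sketch in plain Python. The six data points are made up for illustration and are not the tutorial's payroll data. The `fit_line` helper is also hypothetical; it uses the standard least-squares formulas for the slope and intercept.

```python
def fit_line(xs, ys):
    """Least-squares line: return (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data (NOT from the tutorial): quarterback pay and team payroll,
# both in millions of dollars.
qb_pay = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
team_payroll = [20.9, 21.2, 24.1, 24.3, 27.0, 27.4]

intercept, slope = fit_line(qb_pay, team_payroll)

# Residual = actual response minus predicted response, one per data point.
residuals = [y - (intercept + slope * x) for x, y in zip(qb_pay, team_payroll)]

# A residual plot graphs these pairs: explanatory variable vs. residual.
residual_plot_points = list(zip(qb_pay, residuals))

for x, r in residual_plot_points:
    print(f"QB pay {x:.1f}  residual {r:+.3f}")
```

With this made-up data the residuals alternate above and below zero with no obvious pattern, which is what you hope to see when a line is a reasonable model.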
This is a bad choice for a best fit line.
It does have points above and below as residuals but it's not randomly scattered like the original one was. The original one had points all over the place, above and below, all throughout the entire pattern.
This one has points that are below only on the left, and points that are above only on the right.
That's what makes this line a poor choice for a line of best fit.
Now look at a different data set, one that is curved.
It doesn't make sense to use a line to predict this. You can verify that from the residual plot. What you see is a curved pattern in the residual plot, which also means that the scatter is not very random. What a curved pattern in the residual plot implies is that there is a better fit than a line for your data.
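The curved pattern can be reproduced with a small sketch. The data here is made up: the response is simply the square of the explanatory variable, so the points lie on a curve. The `fit_line` helper is a hypothetical name for the usual least-squares formulas.

```python
def fit_line(xs, ys):
    """Least-squares line: return (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical curved data: y = x squared for x = 0, 1, ..., 10.
xs = list(range(11))
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Record whether each point sits above (+) or below (-) the fitted line.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)
# prints: ++-------++
```

The residuals are positive at both ends and negative through the middle. That U shape is the curved pattern that tells you a line is not the best model for this data.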
Finally, consider equal variability throughout the pattern. This residual plot shows a sort of trumpet pattern, where the variability gets wider. The line is a good fit at the beginning because the residuals are small, but it's a poor fit at the end, where the residuals are getting larger.
You can see that also in this scatter plot.
Some points are close to the line and fit it well, and others are far from the line and do not.
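One rough way to check for a trumpet pattern without a graph is to compare the typical size of the residuals on the left half of the plot with those on the right half. The sketch below does this on made-up data whose scatter grows as the explanatory variable grows. The data, the `fit_line` helper, and the left/right comparison are all assumptions for illustration, not a formal test.

```python
def fit_line(xs, ys):
    """Least-squares line: return (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: a linear trend plus a wobble that grows with x.
xs = list(range(1, 11))
ys = [2 * x + ((-1) ** x) * 0.3 * x for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Average absolute residual in each half of the plot.
half = len(residuals) // 2
left_spread = sum(abs(r) for r in residuals[:half]) / half
right_spread = sum(abs(r) for r in residuals[half:]) / (len(residuals) - half)

print(f"left spread:  {left_spread:.2f}")
print(f"right spread: {right_spread:.2f}")
```

Here the right-hand spread comes out clearly larger than the left-hand spread, which matches the widening trumpet shape: the line predicts well for small values of the explanatory variable and poorly for large ones.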
Residuals measure how far the data points are from the line of best fit. They're positive if a point lies above the line, negative if it falls below the line, and zero if it falls on the line.
You can use the resulting residual plot to determine if a line is actually an effective model for predicting your data.
Good luck.
Source: This work adapted from Sophia Author Jonathan Osters.