In this tutorial, you're going to learn how to find a line of best fit using the method of least squares.
When you look at data on a scatterplot, there are lots of lines that provide good fits for the data. You can usually eyeball them. In fact, there are many different criteria you can use to decide what counts as a best-fit line. The one that we're going to use is called the method of least squares.
This is the price of different types of seafood in 1970 and in 1980.
Unsurprisingly, the most expensive ones in 1970 were still the most expensive in 1980, so not too shocking. And the trend is linear. Choose the regression line to be: 1980 price predicted (that's what the hat is for) is equal to three times the 1970 price.
If you take a look at that line, it seems fair.
There are a couple points that are noticeably lower than the line, so what makes it a good fit?
What you notice is that with every line that is made, regardless of whether it's a good fit or a bad fit, we end up with residuals.
Every point has a residual. If it lies exactly on the line, its residual is zero. The three points above have fairly large negative residuals. So if you look at all the residuals put together, the sum of the residuals is negative 189.
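The residual bookkeeping described here can be sketched in a few lines of Python. The prices below are made-up illustration values in cents per pound, not the seafood data from the tutorial, and the name `predict` is just a label chosen for this sketch. A residual is taken as the observed 1980 price minus the predicted one, so points below the line get negative residuals.

```python
# Hypothetical prices in cents per pound (not the tutorial's seafood data).
prices_1970 = [10.0, 20.0, 30.0, 40.0]
prices_1980 = [32.0, 55.0, 90.0, 112.0]


def predict(price_1970):
    """Candidate line from the text: predicted 1980 price = 3 * 1970 price."""
    return 3.0 * price_1970


# Residual = observed value - predicted value, one per data point.
residuals = [y - predict(x) for x, y in zip(prices_1970, prices_1980)]
residual_sum = sum(residuals)

print(residuals)     # a point exactly on the line has residual 0
print(residual_sum)  # negative when points sit mostly below the line
```

With these invented numbers, one point lies exactly on the line, two lie below it, and the residuals add up to a negative total, which mirrors the situation in the tutorial on a much smaller scale.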
You want the sum of the residuals to be low because that would mean that the points are close to the line. To see whether negative 189 indicates a good fit, you can compare it to the sum for a line that you know for sure is a worse fit.
Here is a poor-fit line:
The 1980 price is going to be predicted to be 109.8. That means 109.8 cents per pound for everything. Regardless of whether it was really cheap in 1970 or really expensive in 1970, just say that it's 109.8 cents per pound for everything.
Now, that's a bad idea. This is a poor fit for a line. You can see a lot of points are above and below the line. It's not a good fit. It doesn't go through the pack. You can see visually there are some very large residuals here. But the problem is when you add up all the residuals, it's zero. What happened here?
Well, there are some large positive residuals that are canceled out by several fairly large negative residuals. They end up canceling each other out, so that even though this line is a poor fit for the data, the sum of the residuals is equal to zero. But we agreed that the first model was, in fact, a better fit than the second model. So how can that be reconciled?
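The cancellation can be reproduced with a small Python sketch. The tutorial does not say how the flat line's value was chosen. The sketch assumes the flat line sits at the mean of the 1980 prices, which is one way to get a residual sum of exactly zero. The data are made-up illustration values, not the seafood prices.

```python
# Hypothetical 1980 prices in cents per pound (not the tutorial's data).
prices_1980 = [32.0, 55.0, 90.0, 112.0]

# Assumption for this sketch: the flat line predicts the mean for every item.
flat_prediction = sum(prices_1980) / len(prices_1980)

# Residual = observed value - predicted value.
residuals = [y - flat_prediction for y in prices_1980]
residual_sum = sum(residuals)

print(flat_prediction)
print(residuals)     # large negative and large positive values
print(residual_sum)  # the positives and negatives cancel
```

Every individual residual here is far from zero, yet the total is zero, so the plain sum of residuals cannot tell a poor fit from a good one.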
What you do is not minimize the sum of the residuals. Instead, you use the method of least squares, which involves minimizing the sum of the squares of the residuals. What that means is the negative residuals, when you square them, become positive. The positive ones, when you square them, stay positive, so squaring removes the effect of positive and negative residuals canceling each other out. Now check, using the method of least squares, which line is a better fit.
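In symbols, the quantity being minimized can be written as follows. The notation is an assumption made for this sketch rather than something the tutorial defines: y_i is the observed 1980 price of item i, ŷ_i is the price the line predicts for it, and n is the number of items.

```latex
e_i = y_i - \hat{y}_i,
\qquad
\text{sum of squares} = \sum_{i=1}^{n} e_i^{2} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2}
```

Because each squared term is zero or positive, the total can only be zero when every point lies exactly on the line.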
Look at the sum of the squares of the residuals for this one. Imagine taking this distance here, squaring it, and then doing that for every residual and adding them up. You get 13,519 as the sum of squares.
Compare it to the poor-fit line. Sure enough, 13,519 is a lot smaller than 143,838, indicating that the first line is a better fit for the data than the poor-fit line.
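The same comparison can be sketched in Python. The data are made-up illustration values rather than the seafood prices, the helper name `sum_of_squared_residuals` is chosen here for readability, and the flat line is assumed to sit at the mean of the 1980 prices.

```python
# Hypothetical prices in cents per pound (not the tutorial's seafood data).
prices_1970 = [10.0, 20.0, 30.0, 40.0]
prices_1980 = [32.0, 55.0, 90.0, 112.0]


def sum_of_squared_residuals(xs, ys, predict):
    """Square each residual (observed - predicted) and add them up."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


mean_1980 = sum(prices_1980) / len(prices_1980)

# Line 1: predicted 1980 price = 3 * 1970 price.
sse_three_times = sum_of_squared_residuals(
    prices_1970, prices_1980, lambda x: 3.0 * x
)
# Line 2: the same flat prediction for every item (assumed to be the mean).
sse_flat = sum_of_squared_residuals(
    prices_1970, prices_1980, lambda x: mean_1980
)

print(sse_three_times)
print(sse_flat)
print(sse_three_times < sse_flat)
```

Unlike the plain sum of residuals, the sum of squares is much larger for the flat line, which matches what the scatterplot suggests visually.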
The first line is not even the best fit for the data, but it's better than the poor-fit one. The best-fit line is, in fact: 1980 price predicted is equal to 2.7 times the 1970 price minus 1.2 cents per pound.
In this case, with this line being the model, the sum of the squares of residuals is 9,326, which is even better than the 13,519. 9,326 is the smallest that the sum of squares can be, which makes this line the best-fit line.
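The tutorial does not show how the best-fit coefficients were computed. One standard way, sketched below, uses the closed-form formulas for simple linear regression: the slope is the sum of the products of the x and y deviations from their means divided by the sum of the squared x deviations, and the intercept makes the line pass through the point of means. The data and function names are assumptions for illustration, not the tutorial's seafood prices.

```python
# Hypothetical prices in cents per pound (not the tutorial's seafood data).
prices_1970 = [10.0, 20.0, 30.0, 40.0]
prices_1980 = [32.0, 55.0, 90.0, 112.0]


def least_squares_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares_line(prices_1970, prices_1980)
sse_best = sum_of_squares(prices_1970, prices_1980, slope, intercept)
sse_three_times = sum_of_squares(prices_1970, prices_1980, 3.0, 0.0)

print(slope, intercept)
print(sse_best, sse_three_times)
```

For these invented numbers, the fitted line has a smaller sum of squares than the "three times" line, just as 9,326 is smaller than 13,519 in the tutorial. No other straight line can go lower on the same data.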
The method of least squares requires that you minimize the sum of the squares of the residuals. The line that does this is the best-fit line. It is called the least-squares line or the least-squares regression line.
Source: This work adapted from Sophia Author Jonathan Osters.
Least-squares line: The regression line for which the sum of the squares of the residuals is the smallest.