Source: Tables and graphs created by Jonathan Osters
In this tutorial, you're going to learn about how to find a line of best fit using the method of least squares. We'll talk about exactly what that means as we go through.
So when you look at data on a scatterplot, there are lots of lines that provide good fits for the data. You can usually eyeball them. In fact, there are many different criteria you can use to determine what's considered a best-fit line. The one that we're going to use is called the method of least squares. So let's look at an example here.
This is the price of seafood, for different types of seafood, like sea scallops up here and trout down here, in 1980 versus in 1970. So unsurprisingly, the most expensive ones in 1970 were still the most expensive in 1980, so not too shocking. And the trend is linear. So suppose we choose the regression line to be: the predicted 1980 price (that's what the hat is for) is equal to three times the 1970 price.
OK, so let's take a look. Let's see what that looks like. It looks like that line. It seems fair. I mean, maybe it's not the best-fit line, but we'll go with it.
There are a couple of points that are noticeably lower (this one, and this one, and this one), and there really aren't any that are noticeably higher than the line, so maybe this isn't the greatest line. But overall, we see that it's a pretty good fit. So what makes it a good fit?
Well, what we notice is that with every line that we make, regardless if it's a good fit or a bad fit, we end up with residuals. Every point has a residual. If it lies exactly on the line, its residual is zero. These three points noticeably have fairly large negative residuals. And so if you look at all the residuals put together, the sum of the residuals is negative 189.
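To make the idea of a residual concrete, here is a minimal sketch in Python. The prices below are made up for illustration and are not the seafood data from the graph, and the helper name `residuals` is hypothetical. It assumes the usual convention that a residual is the observed value minus the predicted value, so points below the line get negative residuals.

```python
# Hypothetical prices in cents per pound (NOT the data from the tutorial's graph).
price_1970 = [10, 20, 30, 40, 50]
price_1980 = [28, 55, 80, 112, 130]

def residuals(xs, ys, slope, intercept):
    """Return observed minus predicted for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# The trial line from the example: predicted 1980 price = 3 * (1970 price).
res = residuals(price_1970, price_1980, slope=3, intercept=0)
total = sum(res)

print(res)    # one residual per point
print(total)  # the sum of the residuals
```

With these made-up numbers every point sits a little below the line, so every residual is negative and so is their sum.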
Now, I don't know if that's a good number, if that's like a low number. I guess obviously we'd like the sum of the residuals to be low because that would mean that the points are close to the line, you'd think. So what we can do is see whether negative 189 indicates a good fit by comparing it to the sum for a line that we know for sure is a worse fit. So let's go with a poor-fit line like this one.
The 1980 price is going to be predicted to be 109.8. That means 109.8 cents per pound for everything. Regardless of whether it was really cheap in 1970 or really expensive in 1970, we're just going to say that it's 109.8 cents per pound for everything.
Now, that's a bad idea. This is a poor fit for a line. We can see that all these points are below the line, and all these points are above the line. It's not a good fit. It doesn't go through the pack.
And you can see visually there are some very large residuals here. But the problem is when you add up all the residuals, it's zero. What happened here?
Well, we have some large positive residuals that are canceled out by adding together several of these fairly large negative residuals. And so they end up canceling each other out so that even though this line, this blue line, is a poor fit for the data, the sum of the residuals is equal to zero. But I think we can all agree that this first model was, in fact, a better fit than the second model. So how can we reconcile that?
What we're going to do is we're not actually going to minimize the sum of the residuals. What we're going to do is we're going to use the method of least squares, and this involves minimizing the sum of the squares of the residuals. And so what that means is the negative residuals, when you square them, become positive. The positive ones, when you square them, also become positive, so that this negates the effect of having positive and negative residuals that might cancel each other out. So now let's check using the method of least squares which line is a better fit.
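The cancellation problem, and how squaring fixes it, can be sketched in a few lines of Python. Again the prices are invented for illustration, and the helper names are hypothetical. The flat line here is placed at the mean of the 1980 prices, because a horizontal line at the mean always has residuals that sum to zero; whether the 109.8 in the example is that mean is an assumption, not something the tutorial states.

```python
# Hypothetical prices in cents per pound (NOT the data from the tutorial's graph).
price_1970 = [10, 20, 30, 40, 50]
price_1980 = [28, 55, 80, 112, 130]

def residuals(xs, ys, slope, intercept):
    """Return observed minus predicted for each point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def sum_of_squares(res):
    """Square each residual so negatives cannot cancel positives, then add."""
    return sum(r * r for r in res)

# Poor-fit line: predict the same price (the mean) for everything.
mean_1980 = sum(price_1980) / len(price_1980)
flat = residuals(price_1970, price_1980, slope=0, intercept=mean_1980)

# Reasonable line: predicted 1980 price = 3 * (1970 price).
sloped = residuals(price_1970, price_1980, slope=3, intercept=0)

print("flat line:  ", sum(flat), sum_of_squares(flat))
print("sloped line:", sum(sloped), sum_of_squares(sloped))
```

On these numbers the flat line wins if you only look at the plain sum of residuals, but it loses badly once the residuals are squared, which matches what your eye tells you about the two lines.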
So let's look at this one. Take the sum of the squares of the residuals: imagine taking this distance here, squaring it, and then doing that for every residual and adding them up. You get 13,519 as the sum of squares. Now, I don't know if that's a high number or a low number, but let's compare it to the poor-fit line. And sure enough, this is a lot smaller than 143,838, indicating that this red line is a better fit for the data than the blue line.
Now, this red line is not even the best-fit line for the data, but it's better than the blue one. The best-fit line is, in fact, predicted 1980 price is equal to 2.7 times the 1970 price, minus 1.2 cents per pound. In this case, with this line being the model, the sum of the squares of the residuals is 9,326, which is even better than the 13,000 and some change that we had before. This is the smallest that the sum of squares can be, which makes this line the best-fit line.
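The tutorial doesn't say how the best-fit line itself is calculated. For one explanatory variable, the standard textbook formulas are: slope equals the sum of (x minus mean of x) times (y minus mean of y), divided by the sum of (x minus mean of x) squared; and intercept equals mean of y minus slope times mean of x. Here is a minimal sketch of that in Python, using invented prices rather than the seafood data, with hypothetical function names.

```python
# Hypothetical prices in cents per pound (NOT the data from the tutorial's graph).
price_1970 = [10, 20, 30, 40, 50]
price_1980 = [28, 55, 80, 112, 130]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_of_squared_residuals(xs, ys, slope, intercept):
    """Add up (observed - predicted)^2 over all points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

slope, intercept = least_squares_line(price_1970, price_1980)
best = sum_of_squared_residuals(price_1970, price_1980, slope, intercept)

# Compare against the eyeballed line: predicted = 3 * (1970 price).
guess = sum_of_squared_residuals(price_1970, price_1980, 3, 0)

print(slope, intercept)
print(best, guess)
```

Any other slope and intercept you try on the same data should give a sum of squares at least as large as `best`, which is exactly what makes this the least-squares line.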
And so to recap, the method of least squares requires that we minimize the sum of the squares of the residuals. The line that does this is the best-fit line. We call it the least-squares line or the least-squares regression line.
Good luck, and we'll see you next time.
Terms to Know
Least-Squares Line
The regression line for which the sum of the squares of the residuals is the smallest.