Least-Squares Line

Hi. This tutorial covers the least-squares line. So let's start with some data.

All right. So this is a pretty simple data set, only 5 values. Here's the raw data here. And then here's the data as a scatter plot. OK. We'll come back to that in a sec.

So now, we want to consider two possible best-fit lines. I have equations for both of them here. So y hat equals 0.5 plus 0.5x. And y hat equals negative 1 plus 1x. So let's graph each equation on the previous scatter plot.

So I'll do the first one here in green. So it's 0.5 plus 0.5x. So that means it has a y-intercept of 0.5 and a slope of 0.5. So the y-intercept of 0.5 is going to be here. And then if it has a slope of 0.5, that means for every one unit we go over, we go a half a unit up.

So there'd be a point here. And then if we went over one more and up another half, there'd be another point here. I only need two points to make a line, but I can use all three here. So this would be the first possible best-fit line.

And the second one was y hat equals negative 1 plus 1x. So the y-intercept is negative 1. The slope is 1. So the y-intercept of negative 1 would be here.

I won't plot a point there, but we could imagine, then, going-- a slope of 1 would be over one and up one. So the change in x is 1. Change in y is 1. So we'd know we'd have a point here.

Another point-- you would go over one, up one, over one, up one. So we're going to get a line that looks something like that. And I will connect the points with the ruler again.
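The graphing steps above can be sketched in a few lines of Python. This is only an illustration: the helper name `line_points` and the choice of x values 0, 1, and 2 are assumptions made here, not part of the tutorial.

```python
def line_points(intercept, slope, xs):
    """Return (x, y hat) pairs on the line y hat = intercept + slope * x."""
    return [(x, intercept + slope * x) for x in xs]

# First candidate (green): y hat = 0.5 + 0.5x
green = line_points(0.5, 0.5, [0, 1, 2])
# Second candidate (red): y hat = -1 + 1x
red = line_points(-1, 1, [0, 1, 2])

print(green)  # [(0, 0.5), (1, 1.0), (2, 1.5)]
print(red)    # [(0, -1), (1, 0), (2, 1)]
```

Each step of one unit in x raises the green line by 0.5 and the red line by 1, which matches the slopes read off the equations.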

OK. So to kind of evaluate these two lines, I would say the green one seems to do a pretty good job. We have two points below it, three points above it. The three points above it are a little closer to the line than the two points below it.

A good thing about the red line-- for one, it goes through two points exactly, so that's good. It has two points below, one point above. I would say one downfall here is that this point here is pretty far from the line.

So they're both pretty good. But we want to kind of come up with some criteria for determining which of these two is the better of the two lines. All right.

So a good best-fit line should minimize residuals, so let's find the residuals for each line. So what I basically have here is a table with all of the points here. I have a column for the y hat values, a column for the y minus y hat values, so the residuals. And then we'll use that column in a sec.

So the first thing I want to do is determine my y hat values. So what I'm going to do is take my x's, plug them into this y hat equation. So if I take 2, substitute it in here for x, 0.5 times 2 is 1 plus 0.5 is 1.5. For 3, 0.5 times 3 is 1.5 plus 0.5 is 2.

For 5, 0.5 times 5 is 2.5, plus 0.5 is 3. For 4, 0.5 times 4 is 2, plus 0.5 is 2.5. And the last 5 we already calculated, so that's also 3. OK.

So now my y minus y hat values. I just need to take each y and subtract its y hat. So 2 minus 1.5 is 0.5. 1 minus 2 is negative 1. 2 minus 3 is negative 1. 3 minus 2.5 is 0.5. 4 minus 3 is 1.

And let's do the same thing for my second equation here. So plug 2 in for x. So 1 times x-- or 1 times 2 is 2 plus negative 1 is 1. 3 plus negative 1 is 2. 5 plus negative 1 is 4. 4 times 1 is 4 plus negative 1 is 3. And 5 is also 4 there.

Now y minus y hat. This is positive 1, negative 1, negative 2, 0, and 0. OK.
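The residual columns worked out above can be reproduced with a short Python sketch. The data points are the ones read off the spoken arithmetic (x values 2, 3, 5, 4, 5 with y values 2, 1, 2, 3, 4), and the function name `residuals` is an assumption made for this illustration.

```python
# Data as read off the walkthrough; assumed to match the on-screen table.
xs = [2, 3, 5, 4, 5]
ys = [2, 1, 2, 3, 4]

def residuals(intercept, slope, xs, ys):
    """Return y - y hat for each point, where y hat = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

first = residuals(0.5, 0.5, xs, ys)   # y hat = 0.5 + 0.5x
second = residuals(-1, 1, xs, ys)     # y hat = -1 + 1x

print(first)   # [0.5, -1.0, -1.0, 0.5, 1.0]
print(second)  # [1, -1, -2, 0, 0]
```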

So what we do now-- so we want to minimize the residuals. But a lot of times when you add up the residuals, you're going to end up with a sum equal to 0. So we can see that in this case, if we add these up, 0.5 plus negative 1 plus negative 1 plus 0.5 plus 1, we get zero.

These values here will actually add up to negative 2: 1 plus negative 1 plus negative 2 plus 0 plus 0. So we can see that the residuals for this line don't cancel out.

A lot of times-- there's a special characteristic of best-fit lines. If they go through a certain point, the sum of the residuals will always equal zero.
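The tutorial does not name the "certain point." A standard fact is that the residuals of a line sum to zero exactly when the line passes through the point of means (x bar, y bar), so the sketch below assumes that is the point meant. The helper names `residual_sum` and `passes_through_mean` are made up for this illustration, and the data is the same five points used in the tables.

```python
import math

xs = [2, 3, 5, 4, 5]
ys = [2, 1, 2, 3, 4]

x_bar = sum(xs) / len(xs)  # mean of x
y_bar = sum(ys) / len(ys)  # mean of y

def residual_sum(intercept, slope):
    """Sum of y - y hat over all points for y hat = intercept + slope * x."""
    return sum(y - (intercept + slope * x) for x, y in zip(xs, ys))

def passes_through_mean(intercept, slope):
    """True if the line's prediction at x bar equals y bar (within rounding)."""
    return math.isclose(intercept + slope * x_bar, y_bar)

print(residual_sum(0.5, 0.5), passes_through_mean(0.5, 0.5))  # 0.0 True
print(residual_sum(-1, 1), passes_through_mean(-1, 1))        # -2 False
```

The first line passes through the point of means and its residuals cancel to zero. The second line misses that point and its residuals sum to negative 2.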

So a lot of times what we'll do instead is calculate squared residuals. So if we square residuals, they can never be negative, so that when we add them, the positives and negatives can't cancel each other out. So really, we're just going to square each of these numbers.

So if we square 0.5, we get 0.25. If we square negative 1, we get 1. Then 1, 0.25, and 1. OK. And now if I square the residuals here, this is 1, 1, 4, 0, and 0. All right.

So what we can do now is take the sum of the squared residuals. So 0.25, 1, 1, 0.25, 1-- that's 3.5 for the first line. And for the second line, y minus y hat squared, that's going to be 1, 1, 4, 0, 0. So that's going to be 6.
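The two totals above can be checked with a small Python sketch. The function name `sum_squared_residuals` is an assumption for this example, and the data is the same five points used in the tables.

```python
xs = [2, 3, 5, 4, 5]
ys = [2, 1, 2, 3, 4]

def sum_squared_residuals(intercept, slope):
    """Sum of (y - y hat)^2 for the line y hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

print(sum_squared_residuals(0.5, 0.5))  # 3.5
print(sum_squared_residuals(-1, 1))     # 6
```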

So since the first line, the green one, has a smaller sum of the squared residuals, 3.5 compared to 6, it will fit the data better than the second line, the red one. So the first line, I would say, is a better fit. OK.

So the least-squares line is a best-fit line that is found through the process of minimizing the sum of the squared residuals. So in the last two examples, we just found which one has the smaller sum of the squared residuals. When you're calculating what's called the least-squares line, there's actually a process that will allow you to find the line that has the least squares. We're not going to go through that process. But I just want to make sure that you understand that what we're doing when we're finding a least-squares line is finding the minimum value of the sum of the squared residuals. All right.
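The tutorial skips the process for finding the least-squares line. For readers who want to see the result on this data set, here is a minimal sketch using the standard closed-form formulas: the slope is the sum of (x minus x bar)(y minus y bar) divided by the sum of (x minus x bar) squared, and the intercept is y bar minus the slope times x bar. This is one common way to compute the line, not necessarily the process the tutorial has in mind, and the function name `least_squares_line` is an assumption.

```python
xs = [2, 3, 5, 4, 5]
ys = [2, 1, 2, 3, 4]

def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = least_squares_line(xs, ys)
# Rounded to hide floating-point noise.
print(round(intercept, 6), round(slope, 6))  # 0.5 0.5
```

On these five points the formulas give an intercept and slope of about 0.5 each, so the first candidate line, y hat equals 0.5 plus 0.5x, happens to be the least-squares line for this data.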

And a least-squares line is the most common type of best-fit line. So when you usually hear that term best-fit line or regression line, generally they're talking about the least-squares line.

So that has been the tutorial on the least-squares line. Thanks for watching.