Source: Graphs created by Jonathan Osters
This tutorial is going to teach you about linear equations. Now, this might be something that you may have learned in an algebra class, but it's always a good idea to review it anyway. As we fit lines to data, we're going to try and figure out the equations of those lines. So you may recall the following equation from your algebra courses, y equals mx plus b. And y and x are variables. We recognize x as our explanatory variable and y as our response variable. But the other two values are numbers, and they represent something.
The value of m is called the slope. And the slope is a rate of change. So you may have heard several terms for rates of change, like miles per hour or meters per second or miles per gallon in a car. What it means is that an increase of 1 in x corresponds to an increase of m in y. So if this was, say, 30 miles per gallon, that means that an increase of one gallon would correspond to an increase of 30 miles that you could travel. On the other hand, this b here is called the y-intercept. It's the value of y when x is 0. So the line will pass through (0, b), whatever this number b is. And again, the x is the explanatory variable and the y is our response variable.
So let's try and take a look at what these numbers actually look like on a graph. Remember, slope is the increase in y that corresponds to an increase of 1 in x. So here's an increase of 1 in x, going from 1 to 2. What we have to do is find the points that are on the line, and figure out by how much vertically this went up between 1 and 2. So this was an increase of 1 in x, which means this is an increase of m in the y direction, an increase of that value of the slope. So maybe it's 5 or 20 or one-fifth. And then the line here, as it passes through the y-axis, passes through (0, b): 0 because we're not going to the right or left, and then up b.
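If it helps to see those two ideas as arithmetic, here is a minimal Python sketch. The function name line and the values m = 5 and b = 3 are made up for illustration; they don't come from the graph.

```python
def line(x, m, b):
    """Return y for the line y = m*x + b."""
    return m * x + b

# Hypothetical values chosen only for illustration.
m, b = 5, 3

# Slope: how much y goes up when x goes from 1 to 2 (an increase of 1 in x).
rise = line(2, m, b) - line(1, m, b)

# y-intercept: the value of y when x is 0.
y_at_zero = line(0, m, b)

print(rise)       # 5, the same as m
print(y_at_zero)  # 3, the same as b
```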
In statistics, we do change it up a little bit. Instead of y, we call it y hat. Y hat is the notation for our prediction. We have values of y that are not predictions. They're actual data points. But because we're doing a best fit, this is our best guess at the value of y, which is the prediction. So anything with a hat is called a prediction. And then for some reason, we switch up the names here. The slope becomes called b1 and the intercept becomes called b0. You might also see a plus bx or ax plus b as well. If you're using a calculator you might even see both of them. They're not any different from each other. Only the letter names change, so just check which letter is multiplying the x. That one is the slope.
Again, the slope is b1. The y-intercept is b0. So suppose that you have a trend line and the equation is y hat equals 12 plus 0.5x. So what's the predicted y-coordinate? What is the y hat when x equals 20? You could draw this onto a graph and figure it out. You can also do this algebraically. We know that the predicted y is 12 more than half of the x. And we know what the x is. So we can just figure it out directly from there-- 12 more than half of 20 is 12 more than 10 which is 22.
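That same calculation can be written as a short Python sketch. The function name predict is just a name chosen here for illustration.

```python
def predict(x, b0=12, b1=0.5):
    """Predicted y (y hat) for the trend line y hat = b0 + b1 * x."""
    return b0 + b1 * x

y_hat = predict(20)
print(y_hat)  # 22.0
```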
How about this one? This trend line, I don't know its equation, but I know that it passes through (4, 500) and (12, 900). So what's the equation of that line? Well, we need the two pieces of information. We need the slope and we need the y-intercept. First, we're going to find the slope. We can see visually that from (4, 500) to (12, 900), it went up 8 in the x direction. But it also-- based on the scaling it doesn't look like very much-- but it actually went up 400 in the y direction. So a change of 400 in the y direction divided by a change of 8 in the x direction means that for every 1 increase in the x direction, it actually went up 50 in the y. And so the slope is 50.
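The slope step is just the change in y divided by the change in x. As a minimal Python sketch, with variable names made up for illustration:

```python
# The two points the trend line passes through.
x1, y1 = 4, 500
x2, y2 = 12, 900

# Slope = change in y divided by change in x.
change_in_y = y2 - y1   # 400
change_in_x = x2 - x1   # 8
slope = change_in_y / change_in_x

print(slope)  # 50.0
```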
To figure out the y-intercept, we take one of our (x, y) pairs. It doesn't matter if we choose (12, 900) or (4, 500), they'll give us the same answer. So I chose (12, 900). And we're going to put 12 temporarily in for x and 900 temporarily in for y hat, and solve it algebraically for b0. So 900 equals b0 plus 50 times 12. 50 times 12 is 600. Subtract it over, you get 300. Put it all together: b1 is 50, b0 is 300. And so the equation is y hat equals 300 plus 50x.
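To check that either point really does give the same intercept, here is a short Python sketch. The variable names are made up for illustration.

```python
b1 = 50  # slope found from the two points

# Solve y hat = b0 + b1 * x for b0, using each point in turn.
b0_from_first = 900 - b1 * 12   # using (12, 900)
b0_from_second = 500 - b1 * 4   # using (4, 500)

print(b0_from_first)   # 300
print(b0_from_second)  # 300

# The finished line y hat = 300 + 50x passes through both points.
b0 = b0_from_first
print(b0 + b1 * 12)  # 900
print(b0 + b1 * 4)   # 500
```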
Now one thing that's important to note is that the best fit line actually will change if you switch the explanatory and response variables. That's why it's actually important to choose at the beginning which one is the explanatory versus which one is the response variable. Because if you take a look down here, slope is a rate of change. So miles per gallon, for instance, would be the rate of change in this particular example. Versus in this example over here, if you switched to put gallons on the y-axis and miles on the x-axis, the rate of change here would actually be measuring gallons per mile, which is a different number.
If a car is getting an average of, say, 20 miles per gallon, that will actually only be one-twentieth of a gallon per mile. So it's a different line. One thing that's important to note though is the value of the correlation coefficient is going to be the same for each of these two graphs, but the line itself is different, and that's why we need to end up choosing which one is the explanatory versus which one is the response.
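To see the switch in action, here is a rough Python sketch. It assumes the best fit line means the usual least-squares line, which this tutorial hasn't spelled out. The gallons and miles numbers are invented for illustration, and fit_line and correlation are helper names made up here.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def correlation(xs, ys):
    """Correlation coefficient r between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented data, for illustration only.
gallons = [1, 2, 3, 4, 5]
miles = [22, 38, 63, 79, 98]

# Gallons as explanatory, miles as response: slope is in miles per gallon.
b0_mpg, b1_mpg = fit_line(gallons, miles)

# Switched: miles as explanatory, gallons as response: slope is in gallons per mile.
b0_gpm, b1_gpm = fit_line(miles, gallons)

r_original = correlation(gallons, miles)
r_switched = correlation(miles, gallons)

print(b0_mpg, b1_mpg)
print(b0_gpm, b1_gpm)
print(r_original, r_switched)
```

Running it should show two different slopes and intercepts, while the two correlation values agree.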
And so to recap, fitting a line to data points on a scatter plot requires a little bit of algebraic savvy. There are two parts to the equation of a line, the slope and the y-intercept. The slope is a rate of change. It's how quickly the response variable y changes when the explanatory variable x increases by 1. And the y-intercept is the value of y when x equals 0.
In actual practice, we're going to want to put variable names and units attached to these, not just X's and Y's. And we'll do that more as the tutorials progress. But for now, this algebraic review should help you understand more about trend lines, slope, and the y-intercept of a line. Good luck. And we'll see you next time.