### Online College Courses for Credit


# Multiple Regression

Author: Jonathan Osters
##### Description:

Identify a situation that uses multiple regression.



Tutorial

Source: Tables created by Jonathan Osters

## Video Transcription

In this tutorial, you're going to learn about multiple regression. Multiple regression is going to allow us to predict a response based on more than one explanatory variable, although those explanatory variables have to be independent of each other. So in many school districts, teacher salaries are dependent on two variables-- years of experience and number of postgraduate hours accumulated.

And it's possible that a teacher with a lot of years of experience might not have a whole lot of postgrad hours. And it's possible that someone with a lot of postgrad hours doesn't have a whole lot of experience. So suppose those three variables-- salary, years of experience, and postgrad hours-- are listed in the table below for Mr. Backman, Mr. Jones, Ms. Nordstrom, Mr. Osters, and Ms. Williams.

So we could do a linear regression for salary versus years, which would predict a starting salary of \$31,164 for someone with no years of experience. And for every additional year that a person works, they are predicted to make an additional \$2,685 on average. And if you look at the r-squared for this, it's fairly high at 0.83.

So it's clear there's something of an association here between salary and years. We could also do a linear regression for salary versus hours. The r-squared here isn't as high, so there's a little bit less of an association between postgraduate hours and salary. The equation is a little bit different, too: the predicted starting salary for someone with no postgrad hours would be \$31,384, and for each additional postgrad hour, the salary is predicted to increase by about \$409.
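Written out, the two single-variable models quoted above are just linear equations. Here is a small sketch using the coefficients from the transcript (the underlying table data are not reproduced in this tutorial, so the example inputs are made up):

```python
def predict_salary_model_a(years):
    """Model A: salary predicted from years of experience alone."""
    return 31164 + 2685 * years

def predict_salary_model_b(hours):
    """Model B: salary predicted from postgraduate hours alone."""
    return 31384 + 409 * hours

# A hypothetical teacher with 10 years of experience and 40 postgrad hours:
print(predict_salary_model_a(10))  # 58014
print(predict_salary_model_b(40))  # 47744
```

Notice the two models give noticeably different predictions for the same teacher, which is part of the motivation for combining the variables into one model.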

And this is a little bit weaker of an association than the one with years. But there still is an association there. What we can do with multiple regression is, if those variables are independent, then we can do a regression on both variables. And so the predicted salary is going to have some part that has a constant, some coefficient for the number of years that the teacher has, and some coefficient for the number of postgrad hours that that teacher has accumulated.
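A multiple regression of this form can be fit by ordinary least squares on a design matrix with a constant column plus one column per explanatory variable. A minimal sketch with numpy, using hypothetical data (the tutorial's actual table values are not reproduced here):

```python
import numpy as np

# Hypothetical data for five teachers: years of experience,
# postgrad hours, and salary. These are illustrative values only.
years = np.array([3.0, 6.0, 10.0, 14.0, 20.0])
hours = np.array([12.0, 30.0, 18.0, 45.0, 60.0])
salary = np.array([39000.0, 49000.0, 57000.0, 72000.0, 88000.0])

# Design matrix with a constant column: salary ~ b0 + b1*years + b2*hours
X = np.column_stack([np.ones_like(years), years, hours])
(b0, b1, b2), *_ = np.linalg.lstsq(X, salary, rcond=None)

predicted = b0 + b1 * years + b2 * hours
print(np.round(predicted))
```

The fitted coefficients play exactly the roles described above: a constant, a coefficient on years, and a coefficient on postgrad hours.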

And look at the r-squared value. It's higher than either of the two individual linear regressions. Every time you add another explanatory variable-- if we had added a third one here, for example-- this r-squared would continue to go up. It can never decrease when you add another variable, because an additional variable can only explain more of the variability in salary, never less.

So now let's take a look at how well these models did. These lists here indicate the residuals for each model. That's how far off each model was in predicting the teacher's salary. It looked like model A underpredicted Mr. Backman's salary by nearly \$4,000 and overpredicted Mr. Jones' salary by about \$2,800.

If you look at model B, these residuals are pretty big. Mr. Jones' salary was underpredicted by nearly \$8,000 in model B. And Ms. Nordstrom's salary was overpredicted by over \$5,000.

If you look at model C, on average, these residuals are much smaller than those of model A or model B. There was only one teacher whose salary was better predicted by model A or model B than by model C. Overall, model C, the one from multiple regression, is the most accurate model that we have.
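One simple way to make this comparison concrete is to average the absolute residuals of each model. A sketch using hypothetical residuals (the tutorial's exact residual lists are not reproduced here; positive values mean the model underpredicted):

```python
# Hypothetical residuals (actual - predicted salary, in dollars) for the
# five teachers under each model. Illustrative values only.
residuals = {
    "A": [3900, -2800, 1500, -1200, 600],
    "B": [-2000, 7900, -5100, 3000, -1500],
    "C": [400, -600, 300, -200, 150],
}

mean_abs = {m: sum(abs(x) for x in r) / len(r) for m, r in residuals.items()}
for model, value in mean_abs.items():
    print(model, value)
```

With numbers like these, model C's average miss is far smaller than either single-variable model's, which is the sense in which it is "the most accurate."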

And so to recap. Multiple regression is going to allow us to use more than one explanatory variable to predict the response. Those explanatory variables are going to have to be independent, of course. And this allows for certain variables to have a larger effect on the response than others, but still shows what those effects are and allows us to explain more of the variation in the response, increasing the r-squared value.

So by adding a second explanatory variable independent of the first, or a third independent of the first two, et cetera, et cetera, the value of r-squared will increase. And so we talked about multiple regression. Good luck. And we'll see you next time.
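The recap's claim about r-squared can be checked numerically: in least squares, adding another column to the design matrix can only shrink (or leave unchanged) the residual sum of squares, so r-squared never goes down. A quick sketch with numpy and made-up random data:

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = resid @ resid
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return 1 - ss_res / ss_tot

# Random data just to illustrate; any data set behaves this way.
rng = np.random.default_rng(42)
y = rng.normal(size=30)
x1 = rng.normal(size=30)
x2 = rng.normal(size=30)
ones = np.ones(30)

r2_one = r_squared(np.column_stack([ones, x1]), y)
r2_two = r_squared(np.column_stack([ones, x1, x2]), y)
print(r2_two >= r2_one - 1e-12)  # True: extra predictor never lowers R-squared
```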

Terms to Know
Multiple Regression

Using more than one explanatory variable to predict the value of the response variable.
