In this tutorial, you're going to learn about multiple regression. Specifically, you will focus on how multiple regression allows you to predict a response based on more than one explanatory variable, as long as those explanatory variables are independent of one another.
Multiple Regression
Using more than one explanatory variable to predict the value of the response variable.
In many school districts, teacher salaries are dependent on two variables-- years of experience and number of postgraduate hours accumulated.
It's possible that a teacher with a lot of years of experience might not have a whole lot of postgrad hours. It's also possible that someone with a lot of postgrad hours doesn't have a whole lot of experience. Suppose those three variables-- salary, years of experience, and postgrad hours-- are listed in the table below for Mr. Backman, Mr. Jones, Ms. Nordstrom, Mr. Osters, and Ms. Williams.
We could do a linear regression for salary versus years, which predicts a starting salary of $31,164 for someone with no years of experience. And for every additional year that a person works, they are predicted to make an additional $2,685 on average.
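The prediction equation for this first model, salary from years of experience alone, can be written out directly from the stated intercept and slope. A minimal sketch in Python (the function name is just for illustration):

```python
# Model A from the tutorial: salary predicted from years of experience.
# Intercept of $31,164 and slope of $2,685 per year, as stated above.
def predict_salary_from_years(years):
    return 31164 + 2685 * years

print(predict_salary_from_years(0))   # 31164: the predicted starting salary
print(predict_salary_from_years(10))  # 58014: ten years of experience
```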
If you look at the r-squared for this, it's fairly high at 0.83, so it's clear there's some association between salary and years. You could also do a linear regression for salary versus hours. The r-squared here isn't as high, so there's a little bit less of an association between postgraduate hours and salary. The equation is also a little different: the predicted starting salary for someone with no postgrad hours is $31,384, and for each additional postgrad hour, the salary is predicted to increase by about $409.
This is a little bit weaker of an association than the one with years, but there still is an association there.
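Like the first model, this hours-only model can be sketched directly from its stated intercept and slope (again, the function name is just for illustration):

```python
# Model B from the tutorial: salary predicted from postgraduate hours.
# Intercept of $31,384 and slope of about $409 per hour, as stated above.
def predict_salary_from_hours(hours):
    return 31384 + 409 * hours

print(predict_salary_from_hours(0))   # 31384: the predicted starting salary
print(predict_salary_from_hours(20))  # 39564: twenty postgrad hours
```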
For multiple regression, if those explanatory variables are independent, then you can run a single regression using both of them.
The predicted salary will have a constant term, a coefficient for the number of years the teacher has worked, and a coefficient for the number of postgrad hours that teacher has accumulated. Look at the r-squared value: it's higher than that of either individual linear regression. Every time you add an independent variable, the r-squared will increase, because more of the variability in salary is explained by the additional variable.
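The tutorial doesn't reproduce the multiple-regression coefficients or the data table, so the numbers below are hypothetical; this sketch only shows the mechanics of fitting a constant plus two coefficients by least squares and computing r-squared:

```python
import numpy as np

# Hypothetical (years, hours, salary) data -- NOT the tutorial's table,
# which is not reproduced in this text.
years  = np.array([1.0, 4.0, 7.0, 12.0, 20.0])
hours  = np.array([30.0, 10.0, 45.0, 5.0, 60.0])
salary = np.array([36000.0, 42000.0, 58000.0, 63000.0, 92000.0])

# Design matrix with an intercept column: salary ~ b0 + b1*years + b2*hours
X = np.column_stack([np.ones_like(years), years, hours])
b, *_ = np.linalg.lstsq(X, salary, rcond=None)

# r-squared: the fraction of variability in salary explained by the model
fitted = X @ b
ss_res = float(np.sum((salary - fitted) ** 2))
ss_tot = float(np.sum((salary - salary.mean()) ** 2))
r_squared = 1 - ss_res / ss_tot

print(b)          # [b0, b1, b2]: constant, years coefficient, hours coefficient
print(r_squared)
```

With an intercept in the model, r-squared always lands between 0 and 1, and adding another column to `X` can only hold it steady or push it up, which is the behavior described above.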
Look how well these models did.
These lists indicate the residuals for each model-- that is, how far off each model was in predicting each teacher's salary. Model A underpredicted Mr. Backman's salary by nearly $4,000 and overpredicted Mr. Jones' salary by about $2,800. In model B, the residuals are pretty big: Mr. Jones' salary was underpredicted by nearly $8,000, and Ms. Nordstrom's salary was overpredicted by about $5,000. In model C, the residuals are, on average, much smaller than those of model A or model B. Only one teacher had a better prediction from model A or model B than from model C. Overall, model C, the one from multiple regression, is the most accurate model we have.
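A residual is simply the actual value minus the predicted value, so its sign tells you the direction of the error. The salary figures below are hypothetical stand-ins, not the tutorial's exact numbers:

```python
# Residual = actual salary - predicted salary.
# Positive residual -> the model underpredicted; negative -> overpredicted.
def residual(actual, predicted):
    return actual - predicted

# Hypothetical figures for illustration only:
print(residual(52000, 48200))  # 3800: an underprediction of $3,800
print(residual(45000, 47800))  # -2800: an overprediction of $2,800
```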
Multiple regression allows us to use more than one explanatory variable to predict the response, provided those explanatory variables are independent. It allows some variables to have a larger effect on the response than others while still showing what those effects are, and it explains more of the variation in the response, increasing the r-squared value. By adding a second explanatory variable independent of the first, or a third independent of the first two, and so on, the value of r-squared will increase.
Good luck.
Source: This work adapted from Sophia Author Jonathan Osters.