Source: Graphs created by Jonathan Osters
This tutorial is going to teach you about outliers and influential points. Now, you may have heard the term "outliers" before when we were talking about univariate data. But in bivariate data, outliers are something a little bit different. So let's take a look at these two data sets.
By this point, you should automatically think to graph your data, graph your data, graph your data. Now, one thing that you might realize is that these seem kind of all over the place, whereas in these, all the x's except one are 8, which might be a dead giveaway to something. But take a look.
They have the same mean for the x's-- 9. They have the same mean for the y's-- around 7.50. Their standard deviation for the x's is the same. Their standard deviation for the y's is the same at 2.03. And their correlations are the same at 0.816 in the positive direction.
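The summary values quoted here can be checked numerically. The sketch below assumes the two data sets are the third and fourth sets of Anscombe's quartet, which is consistent with the numbers stated above but is not named in the tutorial. The variable names and the `pearson_r` and `summarize` helpers are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

# Assumed data: sets III and IV of Anscombe's quartet (not named in the tutorial).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def summarize(xs, ys):
    """Collect the five measures the tutorial compares (sample standard deviations)."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "sd_x": stdev(xs),
        "sd_y": stdev(ys),
        "r": pearson_r(xs, ys),
    }


summary3 = summarize(x3, y3)
summary4 = summarize(x4, y4)

for name, summary in (("first set", summary3), ("second set", summary4)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

Under that assumption, the two printed rows should agree to about two decimal places, even though the scatterplots look nothing alike.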
So based on all that information, you would think that these two graphs look pretty similar. The problem is, they don't. They both have what are called influential points that are changing a lot of the values. So an influential point is a point whose presence substantially changes a statistical measure. Usually the measure that we're talking about changing is correlation, but it also could affect any of those other measurements that we put on the previous page as well-- the mean of x or y and the standard deviation of x or y.
So, for instance, this scatterplot with this point has a correlation coefficient of 0.816, versus without it, the correlation coefficient is 1. The points line up exactly on the line. Conversely, if we look over here, this point is influential, and it changes all of these values substantially.
With it, the mean of x is 9, the standard deviation of x is 3.3, and the correlation coefficient is 0.816. Without it, the mean of x becomes 8 because all the x-values are 8. The standard deviation is 0 because they never deviate from 8, and the correlation is 0. So it changes all of these measures very substantially by being there. That point is certainly influential.
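A quick way to see this kind of influence is to recompute the measures with and without the suspect point. The sketch below again assumes the data are sets III and IV of Anscombe's quartet (an assumption, since the tutorial does not name its data), and the `without` helper is hypothetical. Strictly speaking, when every x-value is identical the Pearson formula divides by zero, so this sketch reports the correlation as undefined (`None`) rather than 0. Either way, there is no linear association left to measure.

```python
from math import sqrt
from statistics import mean, stdev

# Assumed data: sets III and IV of Anscombe's quartet (not named in the tutorial).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson_r(xs, ys):
    """Pearson r, or None when either variable has zero variance."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = sqrt(sxx * syy)
    return sxy / denom if denom > 0 else None


def without(xs, ys, point):
    """Return copies of xs and ys with every occurrence of `point` dropped."""
    pairs = [(x, y) for x, y in zip(xs, ys) if (x, y) != point]
    return [x for x, _ in pairs], [y for _, y in pairs]


# Drop the one point that sits away from the rest in each set.
x3_rest, y3_rest = without(x3, y3, (13, 12.74))
x4_rest, y4_rest = without(x4, y4, (19, 12.50))

r3_with, r3_without = pearson_r(x3, y3), pearson_r(x3_rest, y3_rest)
r4_with, r4_without = pearson_r(x4, y4), pearson_r(x4_rest, y4_rest)

print("first set:  r with =", round(r3_with, 3), " r without =", round(r3_without, 3))
print("second set: r with =", round(r4_with, 3), " r without =", r4_without)
print("second set: mean of x without =", mean(x4_rest), " sd of x without =", stdev(x4_rest))
```

Removing a single observation is enough to move the first correlation to essentially 1 and to collapse the spread of x in the second set entirely.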
Other types of points that we're going to talk about are called outliers. Outliers deal with the form of the data set. Now, there's no hard-and-fast established definition to outliers. The one that I like to use deals with the form of the data set. So these are ones that substantially deviate from the pattern, from the form established by the remainder of the points.
Typically they are also influential points, but they don't have to be influential points by this definition. So these two influential points are, in fact, outliers. The form of the remainder of the points is a strong linear form this way. And with it, it diminishes that. The form of these points is a vertical line essentially, and with this point, it very much diminishes that.
So using that definition, we can talk about other examples of outliers. The thing about outliers is that they might be outliers in one direction but not the other. Both of these circled points are, in fact, outliers on the scatterplot because they don't fit the overall trend.
Now, if you look at this one, it's an outlier in the y-direction because it's so much higher than the other points. But it's sort of in the middle of the pack regarding the x's. So it's an outlier in the y-direction but not the x-direction.
This one is an outlier in the x-direction because it's so much further to the right of the other pack of points but not in the y-direction. If you look horizontally, it's sort of in the middle lower part of the y-direction. So it's an outlier in one direction but not the other. And in this case, because neither of them fit the overall trend provided by the other points, both of them are outliers.
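One way to make "outlier in the x-direction" and "outlier in the y-direction" concrete is to apply a univariate rule to each coordinate separately. The sketch below uses the common 1.5 × IQR fence, which is an assumption, since the tutorial does not commit to a rule. It also uses sets III and IV of Anscombe's quartet as stand-in data, because the circled plots are not reproduced here. The `iqr_outliers` helper is hypothetical.

```python
from statistics import quantiles

# Stand-in data (assumption): sets III and IV of Anscombe's quartet.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Check each direction on its own.
x3_flags, y3_flags = iqr_outliers(x3), iqr_outliers(y3)
x4_flags, y4_flags = iqr_outliers(x4), iqr_outliers(y4)

print("set III: extreme x =", x3_flags, " extreme y =", y3_flags)
print("set IV:  extreme x =", x4_flags, " extreme y =", y4_flags)
```

With this particular rule and quartile method, the unusual point in set III is flagged in the y-direction only, and the unusual point in set IV in the x-direction only. A different fence or quartile convention could move a borderline value across the line, which is one reason there is no single agreed definition.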
Now, an outlier might not be an outlier in either the x- or the y-direction so long as it doesn't fit the overall trend established by the rest of the data. Here it's a curve. And so this point here in the middle that doesn't fit that curve will be an outlier, as will that point way out there. There's no real relationship to the other points, and so it's just an outlier in both the x- and the y-direction. And then here, it's again an outlier in neither the x- nor the y-direction. But because the rest of the points follow this curved pattern and it's sort of out here in the middle of nowhere, that's going to be an outlier.
Some of those are influential, and some of those are not. Neither of these are going to have a great effect on the correlation or the least squares regression line that these data sets create. In these two cases, a line is an inappropriate model. But if you did make a line, having this point versus removing this point wouldn't affect that line or the correlation very much.
The same thing with this one. The correlation wouldn't change very much because the correlation already is very near to zero. This point, on the other hand, is influential. The correlation will increase from nearly zero without it to a positive correlation coefficient with it.
And so to recap, important points on a scatterplot are influential points and also outliers. Influential points substantially change at least one statistical measure. Outliers simply are points that deviate from the overall form of the rest of the points. They may be outliers in the x- or y-direction, but don't have to be, according to this definition. Be aware that different people use different definitions of outliers for scatterplots so there's not one hard-and-fast definition. So we talked about influential points and outliers.
Good luck, and we'll see you next time.
An influential point is an observation that, if removed, significantly changes a statistical measure.
In a scatter plot, an outlier is an observation that has an extreme x value, an extreme y value, both an extreme x and y, or is well away from the main trend of points.