This tutorial is going to teach you about outliers and influential points on scatterplots.
You may already know the term "outliers" from univariate data. But in bivariate data, outliers are something a little bit different.
In a scatterplot, outliers are points that deviate substantially from the overall form of the remainder of the data points.
Let's take a look at these two data sets. One thing that you might notice is that the x-values on the left seem to be all over the place, whereas on the right, all the x's except one are 8, which might be a dead giveaway to something.
But they have the same mean for the x's: 9. They have the same mean for the y's: around 7.50. Their standard deviation for the x's is the same. Their standard deviation for the y's is the same at 2.03. And their correlations are the same at 0.816 in the positive direction.
Based on that information, one would think that these two graphs look pretty similar.
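To make the comparison concrete, here is a minimal sketch that computes those five summary values. The text does not list its data, so the numbers below are an assumption: they are sets III and IV of Anscombe's quartet, which reproduce the summary values quoted above. The helper names `pearson_r` and `summarize` are made up for this illustration.

```python
from math import sqrt
from statistics import mean, stdev

# ASSUMPTION: the tutorial does not list its data. These are sets III and IV
# of Anscombe's quartet, which match the summary values quoted in the text.
x_left = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_left = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x_right = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y_right = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson_r(xs, ys):
    """Pearson correlation; returns nan if either variable has no spread."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def summarize(xs, ys):
    """Collect the five summary values, rounded to two decimals for display."""
    return {
        "mean_x": round(mean(xs), 2),
        "mean_y": round(mean(ys), 2),
        "sd_x": round(stdev(xs), 2),
        "sd_y": round(stdev(ys), 2),
        "r": round(pearson_r(xs, ys), 2),
    }


left_summary = summarize(x_left, y_left)
right_summary = summarize(x_right, y_right)
print(left_summary)
print(right_summary)
```

Under that assumption, the two summaries agree at two decimal places even though the scatterplots look nothing alike, which is why the summary numbers alone can be misleading.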
Each of the graphs above contains what is called an influential point, and that point is changing a lot of these values.
An influential point is an observation that, if removed, significantly changes a statistical measure.
Usually the measure we're talking about is the correlation, but an influential point can also affect other measures, such as the mean of x or y and the standard deviation of x or y.
For instance, the scatterplot above on the left has a correlation coefficient of 0.816 with this point, versus a correlation coefficient of 1 without it. The remaining points line up exactly on the line. Conversely, if we look at the one above on the right, this point is influential, and it changes all of these values substantially. With it, the mean of x is 9, the standard deviation of x is 3.3, and the correlation coefficient is 0.816. Without it, the mean of x becomes 8 because all the x-values are 8. The standard deviation of x is 0 because the x-values never deviate from 8, and the correlation is 0 (strictly speaking, it is undefined, since x no longer varies). So it changes all of these measures very substantially by being there. That point is certainly influential.
Without this point, the form of the data is essentially a vertical line, and including the point very much obscures that form.
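The "with it versus without it" comparison can be sketched as a small leave-one-out check. As before, the data are an assumption (sets III and IV of Anscombe's quartet, which match the values quoted in the text), and `pearson_r` and `without` are hypothetical helper names. Note that when every x-value is 8 the correlation formula divides by zero, so this sketch reports `nan` in that case where the text speaks of a correlation of 0.

```python
from math import isnan, sqrt
from statistics import mean, stdev

# ASSUMPTION: sets III and IV of Anscombe's quartet stand in for the two
# scatterplots; the tutorial itself does not list the data.
x_left = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_left = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x_right = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y_right = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson_r(xs, ys):
    """Pearson correlation; returns nan if either variable has no spread."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def without(xs, ys, point):
    """Return copies of xs and ys with the first occurrence of `point` removed."""
    pairs = list(zip(xs, ys))
    pairs.remove(point)
    new_xs, new_ys = zip(*pairs)
    return list(new_xs), list(new_ys)


# Left plot: drop the one point that sits far above the line.
r_left_with = pearson_r(x_left, y_left)
xl, yl = without(x_left, y_left, (13, 12.74))
r_left_without = pearson_r(xl, yl)

# Right plot: drop the one point whose x-value is not 8.
r_right_with = pearson_r(x_right, y_right)
xr, yr = without(x_right, y_right, (19, 12.50))
mean_x_right_without = mean(xr)
sd_x_right_without = stdev(xr)
r_right_without = pearson_r(xr, yr)

print("left  r with / without:", round(r_left_with, 3), round(r_left_without, 3))
print("right mean x, sd x without:", mean_x_right_without, sd_x_right_without)
print("right r without is undefined:", isnan(r_right_without))
```

Removing a single observation and recomputing is the most direct way to check the definition of an influential point: if the measure moves a lot, the point is influential.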
A point might be an outlier in one direction but not the other. Both of these circled points are, in fact, outliers on the scatterplot because they don't fit the overall trend. If you look at this one, it's an outlier in the y-direction because it's so much higher than the other y-values, but it is not an outlier in the x-direction.
This one is an outlier in the x-direction because it's so much farther to the right than the rest of the pack of points, but it is not an outlier in the y-direction. If you look horizontally, it sits in the middle-to-lower part of the y-values. It's an outlier in one direction but not the other.
In this case, because neither of them fits the overall trend provided by the other points, both of them are outliers.
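One way to check "outlier in the x-direction" and "outlier in the y-direction" separately is to apply a univariate rule to each coordinate. The sketch below assumes the common 1.5 × IQR rule, which the text does not specify, and uses made-up data: nine points on a line plus one point that is high in y only and one that is far right in x only. The names `is_extreme` and `direction_flags` are hypothetical.

```python
from statistics import quantiles


def is_extreme(value, values, k=1.5):
    """True if `value` lies more than k * IQR outside the quartiles of `values`.

    ASSUMPTION: the 1.5 * IQR rule is one common convention for univariate
    outliers; the tutorial does not say which rule it has in mind.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr


def direction_flags(xs, ys):
    """For each point, return (extreme in x?, extreme in y?)."""
    return [(is_extreme(x, xs), is_extreme(y, ys)) for x, y in zip(xs, ys)]


# Hypothetical data: nine points on the line y = x, then two unusual points.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 25]
ys = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 5]

flags = direction_flags(xs, ys)
for (x, y), (x_flag, y_flag) in zip(zip(xs, ys), flags):
    if x_flag or y_flag:
        print(f"({x}, {y}): extreme in x = {x_flag}, extreme in y = {y_flag}")
```

Each of the two unusual points is flagged in only one direction, yet both count as outliers on the scatterplot because neither follows the trend of the other nine points.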
Now, a point might not be extreme in either the x- or the y-direction and still be an outlier, so long as it doesn't fit the overall trend established by the rest of the data. Here the trend is a curve.
The point in the middle that doesn't fit that curve will be an outlier.
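A point like this can be found by looking at how far each point sits from the curve rather than at its x- or y-value alone. The sketch below uses made-up data (eleven points on the curve y = x², plus one point in the middle that is off the curve) and assumes a quadratic is a reasonable model for the trend. The function `fit_quadratic` is a hypothetical helper that solves the least-squares normal equations by hand.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    # Power sums S[k] = sum(x**k) and right-hand sides T[k] = sum(x**k * y).
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system for the coefficients (a, b, c).
    A = [[S[i + j] for j in range(3)] + [T[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(3):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * p for v, p in zip(A[r], A[col])]
    return tuple(A[i][3] / A[i][i] for i in range(3))


# Hypothetical data: points on the curve y = x**2, plus one point at (0, 15)
# whose x and y are both well inside the range of the other points.
xs = list(range(-5, 6)) + [0]
ys = [x ** 2 for x in range(-5, 6)] + [15]

a, b, c = fit_quadratic(xs, ys)
residuals = [y - (a + b * x + c * x ** 2) for x, y in zip(xs, ys)]

# The point with the largest absolute residual is the one farthest from the curve.
worst = max(range(len(xs)), key=lambda i: abs(residuals[i]))
print("farthest from the curve:", (xs[worst], ys[worst]))
```

The flagged point has an ordinary x-value and an ordinary y-value. It stands out only because of its distance from the curve that the rest of the data follow.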
Some outliers are influential, and some are not. Two of these are not going to have a great effect on the correlation or the least-squares regression line for their data sets. In these two cases, a line is an inappropriate model. But if you did fit a line, including or removing the outlier wouldn't change that line or the correlation very much, because the correlation is already very near zero. The point on the right, on the other hand, is influential. The correlation will increase from nearly zero without it to a positive value with it.
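The difference between an influential outlier and a non-influential one can be sketched with made-up numbers. The cloud below is a 3-by-3 grid of points with a correlation of zero. One added point is extreme in y only and sits at the mean of x, and another is extreme in both x and y. All data and the helper name `r_and_slope` are hypothetical.

```python
from math import sqrt
from statistics import mean


def r_and_slope(xs, ys):
    """Return (Pearson correlation, least-squares slope) for the given points."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy), sxy / sxx


# Hypothetical cloud with no linear association: every (x, y) pair on a grid.
cloud_x = [x for x in (1, 2, 3) for _ in (1, 2, 3)]
cloud_y = [y for _ in (1, 2, 3) for y in (1, 2, 3)]

r_base, slope_base = r_and_slope(cloud_x, cloud_y)

# Outlier A: far above the cloud, but at the mean of x.
r_a, slope_a = r_and_slope(cloud_x + [2], cloud_y + [10])

# Outlier B: far to the right of and far above the cloud.
r_b, slope_b = r_and_slope(cloud_x + [10], cloud_y + [10])

print("cloud only:     r =", round(r_base, 3), " slope =", round(slope_base, 3))
print("with outlier A: r =", round(r_a, 3), " slope =", round(slope_a, 3))
print("with outlier B: r =", round(r_b, 3), " slope =", round(slope_b, 3))
```

In this sketch, outlier A leaves the correlation and the slope essentially at zero, so it is an outlier but not influential. Outlier B pulls the correlation up to a strongly positive value, so it is both an outlier and influential.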
The important points on a scatterplot are influential points and outliers. Influential points substantially change at least one statistical measure. Outliers are simply points that deviate from the overall form of the rest of the points. They may be outliers in the x- or y-direction, but don't have to be, according to this definition. Be aware that different people use different definitions of outliers for scatterplots, so there's no single hard-and-fast definition.
Source: This work adapted from Sophia Author Jonathan Osters.
An influential point is an observation that, if removed, significantly changes a statistical measure.
In a scatter plot, an outlier is an observation that has an extreme x value, an extreme y value, both an extreme x and y, or is well away from the main trend of points.