Outliers and Influential Points

This tutorial covers outliers and influential points. First, outliers. It's a term that we've heard in other tutorials, particularly when talking about the 1.5 IQR rule and modified box plots. In this case, we're talking about outliers in terms of a scatter plot, so what we consider an outlier is not as definitive. An outlier could be an extreme x-value, an extreme y-value, extreme in both x and y, or it could even be extreme in neither x nor y but still away from the main trend.

So that's what I mean by less definitive. There's no specific rule that says, yes, this is an outlier, and no, this one is not. There's a little bit of wiggle room, and almost personal opinion, on whether something is extreme enough or far enough outside the trend to be an outlier. So what I say here is my own judgment, and some people might classify a point in a slightly different way. I'll try to point it out when that happens.

Now in our first example, over here, we have four different scatter plots. And we're going to look at each of them to determine whether or not there is an outlier. So in this one here, we have a set of data points that fall almost on a straight line, and then there's this one point up at the top here. This particular value would be an outlier.

It is an outlier because it has an extreme y-value. On the x-axis, this data point is within the range of the other data points, so that's not a problem. But if you look on the y, the next highest value is only around 8, whereas this one is above 12. That's much higher than the other ones, so I would consider that to be an outlier.

Now here, something similar is happening. There's a point way over here, but most of the rest of the data points are contained in this section right here. So this point way off to the side, that's an outlier as well. It is extreme on the x and on the y. On the x, it's probably around 20 when everything else is around 8. The same goes for the y-axis: the next highest value is 9, and this one's above 12. So that one's an outlier because it's an extreme x and an extreme y.

Now right above this is a curved set of data points. And here, I would probably say that there are no outliers. All the points are within the same kind of range for x and y. There's nothing that's extreme, and there's nothing that's outside of the trend.

If, let's say, we had a point here underneath the curve, some people would consider this to be an outlier. Even though it's not extreme on the x or the y, it's outside of the trend of the other points, so it doesn't really follow the pattern. So that could be considered an outlier there.

Now this last one up at the top here, I would say that there probably are no outliers for this scatter plot. Again, we're not seeing any extreme x-values or extreme y-values. They all seem to follow the same general trend. So that one, I would say, no outliers.

Now we'll talk about influential points. Influential points are points that, when removed, significantly change a statistical measure. So removing one could change the mean. It could change the slope of the regression line, which we'll learn about a little bit later. But it's a point that's very strongly affecting a summary of the data set.
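One way to make "when removed, significantly changes a statistical measure" concrete is to leave each point out in turn and see how far the measure moves. The sketch below does that for the least-squares slope. The data values are hypothetical (they are not read off the plots in this tutorial), and `slope` is just an illustrative helper name.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical data: five points on a line plus one far-off point.
points = [(6, 5), (7, 6), (8, 7), (9, 8), (10, 9), (20, 13)]

full_slope = slope(points)

# Leave each point out in turn and record how much the slope moves.
changes = []
for i, p in enumerate(points):
    rest = points[:i] + points[i + 1:]
    changes.append((abs(slope(rest) - full_slope), p))

# The point whose removal moves the slope the most.
biggest_change, most_influential = max(changes)
print(most_influential, round(biggest_change, 3))
```

With this made-up data, the far-off point is the one whose removal moves the slope the most. How large a change counts as "significant" is still a judgment call, just as with outliers.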

Now here, I would consider this an influential point. When you add it in, it very much changes the mean value of the x's, because it sits well above and away from all the other x's here. If I took this point away, the mean for the x's would be 8. But with this point in, the mean for the x's gets pulled up. I don't know exactly where it would fall, but somewhere above 8. So that's an influential point.
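To see the pull on the mean as arithmetic, here is a small sketch. The x-values are assumptions chosen so that the main cluster has a mean of 8 and the far point sits near 20, in the spirit of the description above. They are not the actual values from the plot.

```python
from statistics import mean

x_main = [6, 7, 8, 9, 10]  # hypothetical cluster of x's, mean is 8
x_far = 20                 # hypothetical far-off x-value

mean_without = mean(x_main)
mean_with = mean(x_main + [x_far])

print("mean without the point:", mean_without)
print("mean with the point:   ", mean_with)
```

One extra point out of six is enough to move the mean of these x's from 8 up to 10.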

Another thing that's good to know about outliers and influential points: with an influential point or an outlier like the one we have up here, because the mean is affected, r ends up being affected too. r is based on the means of x and y. So the existence of an outlier like that one is going to affect the means and, in turn, greatly affect r. It's also influential on the y-axis because, again, it's pulling up the mean of the y's.
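The effect on r can be sketched the same way. The helper below computes the correlation from deviations about the means of x and y, which is why a point that shifts those means also shifts r. The data are hypothetical: eight points on a line, plus one point with an ordinary x-value and an extreme y-value.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r, built from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfectly linear cluster.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 2, 3, 4, 5, 6, 7, 8]

r_without = pearson_r(xs, ys)
# Add one point with a typical x (4) but an extreme y (13).
r_with = pearson_r(xs + [4], ys + [13])

print(round(r_without, 3), round(r_with, 3))
```

In this made-up example a single extreme y-value drops r from essentially 1 to well below it, even though the other eight points are perfectly linear.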

Here, you could consider this an influential point as well because, again, it's going to significantly increase the mean of the y's. And it's also somewhat increasing the slope of the regression line. Here, I don't know that I would necessarily classify any of these points as influential points. And similarly, here, I don't know that I would necessarily classify any of those as influential points.

But again, like with outliers, there's a little bit of wiggle room. It's not something that's hard and fast. Someone might say that, yes, they do see an influential point up here, but I personally don't. So this has been your tutorial on outliers and influential points.