Outliers and Influential Points


Source: Image of graphs created by Ryan Backman

Hi. This tutorial covers Outliers and Influential Points. So let's start with the definition of an outlier, and this is an outlier for bivariate data. It's a little different from the outliers that we've talked about for univariate data, though still pretty similar.

Outliers are important to consider, because the correlation coefficient r is based on the means of both the x and the y variables, and a mean gets pulled toward extreme values. So r can be greatly influenced by outliers. So again, it's important to consider those outliers when you're dealing with bivariate data.
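To make that concrete, here is a minimal sketch of how a single outlier can move r. The data are made up for illustration (they are not the manatee data), and `pearson_r` is just a helper name chosen here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r, computed from deviations around the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up points that fall exactly on a line, so r is 1.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# Add one point with a typical x value but an extreme y value.
r_with_outlier = pearson_r(xs + [3], ys + [30])

print(round(r_clean, 3), round(r_with_outlier, 3))  # r drops from 1.0 to about 0.28
```

One extra point out of six is enough to take a perfect correlation down to a weak one, which is why outliers are worth checking before you interpret r.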

So an outlier is a point in a scatter plot that has an extreme x value, an extreme y value, or both, and then a point can also be an outlier if it's well away from the main trend of the points. OK? So actually, there are four different ways. You can have an extreme x, an extreme y, or both, and then it's also possible that it's not an extreme value at all, but if it's far away from the rest of the points, or the main trend of the points, we can still consider it an outlier.

OK. So what we're going to look at is basically the same graph reproduced three times. What it plots is the number of boat registrations, measured in ten thousands, against the number of manatee fatalities, and each of these points represents a different year. OK?

So in Florida, they used to have a lot of issues with manatee deaths from power boats. So we can see that there seems to be a positive association between how many boat registrations there are, which is probably how many boats are in the water, and how many manatee fatalities they recorded that year. But let's look at some different possible outliers.

OK. So we said that an outlier can have an extreme x value. An extreme x value would mean that it's far away from the other points along the x-axis. So let's make one with an extreme x value but not an extreme y value. OK?

So this outlier would have an extreme x value but not an extreme y value. OK? So you'd have way more boat registrations than normal, but the y value, right around 30, would be a pretty typical number of fatalities in a year. So what we'd say for this one is that it's an outlier because it has an extreme x value. OK?

Now, if we look for something that has an extreme y value but not an extreme x value, then you're going to maybe look at a point up here. OK? So if we have an extreme number of manatee fatalities but not an extreme number of registrations, this also is an outlier, but it's for a different reason. So this is an outlier because of an extreme y value. OK?

Now, if you're looking for one that has both an extreme x and an extreme y, that's going to be a point maybe up here. OK? So again, this will also be an outlier. So now, if there are a high number of manatee fatalities and a high number of boat registrations, this is an outlier because of extreme x and y. OK?

Now, the fourth type of outlier that you'll see is if it's not an extreme value, but it's just way outside of the trend. So we'll just take a look at another example here. So let's say that your cluster of points, your scatter plot, looks something like this. OK? A high number of points, kind of in that parabolic shape, and you have a point right there, well outside that trend of data.

So even though this isn't an extreme value (the x value's pretty typical, and the y value's pretty typical), this will still be an outlier, and it's an outlier because it's outside of the trend of the data. OK? So that's just four different ways you'll see outliers when looking at bivariate data. OK?
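The four cases above can be sketched as a small check. This is only an illustration under stated assumptions: the tutorial gives no numeric rule for "extreme", so the cutoff of 3 standard deviations is an assumption, the trend is assumed to be a straight line rather than the parabolic shape in the example, and both the data and the name `classify_point` are made up here.

```python
from statistics import mean, stdev

def classify_point(others, point, cutoff=3.0):
    """Label `point` relative to `others`, a list of (x, y) pairs.

    `cutoff` is an assumed threshold in standard deviations.
    """
    xs = [x for x, _ in others]
    ys = [y for _, y in others]
    px, py = point
    mx, my = mean(xs), mean(ys)

    extreme_x = abs(px - mx) > cutoff * stdev(xs)
    extreme_y = abs(py - my) > cutoff * stdev(ys)
    if extreme_x and extreme_y:
        return "extreme x and y"
    if extreme_x:
        return "extreme x"
    if extreme_y:
        return "extreme y"

    # Fit a least-squares line to the other points (assumed linear trend).
    slope = sum((x - mx) * (y - my) for x, y in others) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in others]

    # Typical x and y, but far from the line compared with the usual scatter.
    if abs(py - (intercept + slope * px)) > cutoff * stdev(residuals):
        return "outside the trend"
    return "not an outlier"

# Made-up points scattered closely around an upward-sloping line.
others = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
          (5, 10.1), (6, 11.9), (7, 14.0), (8, 16.1)]

labels = {
    "far right": classify_point(others, (20, 9)),
    "far up": classify_point(others, (4, 40)),
    "far right and up": classify_point(others, (20, 40)),
    "off trend": classify_point(others, (7, 3)),
    "ordinary": classify_point(others, (4, 8.0)),
}
print(labels)
```

The "off trend" point is the interesting one: its x and y are both within the usual range, so only the comparison against the fitted line catches it.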

Now, let's also take a look at an influential point. An influential point is an observation that, if removed, significantly changes a statistical measure like r. OK. So if we think about this point here, that point will be an influential point, because if that point were removed from the entire data set, that's going to reduce, or weaken, the value of r, your correlation coefficient. OK?
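One way to check whether a point is influential is to compute r with and without it. The sketch below does that on made-up data (not the manatee data); `pearson_r` and `r_with_and_without` are helper names chosen for illustration.

```python
from math import sqrt

def pearson_r(points):
    """Correlation coefficient r for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)

def r_with_and_without(points, index):
    """Return (r using every point, r after removing points[index])."""
    remaining = points[:index] + points[index + 1:]
    return pearson_r(points), pearson_r(remaining)

# A loose cluster plus one point that is extreme in both x and y.
points = [(1, 3), (2, 1), (3, 4), (4, 2), (5, 3), (20, 20)]

r_all, r_removed = r_with_and_without(points, index=5)
print(round(r_all, 2), round(r_removed, 2))  # roughly 0.97 with the point, 0.14 without
```

Here the far-away point is doing almost all of the work: with it, the association looks strong, and without it, r is close to zero.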

So this point is influential, because if removed, it will weaken that r value. So this has been your tutorial on Outliers and Influential Points. Thanks for watching.