Outliers and Modified Boxplots

Source: Image of graph and table created by Ryan Backman

Hi. This tutorial covers outliers and modified boxplots. Let's start with the data set. It's a pretty large data set: n is equal to 90, so there are 90 data values here. A couple of notable things about the data set: the minimum is 30 and the maximum is 1441. The next largest value is 1068.

In terms of hundreds, so the 100s, 200s, 300s, 400s, and so on, it seems like most of those ranges are pretty well represented. So let's think about some ways of summarizing the data.

So a good way to summarize it is using the five number summary. I calculated that ahead of time, so the minimum, like we said, was 30, and the max was 1441. The median was 559.5. There is an even number of data points, so we had to take the average of the two middle values. Q1 was 436, Q3 was 739.

OK. Now a lot of times when you have the five number summary, it's really easy to calculate the IQR. So let's go ahead and do that. Recall, the IQR is just simply Q3 minus Q1.

So Q3 is 739, minus Q1, 436. If we subtract those two, we end up with 303. So that means the middle 50% of the values in that data set span 303 units.
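As a quick check of that subtraction, here is a minimal Python sketch. The variable names are only illustrative; the two quartile values are the ones from the five number summary.

```python
# Quartiles taken from the five number summary in the tutorial
q1 = 436
q3 = 739

# The IQR is the third quartile minus the first quartile
iqr = q3 - q1
print(iqr)  # 303
```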

OK. Let's also take a look now at the boxplot. So I've constructed a boxplot here. Again, min, Q1, median, Q3, maximum. This distance here, the length of that box, is our IQR, 303.

And now, the one thing we will want to look at is this maximum, 1441. The maximum of 1441 is far away from the rest of the data, and seems to give the distribution a positive skew. Remember, the next highest value is 1068, and this is all the way up at 1441. So that's going to make this upper whisker a lot longer, possibly giving the distribution a positive skew.

OK. So 1441 looks like an outlier, but is it? All right. So recall that an outlier is a data value that is significantly higher or lower than the majority of the other data values. There must be a gap between the outlier and the bulk of the other values.

OK. So this just defines, in general, what an outlier is. But we actually do have a more specific definition of an outlier, and that's using what's called the 1.5 times the IQR rule.

So any data value is an outlier if it is more than 1.5 IQR units away from the nearest quartile.

So if it's more than 1.5 IQR units above Q3, we consider it a high outlier. And if it's more than 1.5 IQR units below the first quartile, we consider it a low outlier.

So kind of putting those into formulas. A low outlier will be anything less than Q1 minus 1.5 times the IQR. OK. So this kind of gives you the fence post that separates a low outlier from just a regular data value.

Now, a high outlier is anything greater than Q3 plus 1.5 times the IQR. So again, that gives you a fence post that separates a regular data value from a high outlier.
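Written compactly, the two fence posts look like this. The symbols Q_1, Q_3, and x (for a data value) are notation assumed here for illustration; the tutorial itself only states the rule in words.

```latex
\[
\mathrm{IQR} = Q_3 - Q_1
\]
\[
x \text{ is a low outlier if } x < Q_1 - 1.5 \times \mathrm{IQR},
\qquad
x \text{ is a high outlier if } x > Q_3 + 1.5 \times \mathrm{IQR}
\]
```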

All right. So let's actually now determine the cutoff points for a low and high outlier for the data set. OK. So let's start with the low outlier. We'll do that one first. So the cutoff for a low outlier is Q1 minus 1.5 times the IQR. A low outlier has to be smaller than this value.

OK. So if we get those values off of the five number summary that we had before, Q1 recall, was 436, and then minus 1.5, we already calculated the IQR also, and that was 303.

OK. So if we calculate that, 436 minus 1.5 times 303. And what this value is, is negative 18.5.

So for a number to be a low outlier, it would need to be less than negative 18.5. And in our data set, we didn't have any negative values. So we're not going to have any low outliers.

Now, let's do the same thing for a high outlier. So the cutoff for a high outlier is Q3 plus 1.5 times the IQR. Q3, recall, was 739, plus 1.5 times the IQR of 303. And again, we'll type that into the calculator: 739 plus 1.5 times 303. And that gives me 1,193.5. So for something to be a high outlier, it needs to be larger than this value.

So the question is, are there any outliers in the data set? And the answer is yes, there is one. That maximum of 1441 is a high outlier, because 1441 is greater than the boundary point that we found of 1,193.5.
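The cutoff calculations and the check on 1441 can be reproduced with a short Python sketch. The function names `outlier_fences` and `classify` are made up for illustration, and only the three data values actually named in the tutorial (30, 1068, and 1441) are tested, since the full data set isn't listed here.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return (low_fence, high_fence) using the k times IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(value, low_fence, high_fence):
    """Label a single data value relative to the two fences."""
    if value < low_fence:
        return "low outlier"
    if value > high_fence:
        return "high outlier"
    return "not an outlier"


# Q1 and Q3 from the tutorial's five number summary
low_fence, high_fence = outlier_fences(436, 739)
print(low_fence, high_fence)  # -18.5 1193.5

# Minimum, second-largest value, and maximum from the tutorial
for value in (30, 1068, 1441):
    print(value, classify(value, low_fence, high_fence))
```

Only 1441 comes back as a high outlier; 30 and 1068 both sit inside the fences.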

And of course, it is possible to have multiple outliers. So if there were two values bigger than this, then we would have two high outliers. It's also possible to have both a low outlier and a high outlier in the same distribution. So these give us the boundaries, and then you just need to consult the data to determine if there are any values beyond them.

Now, those outliers can help us modify an existing boxplot. A modified boxplot is a type of box-and-whisker plot where outliers are identified with dots, and the whiskers only extend to the lowest and highest values that are not outliers.

So basically a modified boxplot is like a regular boxplot, but we use dots for our outliers.

So I'm going to go ahead and construct a modified boxplot for this data. Now, the first thing I'm going to do, is I'm going to identify my outlier with a dot. So my outlier is going to be here, at 1441.

So now I'm going to make my boxplot, pretty much just like normal. So I'm going to start with the minimum, here. Q1 is where my box is going to start. So I'm going to draw a longer line, here. My box is going to end at 739, so that's going to be about here. And I'm going to connect these to make my box.

So this rectangular region is still just your box. The median of 559.5, that's going to go inside of the box, here. So that represents my median. And I'm gonna connect this whisker.

Now, my upper whisker needs to stop at the highest value that's not the outlier. So that value is 1068, so now my whisker is going to only go out until 1068. I'll make a little hash mark, here. So a little bit past the 1,000. And then I'm going to draw this whisker out here.

So this now becomes my modified boxplot. It looks very similar to a regular boxplot, but again, we do have that outlier out here marked with a dot.
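The same construction, picking where each whisker stops and which points become dots, can be sketched in Python. The function name `modified_boxplot_parts` is hypothetical, and the short `sample` list is only a stand-in for the 90-value data set: 30, 1068, and 1441 come from the tutorial, while the other values are made up for illustration.

```python
def modified_boxplot_parts(values, low_fence, high_fence):
    """Return (lower_whisker, upper_whisker, outliers) for a modified boxplot.

    Whiskers stop at the lowest and highest values that are not outliers;
    anything beyond a fence is reported separately so it can be drawn as a dot.
    """
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return min(inside), max(inside), outliers


# Stand-in data: only 30, 1068, and 1441 are real values from the tutorial
sample = [30, 250, 436, 520, 599, 739, 900, 1068, 1441]

# Fences computed earlier in the tutorial: -18.5 and 1,193.5
lower_whisker, upper_whisker, outliers = modified_boxplot_parts(sample, -18.5, 1193.5)
print(lower_whisker, upper_whisker, outliers)  # 30 1068 [1441]
```

The upper whisker ends at 1068 rather than 1441, which matches the hand-drawn plot.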

And then just to compare this boxplot to our regular boxplot, they're going to look very, very similar, except that, obviously, this whisker stops at the highest value that isn't an outlier. So you're going to see that big gap displayed on your modified boxplot.

So this is a good way to get a little more information out of your boxplot, by identifying those outliers. We can also see that maybe there's not as much of a positive skew as we thought when we were looking at the original graph.

All right. Well, that has been your tutorial on outliers and modified boxplots. Thanks for watching.