This tutorial is going to explain the idea of outliers and modified boxplots. These two ideas are related. Boxplots were first introduced in another tutorial. This time we're going to modify them so that we can see outliers on them.
Outliers were previously explained to be values that fall far outside the pattern established by the rest of the data. They're really, really high or really, really low in comparison to the rest of the data set. So if we look at a set of test scores, these values, the 90, 98, 89, almost everyone scored in the 80's or 90's, except for this poor soul, who scored a 46. That score is going to be an outlier.
Now, it would be nice if there was a mathematical definition we could use to determine an outlier. Well, there is. We can determine whether a point really is an outlier or not using what's called the 1.5 times IQR method. IQR stands for Interquartile Range.
First we find the quartiles of the data set. What we have to do is order the data from least to greatest, find the median, and then within the low and high halves of the data, find the medians of those. So the median of the lower half, which becomes the first quartile, is between 84 and 88, which is 86.
Up here in the upper half, the median is between 91 and 94, which is 92 and 1/2, and that becomes the third quartile. And the interquartile range is the distance between those two. The difference between 92 and 1/2 and 86 is 6.5. This hasn't determined an outlier for us yet, but we'll keep going, and it will.
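The quartile steps just described can be sketched in a few lines of Python. This is only an illustration: the function name `quartiles` is made up here, and the score list is an assumption, assembled from the eight values mentioned in this tutorial (the data set shown on screen may differ).

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)            # order from least to greatest
    n = len(data)
    half = n // 2
    lower = data[:half]              # the low half of the data
    upper = data[half + n % 2:]      # the high half (middle value skipped when n is odd)
    return median(lower), median(data), median(upper)

# Assumed data: the eight scores mentioned in the tutorial.
scores = [90, 98, 89, 46, 84, 88, 91, 94]

q1, med, q3 = quartiles(scores)
iqr = q3 - q1                        # interquartile range
print(q1, q3, iqr)                   # 86.0 92.5 6.5
```

Note that software packages often compute quartiles with a slightly different rule, so their Q1 and Q3 may not match the median-of-halves values exactly.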
The method says that if we have a point that is 1.5 IQRs or more below the first quartile, or 1.5 IQRs or more above the third quartile, then it is an outlier. So let's look at this example. The interquartile range here is 6.5. That means that a point can be up to 1 and 1/2 interquartile ranges below the first quartile and still not be an outlier.
If we take a look, 86, which is Q1, minus 1.5 IQR, gets you to 76.25. On the high side, the third quartile, 92.5, plus 1.5 IQR is equal to 102.25. Assuming that this was a 100-point test, what this indicates is that we can't have any outliers on the high side. But anything below 76.25 will be considered an outlier on the low side. Only 46 falls outside this range, and so it is an outlier.
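As a rough check of that arithmetic, here is a small Python sketch. The helper names `outlier_fences` and `find_outliers` are hypothetical, and the score list is again an assumption built from the values mentioned in this tutorial. The sketch treats only points strictly beyond the cutoffs as outliers; a point sitting exactly on a cutoff could be handled either way.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs for the 1.5 x IQR method."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall strictly outside the cutoffs."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Quartiles from the tutorial: Q1 = 86, Q3 = 92.5.
low, high = outlier_fences(86, 92.5)
print(low, high)                          # 76.25 102.25

# Assumed data: the eight scores mentioned in the tutorial.
scores = [90, 98, 89, 46, 84, 88, 91, 94]
print(find_outliers(scores, 86, 92.5))    # [46]
```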
Let's take another look. Suppose that we have the data set of prices for all houses that were purchased in Albuquerque, New Mexico, from February to April in 1993. These are in thousands of dollars. So this was a $54,000 house.
I have calculated the first and third quartiles for you. If you would, pause the video and determine what the range is for outliers. What is the lower fence for outliers, and what is the upper fence? Pause the video, scribble it out next to you.
What you should have come up with is this. The interquartile range is 120 minus 78, which is 42, and 1.5 IQR is 63. So the lower fence is 78 minus 63, which is 15, and the upper fence is 120 plus 63, which is 183. Anything below 15 or above 183 will be an outlier.
Now, what you might notice is that there's nothing in the list below 15. But there are several above 183. In fact, there are seven. There are seven outliers in this data set. That is completely legitimate and legal to have happen.
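If you'd like to check your scribbles, the same arithmetic can be written out in Python. This sketch uses only the two quartiles given in the tutorial (78 and 120, in thousands of dollars) and the one house price that was read aloud; the full price list isn't reproduced here.

```python
# Quartiles from the tutorial, in thousands of dollars.
q1, q3 = 78, 120

iqr = q3 - q1                    # 120 - 78
lower_fence = q1 - 1.5 * iqr     # 78 - 63
upper_fence = q3 + 1.5 * iqr     # 120 + 63
print(iqr, lower_fence, upper_fence)   # 42 15.0 183.0

# The $54,000 house sits between the fences, so it is not an outlier.
price = 54
print(lower_fence <= price <= upper_fence)   # True
```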
We can use this new information to create a new version of an already existing plot that we have. We've made boxplots in another tutorial. We can modify them to show outliers.
Normally the whiskers on the box-and-whisker plot extend all the way out to the maximum and minimum. That will make the whiskers really long if the maximum or minimum or even both are outliers. Instead of going all the way out to those outliers, we'll just extend the whiskers to the highest and lowest values that aren't outliers and notate the outliers separately. So take a look.
Suppose we go back to this student data set. We're going to mark the same values that we would have had we been making a regular box-and-whisker plot. The only thing is we're not going to go all the way down to 46 for our minimum even though 46 is our minimum. 46 is an outlier.
We'll go to the lowest number that isn't an outlier, 84, and make our line there. Then we'll make our box and whiskers, but we still have to show the 46 as part of this data set somehow. We'll mark it with a dot. This is a modified boxplot.
In a regular boxplot, we would've extended this whisker all the way down to 46. We chose not to do it this time. We stopped at the lowest number that wasn't an outlier. Going back to the home value data, there were seven high outliers. This is a modified boxplot for that data set.
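The pieces needed to draw a modified boxplot by hand can be gathered with a short Python sketch. The function name `modified_boxplot_stats` is made up for this illustration, the quartiles use the median-of-halves method from earlier, and the score list is an assumption built from the eight values mentioned in this tutorial.

```python
from statistics import median

def modified_boxplot_stats(values):
    """Return whisker ends, quartiles, median, and outliers for a modified boxplot."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])                # median of the low half
    q3 = median(data[half + n % 2:])        # median of the high half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "whisker_low": inside[0],           # lowest value that is not an outlier
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "whisker_high": inside[-1],         # highest value that is not an outlier
        "outliers": outliers,               # drawn as separate dots
    }

# Assumed data: the eight scores mentioned in the tutorial.
stats = modified_boxplot_stats([90, 98, 89, 46, 84, 88, 91, 94])
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])   # 84 98 [46]
```

The low whisker stops at 84 rather than 46, and the 46 is reported separately so it can be marked with a dot.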
And so to recap, we can determine in some measurable way if a point is an outlier. We don't have to qualify it by saying, well, I think it's really high. We can create gates, which we also called fences. Within those gates, a point will not be an outlier. Outside the gates, it will be an outlier. That's the 1.5 IQR method.
Data sets might have no outliers, or they might have one or more outliers on the low side, or one or more outliers on the high side, or both. There's no rule for how many outliers are allowed in a data set. Whatever outliers exist, a modified boxplot will show them.
It's a regular boxplot that also shows outliers. And you extend the whiskers only as far as the data stays inside the gates. So the terms that we used were outliers and the 1.5 IQR method, which identifies the outliers and allowed us to create modified boxplots.
Good luck, and we'll see you next time.
1.5 IQR Method: If a point is larger than Q3 + 1.5 × IQR, or smaller than Q1 - 1.5 × IQR, then it is an outlier.
Modified Boxplot: A graphical display showing a modified version of the five number summary. If a distribution has outliers, then the "whiskers" only extend to the highest and lowest points that are not outliers.
Outlier: A point that is so large or small as to be unusual, given the rest of the data points.