This tutorial explains outliers and modified boxplots. You will learn about:
1. Outliers
2. Modified Boxplots
Outliers and modified boxplots are related.
Outlier
A point that is so large or small as to be unusual, given the rest of the data points.
Modified Boxplot
A graphical display showing a modified version of the five number summary. If a distribution has outliers, then the "whiskers" only extend to the highest and lowest points that are not outliers.
Boxplots were first introduced in another tutorial. This tutorial will present a modified version of boxplots so that it is easier to observe outliers in them.
Outliers were previously described as values that fall far outside the pattern established by the rest of the data. They are either very high or very low in comparison to the rest of the data set.
Here is a set of test scores:
Almost everyone scored in the 80s or 90s, except for one student, who scored a 46. That score is an outlier.
In order to make it easier to find outliers, there is a mathematical rule for determining whether a point is an outlier or not. This is called the “1.5 x IQR rule.” IQR stands for Interquartile Range.
1.5 x IQR Rule
If a point is larger than Q3 + 1.5 x IQR, or smaller than Q1 - 1.5 x IQR, then it is an outlier.
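The rule can also be written as a short check. The following is only a sketch: the function name is_outlier and the quartile values 10 and 20 are made up for illustration and do not come from any data set in this tutorial.

```python
def is_outlier(x, q1, q3):
    """Return True if x lies beyond either fence of the 1.5 x IQR rule."""
    iqr = q3 - q1                      # interquartile range
    lower_fence = q1 - 1.5 * iqr       # anything smaller is a low outlier
    upper_fence = q3 + 1.5 * iqr       # anything larger is a high outlier
    return x < lower_fence or x > upper_fence


# Hypothetical quartiles (Q1 = 10, Q3 = 20), so the fences are -5 and 35
print(is_outlier(50, q1=10, q3=20))    # True: 50 is above the upper fence
print(is_outlier(30, q1=10, q3=20))    # False: 30 is inside the fences
```

Note that a point sitting exactly on a fence is not flagged, because the rule says "larger than" and "smaller than."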
So how do you use the 1.5 x IQR method?
1. First, find the quartiles of the data set:
The median of this data set is 90. The first quartile is the median of the lower half of the data; it falls between 84 and 88, so it is 86. The third quartile is the median of the upper half; it falls between 91 and 94, so it is 92½.
The interquartile range is the distance between the first and third quartiles: 92½ - 86 = 6.5.
2. If a point is more than 1.5 IQRs below the first quartile or more than 1.5 IQRs above the third quartile, then it is an outlier. Here 1.5 x 6.5 = 9.75, so the lower fence is 86 - 9.75 = 76.25 and the upper fence is 92½ + 9.75 = 102.25.
Assuming that this was a 100-point test, the data set can't have any outliers on the high side, because the upper fence of 102.25 is above the highest possible score. But anything below 76.25 will be considered an outlier on the low side. Only 46 falls outside this range, and so it is an outlier.
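The two steps above can be sketched in code. The list of scores below is hypothetical: it was chosen only so that its quartiles match the ones quoted in this tutorial, and it is not the tutorial's actual data. The helper name quartiles is also made up, and it assumes the "median of each half" method with at least two data values.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-each-half method."""
    values = sorted(data)
    half = len(values) // 2                    # size of each half; a middle value is left out
    lower = values[:half]                      # lower half of the sorted data
    upper = values[len(values) - half:]        # upper half of the sorted data
    return median(lower), median(values), median(upper)


# Hypothetical scores, NOT the tutorial's real list
scores = [46, 84, 88, 90, 90, 91, 94, 98]

q1, med, q3 = quartiles(scores)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print(q1, med, q3, iqr)            # 86.0 90.0 92.5 6.5
print(lower_fence, upper_fence)    # 76.25 102.25
print(outliers)                    # [46]
```

Be aware that software packages often compute quartiles with slightly different conventions, so their fences may not match a hand calculation exactly.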
Suppose that we have the data set of all house sale prices in Albuquerque, New Mexico, from February to April 1993. The prices are in thousands of dollars. The first quartile is 78 and the third quartile is 120.
What are the lower fence and the upper fence for outliers?
Here is what you should have come up with.
The interquartile range is 120 - 78 = 42, so 1.5 x IQR is 63. The lower fence is 78 - 63 = 15 and the upper fence is 120 + 63 = 183. Any point below 15 or above 183 is an outlier.
Notice that nothing in the list is below 15, but seven values are above 183. This means that there are seven outliers in this data set. It is perfectly legitimate for a data set to have that many.
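Here is a small sketch of the same fence calculation, followed by sorting values into low and high outliers. The quartiles 78 and 120 come from the example above, but the short list of prices is hypothetical and is not the Albuquerque data.

```python
q1, q3 = 78, 120                   # quartiles from the example (thousands of dollars)
iqr = q3 - q1                      # 42
lower_fence = q1 - 1.5 * iqr       # 78 - 63 = 15.0
upper_fence = q3 + 1.5 * iqr       # 120 + 63 = 183.0

# Hypothetical prices in thousands of dollars, NOT the real Albuquerque list
prices = [54, 78, 95, 120, 150, 190, 215]

low_outliers = [p for p in prices if p < lower_fence]
high_outliers = [p for p in prices if p > upper_fence]

print(lower_fence, upper_fence)    # 15.0 183.0
print(low_outliers)                # []
print(high_outliers)               # [190, 215]
```

Keeping the low and high outliers in separate lists makes it easy to see that a data set can have outliers on one side only, as the house prices do.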
You can use this information to create a new version of a plot you already know. You made boxplots in another tutorial and can modify them to show outliers.
Generally, you would make the whiskers on the box-and-whisker plot extend all the way out to the maximum and minimum. That will make the whiskers really long if the maximum or minimum or even both are outliers.
So for this modified boxplot, instead of extending the whiskers all the way out to those outliers, extend them only to the highest and lowest values that aren't outliers and mark the outliers separately.
Refer back to the student data set. Mark the same values that you would if you were making a regular box-and-whisker plot. However, don’t extend the whisker all the way down to 46, even though 46 is the minimum. Because 46 is an outlier, go instead to the lowest value that isn't an outlier, which is 84, and make your line there.
Then you can make your box and whiskers.
You still have to show the 46 as part of this data set somehow, so mark it with a dot:
This is a modified boxplot.
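The numbers needed to draw a modified boxplot can be gathered with a short function. This is a sketch under stated assumptions: the function name modified_boxplot_stats is made up, the quartiles use the median-of-each-half method, and the scores are the same kind of hypothetical list as before (chosen to match the quartiles in this tutorial, not taken from the real data).

```python
from statistics import median


def modified_boxplot_stats(data):
    """Return whisker ends, quartiles, median, and outliers for a modified boxplot."""
    values = sorted(data)
    half = len(values) // 2
    q1 = median(values[:half])                     # median of the lower half
    q3 = median(values[len(values) - half:])       # median of the upper half
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values that are not outliers
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median(values),
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": outliers,                      # drawn as separate dots
    }


# Hypothetical scores, NOT the tutorial's real list
scores = [46, 84, 88, 90, 90, 91, 94, 98]
stats = modified_boxplot_stats(scores)
print(stats["whisker_low"], stats["whisker_high"])   # 84 98
print(stats["outliers"])                             # [46]
```

The low whisker ends at 84 rather than 46, and 46 is reported separately so that it can be drawn as a dot.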
In the home value data, there were seven high outliers. This is a modified boxplot for that data set:
The 1.5 x IQR rule gives you a measurable way to determine whether a point is an outlier. Data sets might have no outliers, or they might have one or more outliers on the low side, or one or more outliers on the high side, or both. There's no rule for how many outliers are allowed in a data set. Whatever outliers exist, a modified boxplot will show them.
Thank you and good luck!
Source: THIS WORK IS ADAPTED FROM SOPHIA AUTHOR JONATHAN OSTERS