[MUSIC PLAYING] Hi, Dan Laub, here. And in this lesson, we're going to discuss calculating the range of data. And before we get started, let's discuss the objectives for this lesson.
The first objective is to be able to determine the range of a data set. And the second is to be able to determine the interquartile range of a data set. And so we're going to cover both of those. So let's get started.
In addition to finding the center of a data set, such as the mode or median, we are also interested in finding a number that tells us how far the data set is spread out from the center. By determining how spread out the data set is, we are establishing a measure of what is known as variability. One way to measure variability is to determine what is known as the range of a data set. Another way is to determine what is known as the interquartile range of a data set.
With either of these measures, understanding the variability in data can help compare the findings of two tests to determine if they're comparable, whether those tests come from within the same study or from studies conducted at different places, at different times. By comparing the variability of different tests or studies, we can add reliability to those tests or studies. When there is less variability between data across the tests or studies, these tests may be considered more valid.
If you recall from the experimental method, among the eight different steps, step seven asks to revise the guess and start from step two if the prediction is wrong, or, if the prediction looks plausible, to start testing again from step four to verify the results. Why would we want to verify the results? Well, as we continue testing, this helps assure a minimal degree of variability in the test. And additional tests will show less variability. And as we continue to do so, it helps validate the results of the test that we're conducting.
Finding the range of a data set means finding the smallest value, known as the minimum, and the largest value, which is known as the maximum. The range is simply the difference between the maximum and the minimum values. So the range is a measure of variability or spread, meaning that the larger the range, the more spread out the values in the data set are. However, keep in mind that if the data set has extreme values, the range may not accurately represent the spread of the data.
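As a minimal sketch of the calculation just described, the Python snippet below finds the minimum and the maximum of a small list and subtracts one from the other. The function name data_range and the sample values are made up for illustration and are not part of the lesson.

```python
def data_range(values):
    """Return the range of a data set: maximum value minus minimum value."""
    if not values:
        raise ValueError("the data set must contain at least one value")
    return max(values) - min(values)


# Hypothetical sample values, chosen only to illustrate the calculation.
sample = [3, 7, 12, 5, 9]
print(data_range(sample))  # maximum 12 minus minimum 3 gives 9
```

A single very large or very small value changes max or min directly, which is why the range is sensitive to extreme values.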
So let's give you an example, here, to work with. Let's say, for instance, we're dealing with incomes of people. And we're dealing with a situation where there's going to be a very large range. So we could have a very low income, maybe a few thousand dollars a year, and we could have a very high income, potentially tens of millions of dollars per year. Obviously, this is an extreme range. It's not really going to give us a good sense of how those incomes are distributed.
Let's talk about determining the range of a data set. And so for a simple example, I'm going to pick the monthly rainfall of a particular city in the United States. And let's go with San Francisco, in this case. So if you see the graph in front of you, it depicts what the rainfall looks like in San Francisco, based upon the average amount they receive per month. And you clearly see here the graph's shape, where there's a lot in the early part of the year, very little in the summer months, and then, again, a lot during the end of the year.
And so how will we figure out the range of this particular data set? Well, we would simply arrange the numbers from smallest to largest. And so when we do so, we look at July having the least amount of average rainfall-- zero. It doesn't rain at all, on average, in San Francisco, during July.
And then, we work all the way up to December, in which they receive an average of 4.57 inches of rainfall for the month. And we've determined the range simply by looking at the difference between 0 and 4.57, which is going to be, obviously, 4.57 inches. And as you can see from the graph here, one end is the minimum, the other end is the maximum, and the difference would be the range.
Sometimes, however, we might be interested in determining how an observation falls into a specific range in order to make comparisons to other observations. In some cases, splitting the data into what's called quarters can be useful. And when the data is broken down into groups that represent 25% of the data, they are called quartiles.
For example, let's look at the height of men. And in the case of this, it's going to be a normal distribution-- as you see here with the bell-shaped curve. And the median height for men is going to be roughly 70 inches, or 5 foot 10 inches tall. And the quartiles, we've broken down according to this-- anything less than 68 inches will be in the first quartile, anything between 68 and 70 inches will be in the second quartile, between 70 and 72 in the third, and anything above 72 inches will be in the fourth quartile.
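The height example can be sketched as a small Python function that reports which quartile a given height falls into. The cutoffs of 68, 70, and 72 inches come from the lesson. The function name height_quartile is hypothetical, and the handling of a height that lands exactly on a cutoff is an assumption, since the lesson does not say which side it belongs to.

```python
def height_quartile(height_in):
    """Return the quartile (1 to 4) for a height in inches, using the lesson's cutoffs."""
    # Cutoffs from the lesson: 68 inches, 70 inches (the median), and 72 inches.
    # Assumption: a height exactly on a cutoff is placed in the higher quartile.
    if height_in < 68:
        return 1
    if height_in < 70:
        return 2
    if height_in < 72:
        return 3
    return 4


# Hypothetical heights, one for each quartile.
for h in (66, 69, 71, 74):
    print(h, "inches -> quartile", height_quartile(h))
```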
When considering just the range of a data set, sometimes the range can be thrown off by an occasional random value that may cause the range to not represent the true variability of a data set. In a case like this, it is useful to have a measure of variability that is more accurate. Such a measure of variability is known as the interquartile range.
The interquartile range represents the middle 50% of the data set, while the range represents the entire extent of the data set. Like the range, the interquartile range is a measure of variability, or spread. The larger the interquartile range, the more spread out the values in the middle 50% of the data set are. The interquartile range is a more reliable measure of the spread of data than the range is, because it does not depend only on the maximum and minimum values of a data set.
In order to find the interquartile range of a data set, one must find the first quartile-- denoted by Q1-- and the third quartile-- denoted by Q3. The interquartile range is then the difference between the third quartile and the first quartile, represented by the equation you see here. In order to consider the middle portion of the data, we are only concerned about the first and the third quartile.
So how will we go about figuring this out? Well, let's look at a hypothetical data set that would involve the income of 25 to 34-year-olds. And so the first and third quartile can be determined as follows.
First, we'd sort the data from smallest to largest value, and then, divide it into two halves. The middle value would be the median, which in this case is the midpoint between $57,000 per year and $50,000 per year, or $53,500 per year. Next, find the middle value of the first half. This is the first quartile, Q1, which in this case is $36,000. Third, determine the middle value of the second half. This would be the third quartile, or Q3, and this value is $77,000.
The median of the entire data set is called the second quartile, or Q2, which in this case would be $53,500. So the interquartile range is determined by using the following formula-- Q3 minus Q1. So in this specific instance, it would be $77,000 minus $36,000, or $41,000. And that would give us the value of the interquartile range for this group of income.
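The sort, split, and find-the-middle steps can be sketched in Python as follows. The transcript does not show the actual income values, so the list below is a hypothetical data set chosen only so that it reproduces the figures quoted in the lesson (Q1 of $36,000, a median of $53,500, and Q3 of $77,000). The function name quartiles is also made up, and leaving the middle value out of both halves when the count is odd is an assumption, since the lesson only shows an even split.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) by taking the median of each half of the sorted data."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values to form two halves")
    half = n // 2
    lower = data[:half]      # first half of the sorted data
    upper = data[n - half:]  # second half (the middle value is left out when n is odd)
    return median(lower), median(data), median(upper)


# Hypothetical incomes in dollars per year; NOT the lesson's actual data.
incomes = [20_000, 32_000, 40_000, 50_000, 57_000, 70_000, 84_000, 107_000]

q1, q2, q3 = quartiles(incomes)
iqr = q3 - q1  # interquartile range = Q3 minus Q1
print("Q1:", q1, "Q2:", q2, "Q3:", q3, "IQR:", iqr)
```

Other conventions for computing quartiles exist and can give slightly different values on the same data, so results from a statistics package may not match this method exactly.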
Now, notice how that would compare to the range. We started with a minimum of $20,000, and the maximum was $107,000. So the range, in this instance, is $87,000, which is more than twice as much as the interquartile range. So you can clearly see, with an example like this, how the interquartile range gives us a much better measure of variability.
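To see why the two measures behave differently, here is a rough Python sketch that computes both on the same data and then again after one extreme income is added. The income values are hypothetical, picked to match the minimum, maximum, and quartiles quoted in the lesson, and the extra $10,000,000 income is an invented extreme value. The function name spread is an assumption, and it uses the same median-of-halves approach described in the lesson.

```python
from statistics import median


def spread(values):
    """Return (range, interquartile range) for a list of numbers."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])              # middle of the first half
    q3 = median(data[len(data) - half:])  # middle of the second half
    return data[-1] - data[0], q3 - q1


# Hypothetical incomes in dollars per year; NOT the lesson's actual data.
incomes = [20_000, 32_000, 40_000, 50_000, 57_000, 70_000, 84_000, 107_000]
data_range, iqr = spread(incomes)
print("range:", data_range, "IQR:", iqr)

# Add one invented extreme income and compare again.
with_extreme = incomes + [10_000_000]
extreme_range, extreme_iqr = spread(with_extreme)
print("range:", extreme_range, "IQR:", extreme_iqr)
```

With these made-up numbers, the one extreme income pushes the range from $87,000 to $9,980,000, while the interquartile range moves only from $41,000 to $59,500.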
So let's talk about our objectives real quick, just to make sure we covered what we said we were going to. And the first objective was to be able to determine the range of a data set, which we did. And the second was to be able to determine the interquartile range of a data set, which we also did. And we covered the difference between the two and gave an example that illustrated how the interquartile range is simply a better measure of variability.
So again, my name is Dan Laub. And hopefully, you got some value from this lesson.
Interquartile range: Represents the middle 50% of the data set
Range: The difference between the maximum and minimum values
Interquartile range formula: Q3 - Q1
Range formula: Maximum Value - Minimum Value