Source: Tables and graphs created by the author
This tutorial is going to teach you about histograms and binning.
Histograms are a type of distribution for quantitative data. When you have a quantitative data set, the values are oftentimes spread out over a large range.
So let's take an example. Suppose an elementary school class in Indiana, let's say Muncie, Indiana, chooses to keep track of the high temperature for each of the 180 school days. Now in Indiana it can get low, down to zero degrees Fahrenheit, in the winter and maybe near 90 at the beginning or at the end of the school year. So to get an overall trend for the data we might not be interested in every single individual temperature. But maybe we care about how many days were in the 20s, that is the temperature was 20 up to 29.999, or in the 30s, as long as it starts with a 3, 30 up to 39.999, 40s, et cetera.
The idea that we can break those temperatures that occur over a wide range into more manageable intervals and categorize them that way is called binning. Binning allows us to make a frequency table out of those categories.
So suppose they recorded every temperature. But then they categorized them by whether they were in the 0s, 10s, 20s, 30s, 40s, 50s, 60s, 70s, 80s, or 90s. This means there was one day out of the year that it hit the 90s, and seven days were in the 80s. Once we decide on bin widths we can create a frequency table. And then we can create a histogram.
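To make the binning step concrete, here is a minimal Python sketch of turning raw temperatures into a frequency table. The temperatures in `temps` are made up for illustration and are not the class's actual readings, and the helper names `bin_start` and `frequency_table` are hypothetical.

```python
from collections import Counter

# Hypothetical high temperatures (degrees Fahrenheit); NOT the class's real data.
temps = [3, 8, 12, 15, 19, 24, 27, 31, 38, 45, 52, 58, 61, 67, 73, 84, 95]

def bin_start(temp, width=10):
    """Return the lower edge of the bin that holds temp (29.999 lands in the 20s)."""
    return int(temp // width) * width

def frequency_table(values, width=10):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = Counter(bin_start(v, width) for v in values)
    return dict(sorted(counts.items()))

table = frequency_table(temps)
for start, count in table.items():
    print(f"{start} up to {start + 10}: {count}")
```

Each row that prints is one line of the frequency table: a bin and the number of days that landed in it. Note that this sketch only lists bins that contain at least one value.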
A histogram is somewhat similar to a bar graph in that on the horizontal axis, we're going to take the temperatures, which are our categories now. The only difference is these categories are numbers. And it makes sense that we would put 0 as being first, and 10 as being second, and 20 as being third. Our bins will go from 0 to 10, 10 to 20, 20 to 30, 30 to 40, et cetera. Our frequencies, just like a bar graph, will go up the vertical axis.
And so this information in a histogram looks like this. Our first bin goes from 0 degrees to 10 degrees. And there were 10 days in that bin. Our second bin goes from 10 degrees to 20 degrees. And there were 16 days in that bin. And so this bar goes all the way up to 16. And every bar's height comes from the frequency table in the same way.
Now the difference between this and some other distribution is that we made a choice: we chose to classify it by 10s here. But what if we chose to classify it by 5 degree intervals instead? Instead of going 0 to 10, what if those 10 days were split up between 0 to 4 and 5 to 9, and the next bin between 10 to 14 and 15 to 19? What if we split them that way? Well then the bins might look different. Suppose four of those 10 days were between 0 and 4 degrees. And six of them were between 5 and 9 degrees. What we have here is we took one bin and split it into two bins. And if we do that with every one of our bins we end up with twice as many bins and twice as many bars on our histogram.
Let me go back. We go from this set of bins and this histogram to this set of bins and this histogram. There are twice as many bars. The bars aren't as tall as they were before. But they do still give the same overall shape as they did before. Still there's not very many here. There's a lot in here. And not a lot over there.
You'll note that in the 90 to 94 bin there's no bar. The reason is that when we broke up that bar, the one data value that was in the 90s was actually in the 95 to 99 bin. When there's no data in a particular bin there's not going to be any bar that extends up from the x-axis.
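Here is a small Python sketch of what happens when each 10 degree bin is split into two 5 degree bins. The individual readings are invented, but I chose them so the counts match the example: four days at 0 to 4, six days at 5 to 9, and one day in the 90s that falls at 95 or above. The function name `counts_by_width` is hypothetical.

```python
# Hypothetical readings: the values are invented, but the counts match the lesson
# (4 days at 0-4, 6 days at 5-9, and 1 day in the 90s that is 95 or above).
temps = [0, 1, 3, 4, 5, 5, 6, 7, 8, 9, 96]

def counts_by_width(values, width, low=0, high=100):
    """Count values in each bin [start, start + width), keeping empty bins as 0.

    Assumes every value lies in [low, high) and that low is a multiple of width.
    """
    counts = {start: 0 for start in range(low, high, width)}
    for v in values:
        counts[int(v // width) * width] += 1
    return counts

wide = counts_by_width(temps, 10)
narrow = counts_by_width(temps, 5)

print(wide[0], narrow[0], narrow[5])     # the 0s bin and its two halves
print(wide[90], narrow[90], narrow[95])  # the 90s bin and its two halves
```

The two halves always add back up to the original bin, and a half with a count of zero is a bin that gets no bar.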
So you might confuse a bar graph with a histogram from time to time. There are two key differences. First, in a histogram the boxes touch. And it makes sense because the intervals, the bins, run one into the other, like with our temperature example. We go right from the 0s into the 10s. It makes sense to have the boxes right next to each other. Whereas in a bar graph they don't have to do that.
Secondly, the order of the boxes in a histogram matters. In a bar graph, typically, there's no reason to believe that one category has a higher value than the other. Suppose that we were doing college majors. There's no reason that I need to put economics further to the right than chemistry. It's not numerically greater. However, in a histogram the values further to the right are, in fact, numerically greater than the values to the left. And so because we're dealing with higher numbers and lower numbers, the order of the boxes matters.
Note also that binning is important. We can have problems if we make the bins too narrow. In the example that we did before we had bins of width 10 degrees and bins of width 5 degrees. We could have made bins of width 1 or 2 degrees. And we would still have gotten a legitimate histogram. But maybe we wouldn't have gotten the overall shape of the distribution like we did on the histograms that we made.
Bins that are too narrow can create what I call the pancake effect: too many bins with almost nothing in them. You don't really get to see the overall shape.
Conversely, we can make the opposite mistake. Suppose that I went from 0 degrees to 50 degrees and then 50 degrees to 100 degrees. This is that data shown in two bins instead of the many bins that I had on the previous example. If I have too few bins and lots of data in them you still don't get what the shape of the distribution looks like. You know that most of the data is in this bin and not that one. But you don't know where in this bin it is. The classes were too wide. The bins were too wide. And you don't get the overall understanding.
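If you want to see both mistakes side by side, here is a rough Python sketch that draws the same data as a text histogram with three different bin widths. The data in `temps` is invented for illustration, and the function names `histogram_counts` and `print_histogram` are hypothetical.

```python
# Hypothetical daily highs (degrees Fahrenheit), invented for illustration.
temps = [2, 7, 11, 14, 18, 23, 25, 29, 33, 36, 41, 44, 47, 52, 55, 63, 68, 74, 81, 96]

def histogram_counts(values, width, low=0, high=100):
    """Return sorted (bin_start, count) pairs covering [low, high).

    Assumes every value lies in [low, high) and that low is a multiple of width.
    """
    counts = {start: 0 for start in range(low, high, width)}
    for v in values:
        counts[int(v // width) * width] += 1
    return sorted(counts.items())

def print_histogram(values, width):
    """Print one row of '#' marks per bin; an empty bin prints no marks."""
    print(f"bin width {width}")
    for start, count in histogram_counts(values, width):
        print(f"{start:>3} to {start + width:<3} {'#' * count}")

# Width 2 is too narrow, width 50 is too wide, width 10 sits in between.
for width in (2, 10, 50):
    print_histogram(temps, width)
```

With width 2 no bin in this made-up data holds more than one value, so every bar is flat. With width 50 there are only two bars, and you can't tell where inside each bin the data sits.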
So to recap, histograms are distributions for quantitative data. Typically that data is more spread out. And so we bin the spread out data and create bars using the frequencies in those bins. And it's important to bin the data appropriately so that you don't get the pancake effect, and you don't get the opposite problem, the skyscraper effect. In this tutorial we talked about binning and histograms. Good luck. And we'll see you next time.
Binning: The method of deciding what widths of categories should be used on a histogram.
Histogram: A distribution of data that shows the frequency of different ranges of values. Each frequency is the height of a bar.