This tutorial is going to teach you about histograms and binning, including how the way you bin data changes a histogram and how a histogram differs from a bar graph.
Histograms are a type of distribution for quantitative data.
Histogram
A distribution of data that shows the frequency of different ranges of values. Each frequency is the height of a bar.
When you have a quantitative data set, the values are often spread out over a large range.
Suppose there's an elementary school class in Muncie, Indiana that chooses to keep track of the high temperature on each of the 180 school days. In Indiana, the temperature can get low in the winter, down to zero degrees Fahrenheit, and maybe near 90℉ at the beginning or at the end of the school year. In order to understand the overall trend of the data, you might not be interested in every single individual temperature. Maybe instead, you care about how many days were in the 20s, that is, days when the temperature was at least 20℉ but below 30℉, or in the 30s, at least 30℉ but below 40℉, or in the 40s, et cetera.
The idea that we can break those temperatures that occur over a wide range into more manageable intervals and categorize them that way is called binning.
Binning
The method of deciding what widths of categories should be used on a histogram
Binning allows us to make a frequency table out of those categories.
So suppose the Muncie School District recorded the high temperature on every school day and then categorized each day by whether it was in the 0s, 10s, 20s, 30s, 40s, 50s, 60s, 70s, 80s, or 90s (in ℉).
In the resulting frequency table, for example, there was one day out of the year that hit the 90s, and seven days were in the 80s. Once you decide on bin widths, you can create a frequency table and then a histogram.
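As a minimal sketch of how binning turns raw temperatures into a frequency table, the Python below groups each value by the lower edge of its 10-degree bin. The temperature list and the names `temps`, `BIN_WIDTH`, and `bin_start` are invented for illustration; they are not the class's actual data.

```python
from collections import Counter

# Hypothetical daily high temperatures in degrees Fahrenheit (illustrative only).
temps = [3, 8, 12, 15, 19, 24, 27, 31, 38, 45, 52, 58, 61, 67, 73, 79, 84, 95]

BIN_WIDTH = 10  # each bin covers 10 degrees: 0s, 10s, 20s, ...


def bin_start(temp, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains temp."""
    return int(temp // width) * width


# Count how many days fall into each bin.
frequency_table = Counter(bin_start(t) for t in temps)

for start in sorted(frequency_table):
    print(f"{start}s: {frequency_table[start]} day(s)")
```

With this rule, a day at 29.9℉ lands in the 20s bin and a day at exactly 30℉ lands in the 30s bin, so every temperature belongs to exactly one category.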
A histogram is somewhat similar to a bar graph in that, on the horizontal axis, you're going to put the temperatures, which are our categories now. The difference here is that these categories are numbers, so it makes sense to put 0 first, 10 second, and 20 third. Our bins will go from 0 to 10, 10 to 20, 20 to 30, 30 to 40, et cetera. Our frequencies, just like on a bar graph, will go up the vertical axis.
As you can see from this histogram, the first bin goes from 0 degrees to 10 degrees, and there are 10 days that fall into that category. The second bin goes from 10 degrees to 20 degrees, and because there are 16 days there, that bar goes all the way up to 16. The height of every bar comes from the matching frequency in the table.
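To illustrate how a frequency table becomes bars, here is a rough text-only sketch in Python. It draws the bars sideways (one row per bin) rather than upward, purely to avoid a plotting library. The counts for the 0s, 10s, 80s, and 90s are the ones mentioned in this tutorial; the counts for the other bins are made-up placeholders chosen so the total is 180 days, and `render_histogram` is a hypothetical helper name.

```python
# Frequency table keyed by the lower edge of each 10-degree bin.
# Only the 0s, 10s, 80s, and 90s counts come from the tutorial text;
# the remaining counts are placeholders invented for this sketch.
frequencies = {
    0: 10, 10: 16, 20: 22, 30: 30, 40: 31,
    50: 26, 60: 21, 70: 16, 80: 7, 90: 1,
}


def render_histogram(freqs, width):
    """Return one text row per bin, with one '#' per day in that bin."""
    rows = []
    for start in sorted(freqs):
        label = f"{start:>2}-{start + width:<3}"
        rows.append(f"{label}| " + "#" * freqs[start])
    return "\n".join(rows)


print(render_histogram(frequencies, 10))
```

Each row's length plays the role of a bar's height: the 0 to 10 row has 10 marks and the 10 to 20 row has 16.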
How does the way we bin data change the histogram?
In the original histogram, the data was classified by 10s. But what if you chose to classify it by 5 degree intervals instead? Instead of one bin going 0 to 10, what if those 10 days were split between 0 to 4 and 5 to 9, and the next bin were split into 10 to 14 and 15 to 19? Well, then the bins might look different.
Suppose four of the days in the 0s were between 0 and 4 degrees, and the other six were between 5 and 9 degrees. What we have done here is take one bin and split it into two bins. And if we do that with every one of our bins, we end up with twice as many bins and twice as many bars on our histogram.
In this new histogram, the bars are not as tall as they were before, but they do still give the same overall shape as they did before. However, there are not very many bars overall. There is a lot of data in one part of the graph and not a lot in the other parts. You'll note that in the 90 to 94 bin, there's no bar. The reason is that when we broke up that bar, the one data value that was in the 90s was actually in the 95 to 99 bin. When there's no data in a particular bin, there's not going to be any bar that extends up from the x-axis.
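The split of one bin into two can be sketched in Python as follows. The ten temperatures below are invented so that four fall between 0 and 4 degrees and six fall between 5 and 9 degrees, matching the counts in the example; `zeros_days` and `bin_counts` are hypothetical names.

```python
from collections import Counter

# Hypothetical temperatures for the ten days in the 0s (values invented for illustration).
zeros_days = [0, 2, 3, 4, 5, 6, 6, 7, 8, 9]


def bin_counts(values, width):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter(int(v // width) * width for v in values)


by_tens = bin_counts(zeros_days, 10)   # one bin: 0 up to 10
by_fives = bin_counts(zeros_days, 5)   # two bins: 0 up to 5, and 5 up to 10

print(dict(by_tens))
print(dict(by_fives))
```

The two narrower bins always add back up to the original bin, which is why the bars get shorter while the total number of days stays the same.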
So binning is important. You can have problems if you make the bins too narrow. In the previous examples, we had bins of width 10 degrees and bins of width 5 degrees. You could have made the bins 1 or 2 degrees wide and still had a legitimate histogram. But maybe you wouldn't have gotten the same overall shape of the distribution.
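There are two main problems you may have with binning: the pancake effect and the skyscraper effect. With the pancake effect, the bins are too narrow, so the histogram flattens into many short bars. With the skyscraper effect, the bins are too wide, so the data piles up into a few tall bars.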
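One way to see how bin width can lead to the pancake effect or the skyscraper effect is to count the bars and measure the tallest bar at several widths. The Python below is a small sketch under the assumption that very narrow bins produce many short bars and very wide bins produce a few tall ones; the data and the helper name `summarize` are invented for illustration.

```python
from collections import Counter

# Hypothetical high temperatures in degrees Fahrenheit (illustrative only).
temps = [2, 7, 11, 14, 18, 23, 26, 29, 33, 35, 37, 41,
         44, 46, 48, 52, 55, 57, 63, 66, 71, 78, 84, 96]


def summarize(values, width):
    """Return (number of non-empty bins, height of the tallest bar)."""
    counts = Counter(int(v // width) * width for v in values)
    return len(counts), max(counts.values())


for width in (1, 10, 50):
    bars, tallest = summarize(temps, width)
    print(f"width {width:>2}: {bars} bars, tallest bar = {tallest}")
```

For this made-up data, a width of 1 gives 24 bars that are each only 1 high, a width of 50 gives just 2 bars with the tallest at 15, and a width of 10 sits in between with 10 bars and a tallest bar of 4.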
You might confuse a bar graph with a histogram from time to time. But there are a couple of differences between the two kinds of graphs.
There are mainly two key differences.
Histograms are distributions for quantitative data, typically data that is more spread out. You use binning to group the spread-out data and create bars using the frequencies in those bins. Histograms can look like bar graphs, but they are different. It's important to bin the data appropriately so that you don't get the pancake effect or the opposite problem, the skyscraper effect.
Thank you and good luck!
Source: THIS WORK IS ADAPTED FROM SOPHIA AUTHOR JONATHAN OSTERS