Source: Image of Benford's Law table, Creative Commons: http://en.wikipedia.org/wiki/File:Benfords_law_illustrated_by_world%27s_countries_population.png
Hi. This tutorial covers paradoxes. So let's start with a definition of what a paradox is. A paradox is a situation where a phenomenon can be looked at in two seemingly contradictory ways. Paradoxes in statistics are common. Data interpretation often utilizes probability, and our intuition about probability is often misleading.
Probability gets pretty tricky. And a lot of times, what we think is true isn't actually true. So there are several paradoxes that come up in stats, and we're going to look at one in particular called Benford's Law.
So Benford's Law is a property that states that in many real world data sets, the first digit of whatever value you're measuring is not uniformly distributed. So if we're thinking about a data set that involves the values of houses, and we have a big list of data, chances are that the leading digit is not going to be uniformly distributed. Or if we think of the number of pages in a book, the first digit of all of those page counts might not be uniformly distributed.
Our intuition might say, well, yeah, all of that stuff is fairly random. It would seem to be uniformly distributed, so each digit from 1 to 9 would show up as the leading digit about 11% of the time. But Benford's Law says otherwise. So this is basically what he came up with. The percent chance of each digit being the leading digit of a real world data value is predicted by Benford's Law.
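The tutorial shows these percentages as a table but does not state a formula. The standard statement of Benford's Law gives the chance of a leading digit d as log10(1 + 1/d), and a minimal Python sketch of that table could look like the following. The function name `benford_probability` and the dictionary `benford` are illustrative names, not something from the tutorial.

```python
import math


def benford_probability(digit: int) -> float:
    """Chance that `digit` (1-9) is the leading digit under Benford's Law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Predicted chance for each possible leading digit, 1 through 9.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"leading digit {d}: {p:.1%}")
```

Run as written, this prints about 30.1% for the digit 1 and about 4.6% for the digit 9, and the nine values add up to 100%.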
So what that's saying is that the digit 1 has about a 30% chance of being the leading digit, OK? And if we think about that, if we go back to that housing value example, there are going to be a lot of houses valued somewhere in the $100,000s. So because a lot of those values have a leading digit of 1, that probability becomes so much higher.
And actually, it ends up becoming a positively skewed distribution. So as you move to the right, the probabilities get lower and lower. So the chance of a 9 being a leading digit in a real world data set is pretty low, less than 5%. And I'll give you an example of this.
So this data is from 2010. There were 237 countries that were part of this survey, which was all the countries in the world, and we looked at the first digit of their population. So basically, in this graph you look at the leading digit, and the value on the left here is the percent of countries with that number as the leading digit. So 67 out of 237 had a leading digit of 1 in their population. So a country's population might have been 17 million something, and that country would be in this bar. That bar goes up to about 27%. For a leading digit of 2, OK, that looks to be about 18%.
Now, these black dots show what Benford's Law predicted. So for a leading digit of 1, Benford's Law predicted the bar would actually be up a little bit more. But notice as the bars get lower and lower, those predictions become more accurate.
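The same kind of comparison between observed bars and predicted dots can be sketched in Python. The country population figures are not included in this transcript, so the sketch below uses the first 1,000 powers of 2 as a stand-in data set. That choice is an assumption made only for illustration; it is a sequence whose leading digits are known to follow Benford's Law closely. The helper name `leading_digit` is also illustrative.

```python
import math
from collections import Counter


def leading_digit(n: int) -> int:
    """First digit of a nonzero integer, ignoring any sign."""
    return int(str(abs(n))[0])


# Stand-in data (assumption): powers of 2, not the population data from the graph.
values = [2 ** k for k in range(1000)]

counts = Counter(leading_digit(v) for v in values)
observed = {d: counts[d] / len(values) for d in range(1, 10)}
predicted = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d in range(1, 10):
    print(f"digit {d}: observed {observed[d]:.1%}, predicted {predicted[d]:.1%}")
```

For this stand-in data, about 30% of the values start with a 1, which is far from the roughly 11% that a uniform distribution would give.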
All right, so although our intuition might tell us that the leading digit of real world data would have a uniform distribution, Benford's Law tells us otherwise. Since our intuition in this case is wrong, Benford's Law is a type of statistical paradox. And there are others, but I just want to make sure that you had a good example of one specific statistical paradox. So that has been your tutorial on paradoxes. Thanks for watching.
Hi. This tutorial covers a type of paradox called Simpson's Paradox. Let's just start with a definition of what Simpson's Paradox is first. Simpson's Paradox occurs when two data sets are subdivided and a numerical measure for the first data set is consistently higher than for the second in every subdivision, but when the subdivisions are combined, the numerical measure of the second data set is higher than that of the first.
Let's take a look at an example now. It involves baseball. Consider two Boston Red Sox baseball players, Mike Lowell and Jacoby Ellsbury. In 2007, Ellsbury hit with a 0.353 batting average. If you're not familiar with baseball, what batting average means is that, if we change this to a percent, that would end up being 35.3%.
And what that means is that, in 35.3% of Ellsbury's at-bats, he ended up getting a hit. So he got on base. Then, in 2008, he hit 0.280. That means that in 28% of his at-bats, he got a hit. In the same years, Lowell, in 2007, hit with a 0.324 batting average, and in 2008, he hit 0.274.
If we compare the two, Ellsbury had a higher average in 2007. He also had the higher average of the two players in 2008. So Ellsbury would have a better two-season batting average, right? Wrong. Let's see where Simpson's Paradox comes into play here.
If we take a look, now, at just the raw numbers, instead of just giving the percentages, we're going to give both the number of hits and the number of at-bats. In 2007, Ellsbury had 41 hits in 116 at-bats. In 2008, he had 155 hits in 554 at-bats. And then, in the 2007-2008 category here, we're simply adding up the hits to get 196 and the at-bats to get 670.
Now, if you look at Lowell, we have the same breakdowns here. But you can see that Lowell's two-year batting average was 0.304 compared to Ellsbury's 0.293. So, although Ellsbury hit for a higher average in 2007, and we can see that that average is a lot higher, his sample size, which in this case was his number of at-bats, was much lower than Lowell's.
Even though he had a much higher average, it was based on far fewer at-bats. So really, what happens is that Lowell's 2007 batting average, which is pretty high, was weighted a lot more in his two-year average than Ellsbury's 2007 average was in his, because it covers so many more at-bats.
This is an example of Simpson's paradox because, although it looks like Ellsbury hit better in both years, his combined average was not better because of this small sample size here.
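The arithmetic behind this example can be checked with a short Python sketch. Ellsbury's hits and at-bats are the ones given above. Lowell's raw counts are not stated in this transcript, so the figures used below (191 hits in 589 at-bats in 2007, 115 hits in 419 at-bats in 2008) are an assumption taken from his published season totals, and they do match the averages quoted above. The function names are illustrative.

```python
# Each season maps to (hits, at_bats).
ellsbury = {2007: (41, 116), 2008: (155, 554)}

# Assumption: Lowell's raw counts are not in the transcript; these values
# reproduce his quoted averages of 0.324 and 0.274.
lowell = {2007: (191, 589), 2008: (115, 419)}


def batting_average(hits: int, at_bats: int) -> float:
    """Fraction of at-bats that ended in a hit."""
    return hits / at_bats


def combined_average(seasons: dict) -> float:
    """Total hits over total at-bats, so bigger seasons count for more."""
    total_hits = sum(hits for hits, _ in seasons.values())
    total_at_bats = sum(at_bats for _, at_bats in seasons.values())
    return total_hits / total_at_bats


for year in (2007, 2008):
    e = batting_average(*ellsbury[year])
    l = batting_average(*lowell[year])
    print(f"{year}: Ellsbury {e:.3f}, Lowell {l:.3f}")

print(f"2007-2008: Ellsbury {combined_average(ellsbury):.3f}, "
      f"Lowell {combined_average(lowell):.3f}")
```

Ellsbury comes out ahead in each single season, but his combined figure is 196 hits in 670 at-bats, which rounds to 0.293, while Lowell's combined figure under these assumed counts rounds to 0.304.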
Sometimes, data collected from different-sized samples can be compared. If the difference in sample size is great, Simpson's Paradox might present itself. So be a little wary if someone is combining or comparing data sets with much different-sized samples. So, beware. Simpson's Paradox can be used to intentionally distort and misrepresent data.
In the baseball example, it's probably not the end of the world if you did make that conclusion. But sometimes data can be aggregated like that to show what people want to show, instead of showing the big picture of things. That has been the tutorial on Simpson's Paradox. Thanks for watching.