Source: Tables and Graphs created by Author, Benford's Law, Public Domain: http://commons.wikimedia.org/w/index.php?title=File:Rozklad_benforda.svg&page=1 @2:26
This tutorial is going to speak to you about statistical paradoxes. Now, paradoxes are apparent contradictions in what you see versus what you expected to see. Specifically, the one that we're going to talk about in this tutorial is called Benford's Law.
So statistics allows us to draw conclusions about things that we see. But sometimes, the phenomena that we see are counter to what we thought would happen. So these seeming contradictions are called paradoxes. And we can understand them better. And when we understand them better, we can improve as statistical thinkers.
So suppose that you were going to create a phony checking account. And you wanted to set it up so that you could steal some money from people. I don't know. Would it matter what number you started with? Think about that for a second.
Probably what you thought was no. It really wouldn't matter. All the numbers one through nine are equally likely to be selected for the first number. So if the account number is going to be randomly selected, it doesn't really matter. That's your intuition.
What's really the case, though, is that our intuition that all the numbers one through nine are equally likely to be selected for the first number is actually wrong. What's really the case is that they're most likely to start with one. Checking account numbers, and pretty much any number you see in published data, are most likely to start with a one.
And this is the idea of Benford's Law. Benford's Law says that the first digit of financial accounts (that's what he said, but it holds for most any real-life data) follows a pattern, with the number one being the most likely, two being the next most likely, and so on, specifically in this order. So only about 10% of account numbers will start with a four, whereas about 30% will start with a one.
And apparently, what people do who try and steal identities is they'll actually load it up on fours, fives, and sixes, because they think those are the middle. But really, it's the number one that's the most likely.
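The percentages quoted above match the usual statement of Benford's Law, in which a first digit d shows up with probability log10(1 + 1/d). The tutorial itself doesn't give this formula, so take the following as a minimal Python sketch under that assumption. The function name is made up for illustration.

```python
import math

def benford_first_digit_probs():
    """Return {digit: probability} for first digits 1 through 9."""
    # Assumed form of Benford's Law (not stated in the tutorial):
    # P(d) = log10(1 + 1/d)
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit_probs()
for digit, p in probs.items():
    print(f"{digit}: {p:.1%}")
```

Under this formula, one comes out at roughly 30% and four at just under 10%, which lines up with the figures in the tutorial.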
And Benford took a look at lots of these. The area of rivers, populations of different countries. Just regular old constants that are used in physics and things like that. Numbers that happen to appear in the newspaper. The specific heat values of different things.
And he just took a look at all sorts of these different values and saw that, almost across the board, one is the most likely lead number, two is the next most likely, and nine is the least likely. It wasn't here, but in the vast majority of cases it is.
So once you wrap your mind around that, you might think to yourself, oh, all right, well if the lead number follows that law, then the second digit must also follow that law. But again, your intuition leads you astray. The first digit follows Benford's Law.
The second digit is, if you take a look, approximately equally likely to be any of the numbers zero through nine. Apparently, a little bit favoring the lower numbers, but all of these are about 10%. So the second digit, it's about equal frequency. It's approximately a uniform distribution here, whereas this Benford's Law distribution was very, very heavily skewed if you remember that histogram.
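One way to check the "about 10%" claim is to add up, for each second digit, the Benford probabilities of every two-digit start that ends in it (10 through 99). This is a sketch that assumes the general leading-digits form of the law, P(n) = log10(1 + 1/n), which the tutorial doesn't spell out. The function name is hypothetical.

```python
import math

def benford_second_digit_probs():
    """Return {digit: probability} for second digits 0 through 9."""
    probs = {}
    for second in range(10):
        # Sum over every possible first digit paired with this second digit.
        # Assumption: a number starts with the two digits n = 10*first + second
        # with probability log10(1 + 1/n).
        probs[second] = sum(
            math.log10(1 + 1 / (10 * first + second)) for first in range(1, 10)
        )
    return probs

second_probs = benford_second_digit_probs()
for digit, p in second_probs.items():
    print(f"{digit}: {p:.1%}")
```

The values all sit in a narrow band around 10%, with zero slightly ahead of nine, which fits the description of a nearly uniform distribution that favors the lower digits a little.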
Now, the reason that you see something like this has to do with exponential growth. If you take a look, these are the ranges 100 to 200, 200 to 300, and 300 to 400. These are, you can see, equally spaced.
But if you take a look at the numbers that create them on the curve, there's a lot more numbers here on the x-axis that create a value between 100 and 200 versus create a number between 200 and 300. And you can see that that amount diminishes as you move along to the right.
So these are the numbers that create a one. That's about 30% of these numbers that are highlighted. Fewer than that create a two. Fewer still create a three. And fewer again create a four, all the way down to nine. These are the ones that create numbers between 900 and 1,000.
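The picture just described can be imitated with a short simulation: walk along the x-axis in equal steps, read off the height of an exponential curve, and count how often each lead digit comes up. The curve used here, 100 * 10**t for t between 0 and 1, is an assumption chosen so the values run from 100 up to just under 1,000. The tutorial's actual curve isn't specified, and the function name is made up.

```python
import math
from collections import Counter

def leading_digit_shares(steps=100_000):
    """Share of evenly spaced x-values whose curve height starts with each digit."""
    counts = Counter()
    for i in range(steps):
        t = i / steps              # evenly spaced position on the x-axis, in [0, 1)
        value = 100 * 10 ** t      # assumed exponential curve: 100 up to just under 1,000
        counts[int(value // 100)] += 1  # the hundreds digit is the lead digit here
    return {d: counts[d] / steps for d in range(1, 10)}

shares = leading_digit_shares()
for digit, share in shares.items():
    print(f"{digit}: {share:.1%}")
```

With this setup, the x-values that land between 100 and 200 make up about 30% of the total, and the share shrinks for each later band, the same way the highlighted sections get narrower as you move to the right.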
And so to recap, a paradox is a seeming contradiction between what you think should happen versus what's actually happening. The First Digit Law, which is Benford's Law, is one of these paradoxes. We thought that we would find a uniform distribution among first digits of certain numbers that we see. But apparently, one is a lot more common to lead than anything else.
Not all numbers occur with equal frequency as the lead digit. And once we understand paradoxes more, as we'll see in other tutorials that will go into more paradoxes in greater detail, we'll hone our statistical thinking and get more and more precise. So the terms we used were paradox and Benford's Law in specific. Good luck, and we'll see you next time.
Benford's Law: A law that shows that most of the numbers that are published, regardless of topic, begin with smaller digits, and very few of them lead with larger digits. The most common first digit is 1, the least common is 9.
Paradox: An apparent contradiction between what our intuition tells us and what is true in reality.