Paradoxes are apparent contradictions between what you see and what you expected to see. This tutorial covers two of them: Benford's Law, covered in the next section, and Simpson's Paradox.
Statistics allows us to draw conclusions about things that we see. Sometimes, though, the phenomena that we see are counter to what we thought would happen. These seeming contradictions are called paradoxes. If we understand them better, we can improve as statistical thinkers.
EXAMPLE: Suppose that you were going to create a phony checking account and you wanted to set it up so that you could steal some money from people. To do so, you would need to create a checking account number for this phony account. For your fake account number, would it matter what digit you selected as the first digit?
Benford's Law, also called the First Digit Law, says that the first digit of most real-life data, including financial reports, follows a pattern: 1 is the most likely first digit, 2 is the next most likely, and so on, with each larger digit less likely than the one before it.
This law shows that only about 10% of account numbers will start with a four, whereas about 30% will start with a one.
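The percentages above can be reproduced from the usual statement of Benford's Law, in which the probability of a leading digit d is log10(1 + 1/d). The formula is not given in this tutorial, so treat the following as a minimal sketch; the function and variable names are illustrative.

```python
import math


def first_digit_probability(d: int) -> float:
    """Benford's Law probability that the leading digit equals d (1 through 9)."""
    return math.log10(1 + 1 / d)


# Probability of each possible leading digit.
first_digit_probs = {d: first_digit_probability(d) for d in range(1, 10)}

for d, p in first_digit_probs.items():
    print(f"{d}: {p:.1%}")
```

Under this formula, a leading 1 comes out at roughly 30.1% and a leading 4 at roughly 9.7%, which matches the "about 30%" and "about 10%" figures quoted above.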
People who try to steal identities are likely to use more 4's, 5's, and 6's, because they think of those as the middle digits. In reality, 1 is the most likely leading digit.
Benford examined many different sets of data, including:
Benford looked at these different values and saw that almost across the board, 1 is the most likely lead number, 2 is the next most likely lead number, and 9 is the least likely lead number.
|Digit||First Digit||Second Digit|
As you can see in the table above, the second digit is approximately equally likely to be any of the digits 0-9. There is a slight favoring of the lower digits, but all are about 10%. The second digit has a roughly equal frequency.
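The near-uniform second digit can be checked with the same logarithmic rule. Under the standard generalization of Benford's Law (an assumption here, since this tutorial does not state it), the probability that the second digit is d is found by summing over every possible first digit k the probability of the two-digit prefix 10k + d. A minimal sketch, with illustrative names:

```python
import math


def second_digit_probability(d: int) -> float:
    """Benford's Law probability that the second digit equals d (0 through 9)."""
    # For each possible first digit k, the number starts with the prefix 10*k + d.
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))


# Probability of each possible second digit.
second_digit_probs = {d: second_digit_probability(d) for d in range(10)}

for d, p in second_digit_probs.items():
    print(f"{d}: {p:.1%}")
```

The values fall gently from about 12% for a second digit of 0 to about 8.5% for a second digit of 9, which is the slight favoring of lower digits described above.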
The reason for a phenomenon like the one you saw in the above examples has to do with exponential growth, which looks like this:
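As you can see, the 100-200, 200-300, and 300-400 ranges are equally spaced on the y-axis. However, a wider stretch of the x-axis produces values between 100 and 200 than produces values between 200 and 300, and that stretch keeps shrinking for each successive range as you move to the right along the x-axis. A quantity that grows exponentially therefore spends the most time with a leading digit of 1.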
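The exponential-growth argument can be illustrated with a small simulation. The starting value of 100, the 5% growth per step, and the 1000 steps are hypothetical choices made only for this sketch; they do not come from the tutorial.

```python
from collections import Counter


def leading_digit(value: float) -> int:
    """Return the first significant digit of a positive number."""
    # Scientific notation always starts with the leading digit, e.g. "1.050000e+02".
    return int(f"{value:e}"[0])


def leading_digit_counts(start: float, growth_rate: float, steps: int) -> Counter:
    """Count leading digits of a quantity that grows by growth_rate each step."""
    counts = Counter()
    value = start
    for _ in range(steps):
        counts[leading_digit(value)] += 1
        value *= 1 + growth_rate
    return counts


# Hypothetical parameters: start at 100 and grow 5% per step for 1000 steps.
counts = leading_digit_counts(start=100.0, growth_rate=0.05, steps=1000)
total = sum(counts.values())

for digit in range(1, 10):
    print(f"{digit}: {counts[digit] / total:.1%}")
```

Because the quantity takes more steps to climb from 100 to 200 than from 200 to 300, values with a leading 1 should show up most often, and the shares should land near the first-digit percentages of Benford's Law.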
There are many kinds of paradoxes, and Simpson's Paradox is just one of them. Simpson’s Paradox is a relationship that's present in groups but reversed when the groups are combined.
A very famous example of Simpson’s Paradox took place in 1973. That year, UC Berkeley had a sex discrimination lawsuit filed against them that asserted that UCB was favoring men over women substantially in the admissions process for their grad schools. Here is the data:
As you can see, 977 men applied and 492 were accepted, which is a little over half. In contrast, of the 400 women who applied, only 148, well under half, were accepted. In fact, the proportions are about 50.4% versus 37%.
The difference between 37% and 50.4% is huge, which is why the lawsuit was filed. To see exactly where the women were being discriminated against, the lawyers looked into the admissions by department. You would expect that there would be a large discrepancy in certain departments. For this tutorial, we will look at the data for two departments, which we are calling Engineering and English (though the true numbers within these departments may have been different in the real case).
For the Engineering department, you can see that about 63% of men were accepted versus 68% of women. Women were accepted at a higher rate, so the discrepancy was not present in the Engineering department. You might then assume that the discrepancy occurs in the English department. However, women were accepted at a higher rate to the English department as well: 34.9% versus 33.3%.
Women were accepted at higher rates in both the Engineering and English departments, yet at a much lower rate overall. The explanation lies in how the applications were distributed across the two departments: the men's 63% carries far more weight in their overall admission rate than the women's 68% carries in theirs.
Only 25 of the 400 women who applied chose the Engineering department. That's not very many. Therefore, that 68%, even though it's a high percentage, doesn't count nearly as much in the women's weighted average as the 34.9% does. So, the 63% is weighted heavily for the men, whereas the 68% is weighted hardly at all for the women. This is why you see that reversal of association.
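The weighted-average argument can be checked with a few lines of arithmetic. The women's counts below follow from figures quoted in the text (25 Engineering applicants at 68%, and 148 accepted out of 400 overall). The men's per-department counts are not given in the text, so the ones used here are hypothetical values chosen only to be consistent with the stated 63%, 33.3%, and the totals of 492 accepted out of 977.

```python
# (accepted, applied) per department.
# Women's counts are derived from the figures quoted in the text.
women = {"Engineering": (17, 25), "English": (131, 375)}
# Men's counts are HYPOTHETICAL: chosen to match the stated 63%, 33.3%, and 492 of 977.
men = {"Engineering": (353, 560), "English": (139, 417)}


def overall_rate(group: dict) -> float:
    """Acceptance rate with all departments combined."""
    accepted = sum(a for a, _ in group.values())
    applied = sum(n for _, n in group.values())
    return accepted / applied


def weighted_average(group: dict) -> float:
    """Department rates weighted by each department's share of applications."""
    total = sum(n for _, n in group.values())
    return sum((n / total) * (a / n) for a, n in group.values())


for dept in women:
    w_acc, w_app = women[dept]
    m_acc, m_app = men[dept]
    print(f"{dept}: women {w_acc / w_app:.1%}, men {m_acc / m_app:.1%}")

print(f"Overall: women {overall_rate(women):.1%}, men {overall_rate(men):.1%}")
```

With these counts, the Engineering rate gets a weight of only 25/400 in the women's average but 560/977 in the men's, so the women's overall rate is pulled toward the lower English rate even though women lead in each department.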
Source: Adapted from Sophia tutorial by Jonathan Osters.