Source: Images created by Author
In this tutorial, you're going to learn some cautions about hypothesis tests. These are things you're going to want to pay attention to because they might jeopardize the conclusions you've worked so hard to find. The first is the problem of multiple comparisons. We'll define it in a second, but first we're going to illustrate it.
So, the first person thinks that jelly beans cause acne. She tells the scientists to investigate, and they find, at the 5% significance level, no link between the two. They say, well, OK, that settles that. She says maybe it's only a certain color. So, they get back to work. And they didn't find any significant link with purple jelly beans, or brown jelly beans, or pink jelly beans, or blue jelly beans. Or teal, or salmon, or red, or turquoise, or magenta, or yellow. Or grey, or tan, or cyan. They found a link with green. They found that green jelly beans had a link with acne at the 5% significance level, but then they went on and found no association with mauve jelly beans, beige, lilac, black, peach, or orange.
So, in all, they ran 20 significance tests to see whether these different colored jelly beans were associated with acne. They found one that had a significant link. And, of course, they rushed to publication. And the headline says it's with 95% confidence, only a 5% chance of coincidence.
Now, this is a comic that illustrates the problem of multiple comparisons. Because the significance level is the probability of a Type I error, at the 5% significance level about 5% of tests will reject the null hypothesis even when the null hypothesis is true. That means that about one in 20 comparisons will register as significant even when there is no real effect. In the comic, the scientists ran 20 tests on the jelly beans. The likeliest explanation here is that none of the jelly beans have any effect whatsoever on acne.
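To put numbers on the "one in 20" idea, here is a minimal sketch of the arithmetic. It assumes the 20 tests are independent and that every null hypothesis is true; the function names are made up for illustration and are not part of the tutorial.

```python
def expected_false_positives(num_tests, alpha):
    """Average number of tests that reject a true null hypothesis."""
    return num_tests * alpha


def prob_at_least_one_false_positive(num_tests, alpha):
    """Chance that at least one of the tests rejects a true null.

    Assumes the tests are independent, so the chance that none of them
    rejects is (1 - alpha) multiplied by itself num_tests times.
    """
    return 1 - (1 - alpha) ** num_tests


# The jelly bean scenario: 20 colors, each tested at the 5% level.
print(expected_false_positives(20, 0.05))
print(round(prob_at_least_one_false_positive(20, 0.05), 4))
```

Under these assumptions, you would expect about one false positive on average, and the chance of seeing at least one is roughly 64%, far higher than the 5% that a single test would suggest.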
But, just by chance, one of the 20 comparisons came up statistically significant, which is exactly what we would expect when the null hypothesis is true and each test is run at the 5% significance level. The likeliest scenario is that the supposed association with green jelly beans was a Type I error. So, when you run multiple tests for the same variable, you can find apparently significant results that a single test would not have shown.
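You can also see this by simulation. The sketch below is an illustration, not the tutorial's own method: it assumes no jelly bean color has any effect, and it uses the fact that a p-value from a test with a continuous test statistic is uniformly distributed between 0 and 1 when the null hypothesis is true. The function name and the number of simulated studies are arbitrary choices.

```python
import random


def share_of_studies_with_a_false_positive(
    num_studies=10_000, num_tests=20, alpha=0.05, seed=0
):
    """Simulate many 20-color studies in which no color has any effect.

    Each p-value is drawn uniformly from [0, 1), which is how p-values
    behave when the null hypothesis is true. Returns the fraction of
    studies in which at least one color looks 'significant'.
    """
    rng = random.Random(seed)
    studies_with_hit = 0
    for _ in range(num_studies):
        p_values = [rng.random() for _ in range(num_tests)]
        if any(p < alpha for p in p_values):
            studies_with_hit += 1
    return studies_with_hit / num_studies


print(share_of_studies_with_a_false_positive())
```

Well over half of the simulated studies turn up at least one "significant" color, even though nothing real is going on in any of them.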
Another mistake you can make is picking your alpha level after having done the test. So, these two scientists are having this conversation. "Hey, what was our alpha level?" And this guy says, "Depends. What was our p-value?" And he says, "Um, 0.07?" He says, "Let's go with 0.10 then." You can't pick your significance level afterwards and pretend that's what you would have chosen independently. What they're doing here is choosing a significance level that makes their p-value significant. They're choosing it after the fact, so the significance level no longer reflects the real chance of a Type I error, and the conclusion is biased toward finding a result.
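Here is a rough sketch of why choosing alpha after the fact is a problem. It assumes the null hypothesis is true (so p-values are uniform between 0 and 1) and that the scientists would pick whichever of the conventional levels 0.01, 0.05, or 0.10 makes the result significant. That set of levels and the function names are assumptions made for illustration.

```python
import random

# Hypothetical menu of levels the scientists might choose from afterwards.
CONVENTIONAL_ALPHAS = (0.01, 0.05, 0.10)


def rejects_with_fixed_alpha(p_value, alpha=0.05):
    """Alpha was chosen before the data were seen."""
    return p_value < alpha


def rejects_with_post_hoc_alpha(p_value, alphas=CONVENTIONAL_ALPHAS):
    """Alpha is chosen after seeing the p-value: any level that works will do."""
    return any(p_value < alpha for alpha in alphas)


def false_positive_rates(num_studies=100_000, seed=0):
    """Return (fixed-alpha rate, post hoc rate) when the null is true."""
    rng = random.Random(seed)
    fixed = 0
    post_hoc = 0
    for _ in range(num_studies):
        p_value = rng.random()  # uniform p-value under a true null
        fixed += rejects_with_fixed_alpha(p_value)
        post_hoc += rejects_with_post_hoc_alpha(p_value)
    return fixed / num_studies, post_hoc / num_studies


print(rejects_with_fixed_alpha(0.07), rejects_with_post_hoc_alpha(0.07))
print(false_positive_rates())
```

With alpha fixed at 0.05 in advance, about 5% of the simulated studies reject a true null. With the pick-it-afterwards rule, about 10% do, because the rule is really just a test at the largest level on the menu.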
And lastly, the conditions of a test matter. They're not formalities. You always have to check them and make sure that they're met. You should always have a clear idea of what the model is that you're going to use, like the normal model or the t distribution, and what the conditions are and make sure that you've verified them.
So, suppose a sample was taken with bias. If it's not a simple random sample or wasn't selected in some random way, there's no rescuing it. There's no fancy formula that we can use to rescue the fact that the statistics are systematically off from the right answer. We can't rescue it because it wasn't collected in the right way to begin with. So, for instance, when the randomness condition is not met, you really can't do anything with your data to claim that there's an association, because the conditions for running a test aren't even met.
And so, to recap, there are certain cautions that need to be taken when doing statistical inference. There are lots of concerns, but the major cautions are these. Select a significance level independently of the results of the study; you want to do it beforehand. Check carefully to make sure the conditions for inference are met; they're not just formalities. And be cautious about multiple comparisons for the same variable. Depending on the significance level, some of those comparisons will have a statistically significant p-value simply by chance, which is what we knew would happen when we picked an alpha level.
And so, the biggest issue that we ran up against was the problem of multiple comparisons. And so, you can think about the jelly bean problem as you think about the problem of multiple comparisons. Good luck and we'll see you next time.
Problem of multiple comparisons: A problem that occurs when multiple hypothesis tests are performed on a population, leading to an increase in false positive results.