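In this tutorial, you're going to learn some cautions about hypothesis tests. Specifically, you will focus on three of them: multiple comparisons, choosing a significance level after seeing the results, and verifying the conditions of a test.
These are things you're going to want to pay attention to, because they can jeopardize the conclusions you've worked so hard to reach.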
The first is the problem of multiple comparisons.
IN CONTEXT
Someone thinks that jelly beans cause acne. They ask the scientists to investigate, and the scientists find no link between the two at the 5% significance level. That should settle it, but then the person says maybe it's only a certain color. So, the scientists get back to work. They find no significant link between acne and purple jelly beans, or brown, pink, blue, teal, salmon, red, turquoise, magenta, or yellow. They do find a link with green: green jelly beans are associated with acne at the 5% significance level. Then they go on and find no association for mauve, beige, lilac, black, peach, or orange jelly beans.
In all, they ran 20 significance tests to see whether the different colors of jelly beans were associated with acne. They found one with a significant link. And, of course, they rushed to publish, reporting the result with 95% confidence and only a 5% chance of coincidence.
Because the significance level is the probability of a Type I error, at the 5% significance level about 5% of tests will reject the null hypothesis even when it is true. That means about one in 20 comparisons will register as significant even when there is no real effect. In this scenario, the scientists ran 20 tests on the jelly beans. The likeliest explanation is that none of the jelly beans has any effect whatsoever on acne.
But, just by chance, one of the 20 comparisons came up statistically significant, which is exactly what you would expect when the null hypothesis is true and each test is run at the 5% significance level.
The likeliest scenario is that the supposed association at the 5% significance level was a Type I error. So, when you run multiple tests on the same variable, you can reach conclusions that a single test would not have shown.
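The arithmetic behind the jelly bean example can be sketched in a few lines of Python. This is a minimal illustration, not part of the original lesson: it assumes the 20 tests are independent and that every null hypothesis is true, and the function name simulate_false_positive_rate is made up for this sketch.

```python
import random

ALPHA = 0.05      # significance level used for every test
NUM_TESTS = 20    # one test per jelly bean color

# Probability that at least one of the 20 tests comes up "significant"
# even though every null hypothesis is true (assumes independent tests).
p_at_least_one = 1 - (1 - ALPHA) ** NUM_TESTS

# Expected number of false positives among the 20 tests.
expected_false_positives = ALPHA * NUM_TESTS


def simulate_false_positive_rate(num_studies=50_000, seed=0):
    """Estimate how often a 20-test study yields at least one false positive.

    Each test is modeled as a random draw that falls below ALPHA with
    probability ALPHA, which is how a Type I error behaves when the
    null hypothesis is true.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_studies):
        if any(rng.random() < ALPHA for _ in range(NUM_TESTS)):
            hits += 1
    return hits / num_studies


if __name__ == "__main__":
    print(f"P(at least one false positive) = {p_at_least_one:.3f}")
    print(f"Expected false positives       = {expected_false_positives:.1f}")
    print(f"Simulated rate                 = {simulate_false_positive_rate():.3f}")
```

Under these assumptions, the chance of at least one false positive across 20 tests is about 64%, and the expected number of false positives is one. So a single "significant" color out of 20 is not surprising at all.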
Another mistake you can make is picking your alpha level after having done the test. Picture two scientists having this conversation. "Hey, what was our alpha level?" The other says, "Depends. What was our p-value?" The first says, "Um, 0.07?" The reply: "Let's go with 0.10 then." You can't pick your significance level afterwards and pretend that's what you would have chosen independently.
What they're doing here is choosing a significance level that makes their p-value significant. Because they're choosing it after the fact, the result is biased toward finding significance, and the stated significance level no longer describes the real chance of a Type I error.
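To see why choosing alpha after the fact is a problem, here is a small sketch in Python. It is an illustration under stated assumptions, not something from the original lesson: the candidate levels 0.01, 0.05, and 0.10 are simply common conventions, the function names are hypothetical, and the simulation assumes a test whose p-value is uniformly distributed between 0 and 1 when the null hypothesis is true.

```python
import random

CANDIDATE_ALPHAS = (0.01, 0.05, 0.10)  # conventional levels, assumed for illustration
PLANNED_ALPHA = 0.05                   # level chosen before looking at the data


def rejects_with_planned_alpha(p_value, alpha=PLANNED_ALPHA):
    """Honest decision rule: alpha was fixed before the test was run."""
    return p_value < alpha


def rejects_with_post_hoc_alpha(p_value, alphas=CANDIDATE_ALPHAS):
    """Flawed rule: pick whichever candidate alpha makes the result significant."""
    return any(p_value < alpha for alpha in alphas)


def type_i_error_rate(decision_rule, num_trials=100_000, seed=0):
    """Fraction of true-null experiments that the rule declares significant.

    Each p-value is drawn uniformly from [0, 1), the assumed behavior of
    a p-value when the null hypothesis is true.
    """
    rng = random.Random(seed)
    rejections = sum(decision_rule(rng.random()) for _ in range(num_trials))
    return rejections / num_trials


if __name__ == "__main__":
    # The p-value from the conversation: not significant at the planned level,
    # but "significant" once alpha is picked afterwards.
    print(rejects_with_planned_alpha(0.07))
    print(rejects_with_post_hoc_alpha(0.07))
    print(type_i_error_rate(rejects_with_planned_alpha))
    print(type_i_error_rate(rejects_with_post_hoc_alpha))
```

With the planned rule, the simulated Type I error rate stays near 5%. With the after-the-fact rule it rises to about 10%, the largest level the scientists were willing to accept, even though they might report a smaller one.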
Lastly, the conditions of a test matter. They're not formalities. You always have to check them and make sure that they're met. You should always have a clear idea of what the model is that you're going to use, like the normal model or the t distribution, and what the conditions are and make sure that you've verified them.
Suppose a sample was taken with bias. If it's not a simple random sample or wasn't selected in some random way, there's no rescuing it. There's no fancy formula that you can use to rescue the fact that the statistics are systematically off from the right answer. You can't rescue it because it wasn't collected in the right way to begin with.
When the randomness condition is not met, you really can't do anything with your data to claim that there's an association, because the conditions for running a test aren't even met.
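A quick simulation can make the point about biased samples concrete. The sketch below is an illustration only: the population (values centered near 50), the particular kind of bias (only members above the median can be chosen), and all function names are assumptions invented for this example.

```python
import random
import statistics


def make_population(size=2_000, seed=0):
    """Hypothetical population of values with mean near 50."""
    rng = random.Random(seed)
    return [rng.gauss(50, 10) for _ in range(size)]


def simple_random_sample(population, n, rng):
    """Every member has the same chance of being selected."""
    return rng.sample(population, n)


def biased_sample(population, n, rng):
    """Illustrative bias: only members above the population median can be selected."""
    cutoff = statistics.median(population)
    eligible = [x for x in population if x > cutoff]
    return rng.sample(eligible, n)


def average_sample_mean(sampler, population, n, repeats=200, seed=1):
    """Average the sample mean over many repeated samples of size n."""
    rng = random.Random(seed)
    means = [statistics.fmean(sampler(population, n, rng)) for _ in range(repeats)]
    return statistics.fmean(means)


if __name__ == "__main__":
    population = make_population()
    print("population mean:", statistics.fmean(population))
    print("random, n=100:  ", average_sample_mean(simple_random_sample, population, 100))
    print("biased, n=25:   ", average_sample_mean(biased_sample, population, 25))
    print("biased, n=400:  ", average_sample_mean(biased_sample, population, 400))
```

In this setup, the simple random samples land close to the population mean, while the biased samples come out roughly 8 units too high whether the sample size is 25 or 400. A larger sample does not fix the problem, because the error comes from how the data were collected.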
What about multiple comparisons, that is, running multiple hypothesis tests on one sample? This can lead to finding significance when there really isn't any.
There are certain cautions about hypothesis tests that you need to keep in mind when doing statistical inference. There are lots of concerns, but these are the major ones. Select a significance level independently of the results of the study; you want to do it beforehand. Check carefully to make sure the conditions for inference are met; they're not just formalities.
And be cautious about multiple comparisons on the same variable. Depending on the significance level, some of those comparisons will have a statistically significant p-value simply by chance, which is what you knew would happen when you picked an alpha level.
Good luck.
Source: This work adapted from Sophia Author Jonathan Osters.
Multiple Comparisons: A problem that occurs when multiple hypothesis tests are performed on a population, leading to an increase in false positive results.