Significance Level and Power of a Hypothesis Test

Source: Graphs created by Jonathan Osters

In this tutorial, you're going to learn about the significance level of a hypothesis test. First, we should understand what we mean by statistical significance. When we calculate a test statistic in a hypothesis test, we can calculate the p-value. The p-value is the probability that we would have obtained a statistic at least as extreme as the one we got, that is, at least as far away from our hypothesized value, given that the null hypothesis is true. So it's a conditional probability.
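The video works from graphs, but the same idea can be sketched in a few lines of code. This is a minimal illustration with made-up numbers (a hypothesized mean of 100, a known standard deviation of 15, and a sample mean of 106 from a sample of 36), using Python's standard-library normal distribution:

```python
from statistics import NormalDist

# Hypothetical numbers: H0 says the population mean is 100, with a known
# standard deviation of 15, and a sample of n = 36 gives x-bar = 106.
mu0, sigma, n, xbar = 100, 15, 36, 106

z = (xbar - mu0) / (sigma / n ** 0.5)        # test statistic
# Two-tailed p-value: the probability of a statistic at least this far
# from mu0, in either direction, assuming the null hypothesis is true.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 4))
```

A small p-value means a sample mean like this would rarely occur if the null hypothesis were true.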

Now, sometimes we're willing to attribute whatever difference we found between our statistic and our parameter to chance. If we're willing to write off the difference between our statistic and our hypothesized parameter as chance variation, we fail to reject the null hypothesis. If we're not, if the statistic is just too far away from the hypothesized mean to attribute to chance, we're going to reject the null hypothesis in favor of the alternative.

So this is what it might look like for a two-tailed test. The hypothesized mean is right here in the center of this normal distribution. Anything two standard deviations or more away, we might consider to be too far away. In that case, anything in here we'd attribute to chance and fail to reject the null hypothesis, whereas anything we consider just too far away, too high or too low for us to continue believing this is the real mean, will lead us to reject the null hypothesis. And again, this is all assuming that the null hypothesis is true.

But now think about this. This entire curve assumes that the null hypothesis is true. Yet we make a decision to reject the null hypothesis anyway if the statistic we got is far away, because a statistic like that would rarely happen by chance. Technically, though, that's still the wrong decision if the null hypothesis is true, which says that the mean is right here.

And so there's this idea that we're comfortable making some error sometimes. Rejecting the null hypothesis in error, rejecting when the null hypothesis is in fact true, is called a Type I error. But the nice thing is we get to choose how big we want the probability of this error to be.

If you remember the blue sections of that normal curve, we could have said, all right, we deem "too far away" to be three standard deviations from the mean on either side. We get to choose how large we want this error probability to be. The value that says we only want to be wrong 1% of the time, or that we're OK rejecting the null hypothesis in error 5% of the time, is called the significance level. And we denote it with the Greek letter alpha.

When we choose how big we want alpha to be, we do it before we start the test. We do it this way to reduce bias, because if we had already run the test, we could choose an alpha level that would automatically make our result seem more significant than it is. And we don't want to bias our results that way.

So let's take a look back at this visual. The alpha in this case is 0.05. If you recall, the 68-95-99.7 rule says that 95% of the values will fall within two standard deviations of the mean, meaning that 5% of the values will fall outside those two standard deviations. And so we will reject the null hypothesis 5% of the time: the most extreme 5% of cases are the ones we will not be willing to attribute to chance variation from the hypothesized mean.
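The "two standard deviations" cutoff from the 68-95-99.7 rule is an approximation; the exact critical value for alpha = 0.05 can be pulled from the normal distribution. A quick sketch:

```python
from statistics import NormalDist

alpha = 0.05
# Two-tailed critical value: the cutoff beyond which the most extreme
# alpha proportion of sample means falls, split alpha/2 into each tail.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
print(round(z_crit, 2))   # about 1.96, close to the rule-of-thumb 2
```

So the rejection region for alpha = 0.05 actually starts about 1.96 standard errors from the hypothesized mean, which the rule of thumb rounds to 2.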

Now, the level of significance will depend on the type of experiment that we're doing. So suppose that we are trying to bring a drug to market. And we want to be really, really cautious about how often we reject the null hypothesis. We will reject the null hypothesis if we're pretty darn certain that the drug will work.

We don't want to reject the null hypothesis of the drug not working in error and thereby giving the public a drug that doesn't work. So if we want to be really cautious and not reject the null hypothesis in error very much, we'll choose a low significance level, like 0.01. This means that only the most extreme 1% of cases will have the null hypothesis rejected.

Now, if we don't believe a Type I Error is going to be that bad, we might allow the significance level to be something higher, like 0.05 or 0.10. Now, those still seem like low numbers. But think about what that means. This means that one out of every 20, or one out of every 10 samples of that particular size will have the null hypothesis rejected even when it's true.

So are we willing to make that mistake one out of every 20 times or once every 10 times? Or are we only willing to make that mistake one out of every 100 times? And so setting this value to something really low reduces the probability that you make that error.
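One way to see that alpha really is the long-run rate of this mistake is a simulation. This sketch (hypothetical numbers, null hypothesis deliberately made true) repeatedly samples and counts how often we reject in error:

```python
import random
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is reproducible
mu0, sigma, n, alpha = 100, 15, 36, 0.05   # hypothetical numbers
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

trials, rejections = 10_000, 0
for _ in range(trials):
    # Sample from a world where H0 is TRUE: the mean really is mu0
    xbar = sum(random.gauss(mu0, sigma) for _ in range(n)) / n
    z = (xbar - mu0) / (sigma / n ** 0.5)
    if abs(z) > z_crit:      # rejecting here is always a Type I error
        rejections += 1

print(rejections / trials)   # should land near alpha = 0.05
```

About one sample in twenty gets the null hypothesis rejected even though it's true, which is exactly what choosing alpha = 0.05 means.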

Now, the only thing to note here is that you don't want alpha to be too low. We have to be careful how low we set it, because as we lower the probability of a Type I error, we actually increase the probability of a Type II error. A Type II error is failing to reject the null hypothesis when a difference really does exist. This reduces the power, or the sensitivity, of our significance test, meaning that if our alpha level is set too low, we will not be able to detect very real differences from the null hypothesis when they do in fact exist.

So to recap, the probability of a Type I error is a value that we get to choose in a hypothesis test. We call it the significance level. It's denoted with the Greek letter alpha.

And choosing a big one allows us to reject the null hypothesis more often, though the problem is that sometimes we reject the null hypothesis in error: when a difference really doesn't exist, we say that it does. If we choose a really small one, we reject the null hypothesis less often, and sometimes we fail to reject the null hypothesis in error as well. So there's no foolproof method here. But we usually like to keep our significance levels low, something like 0.05 or 0.01; 0.05 is the default choice for most hypothesis testing.

So we talked about the significance level, also called alpha, also called the probability of a Type I Error. Good luck. And we'll see you next time.

Source: Graphs created by Jonathan Osters

In this tutorial, you're going to learn about the power of a hypothesis test. Now, you might wonder, what is power? Well, power is the ability of a hypothesis test to detect a difference that is present.

So this is the standard null hypothesis curve. The mean from the null hypothesis is here in the middle, and there are two tails, two rejection regions. Beyond these lines we reject the null hypothesis; between them, we fail to reject it. So these are our lines in the sand for a two-sided test.

So suppose that the mean is actually all the way out here. Because the mean is actually different from the one in the null hypothesis, we should reject the null hypothesis. What we end up with is a curve identical in shape to the original normal curve, but centered at this actual mean.

But take a look at these two curves: the first one is the way we thought the data should behave based on the null hypothesis, and this new one is the way the data is actually going to behave.

This line in the sand still exists. Because we should be rejecting the null hypothesis, this area over here is a mistake: failing to reject the null hypothesis, landing to the left of the line in the sand, is wrong if this is actually the mean, which is different from the null hypothesis's mean. So this is a Type II error.

This area on the other side where we are correctly rejecting the null hypothesis when a difference is present is called power. So power is the probability of rejecting the null hypothesis correctly so rejecting when, in fact, the null hypothesis is false. It's a correct decision.
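Since power is just the probability that the sample mean lands in a rejection region when the true mean is somewhere else, it can be computed directly for a z-test. A sketch with hypothetical numbers (null mean 100, true mean 108, sigma 15, n = 36):

```python
from statistics import NormalDist

# Hypothetical numbers: null mean 100, true mean 108, sigma 15, n = 36
mu0, mu1, sigma, n, alpha = 100, 108, 15, 36, 0.05
se = sigma / n ** 0.5                        # standard error of x-bar
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
shift = (mu1 - mu0) / se                     # true mean's distance from mu0, in SEs

# Power: the chance x-bar lands in a rejection region when the true mean is mu1
power = (1 - NormalDist().cdf(z_crit - shift)) + NormalDist().cdf(-z_crit - shift)
print(round(power, 3))
```

With these numbers the test catches the real difference a little under 90% of the time; the remaining probability is the Type II error, the orange area.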

As for calculating power: we could do it by hand, but it's almost always done using technology. So we're not going to be responsible for calculating power; we can have technology do it for us.

What we should understand, though, is that there are two different ways to increase the power of a hypothesis test. One way: imagine if these normal curves were skinnier. That would move the line in the sand, the critical value, over to the left.

If both of these normal curves were skinnier, they would look like this. So they are now identical, but they're skinnier than they were before. And notice there's a lot less orange space and a lot more yellow space.

How do we do that? How do we make these curves skinnier? Well, we decrease their standard deviation. The standard error is the standard deviation of, in this case, x-bar, and it equals sigma over the square root of n. Since n is in the denominator, making n bigger makes the standard error go down, which means these curves will have less spread, and there will be less overlap of this curve with that curve.
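You can watch the curves "get skinnier" numerically. This sketch (same kind of hypothetical z-test numbers as before, with a true mean of 104) computes power at a few sample sizes:

```python
from statistics import NormalDist

def power(n, mu0=100, mu1=104, sigma=15, alpha=0.05):
    """Power of a two-sided z-test when the true mean is mu1 (hypothetical numbers)."""
    se = sigma / n ** 0.5                    # standard error shrinks as n grows
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = (mu1 - mu0) / se
    return (1 - NormalDist().cdf(z_crit - shift)) + NormalDist().cdf(-z_crit - shift)

for size in (25, 100, 400):
    print(size, round(power(size), 3))       # power climbs as n grows
```

Quadrupling the sample size halves the standard error, and the same real difference of 4 units becomes much easier to detect.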

Now, the problem with increasing sample size is that maybe you have logistical constraints like time or money. So you have to make those decisions if you're the person actually doing the sampling. But if you increase the sample size, and it's worth it to you, then go ahead and do it because that will increase the power of the test.

Now suppose we didn't change it and kept the sample size the same. What else is there to do to increase the amount of yellow space and decrease the amount of orange space? We could actually just literally move the critical value in closer to mu from the null hypothesis. We could move these critical values, these gates, in. So take a look. That's what we're going to do right here.

If you notice, the amount of blue space increased. Notice the amount of yellow space is also bigger than it was before. I'm going to transition back and forth. That's what it was before. That's what it is now. More yellow space, less orange space. Now, what did we do by moving these in? We actually increased the amount of blue area, which means that we increased the significance level.

We learned in a different tutorial that the significance level is the probability of a Type I error. So in essence, we're just trading one area for the other: by decreasing the amount of orange space, we're decreasing the probability of a Type II error, but we're increasing the probability of a Type I error. In certain situations, we have to decide whether that trade is actually worth it to us.
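The trade can be put in numbers. This sketch (same hypothetical z-test setup, n fixed at 100) shows power and Type II error probability at three alpha levels:

```python
from statistics import NormalDist

mu0, mu1, sigma, n = 100, 104, 15, 100       # hypothetical numbers
se = sigma / n ** 0.5
shift = (mu1 - mu0) / se

results = {}
for alpha in (0.01, 0.05, 0.10):
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    pwr = (1 - NormalDist().cdf(z_crit - shift)) + NormalDist().cdf(-z_crit - shift)
    results[alpha] = pwr
    print(alpha, "power:", round(pwr, 3), "Type II prob:", round(1 - pwr, 3))
```

Raising alpha buys power (less orange space) at the cost of more false rejections (more blue space), which is exactly the trade described above.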

So to recap, the power of a significance test is the probability that the null hypothesis is rejected, given that it is really false. So we're using the alternate normal curve as opposed to the one from the null hypothesis.

There are two main ways to increase the power of the significance test. One is by increasing the sample size, which then decreases the standard error of the distribution, or you can increase the significance level alpha. Both of these have benefits by increasing the power, but they both have negative trade-offs. So we talked about power and all the different things that we can do to increase it. Good luck. And we'll see you next time.