Source: Top Hat; Creative Commons: http://commons.wikimedia.org/wiki/File:Chapeauclaque.png ; Pool balls, tables and graph created by Jonathan Osters
In this tutorial, we're going to learn to address and check the conditions for both z-tests and t-tests. The conditions are the same for both, with one difference: you may recall that for z-tests the population standard deviation has to be known, and for t-tests the population standard deviation is unknown.
So when performing a hypothesis test for a population mean there are three conditions. The first has to do with how the data were collected. Were they collected in some random way? A simple random sample is the gold standard. Second, is each observation independent of the others? We're going to be able to verify that mathematically. And third, is the sampling distribution approximately normal? Again, we'll be able to verify that in a number of ways.
So first, were the data collected in some random way? The purpose is to make sure there isn't any bias in the sample. And so, ideally, we'd like to go with a simple random sample from the population, or be able to treat our data as being a simple random sample. So it's the same old "items in a hat, pull a few out" scenario. Cluster samples are typically OK. Stratified random samples are typically OK. It's the randomness that's most important.
Second, the independence condition. We want to make sure that each observation doesn't affect any other observation. So there are a couple of ways to do that. One, which isn't very common, is sampling with replacement. This means when you take a person or an item out of the population, you put them back. And so you can sample them again. That's not typically how we do sampling. Normally, when you're sampling somebody, you don't put them back, and you can't sample them again. For instance, if you're taking a political poll you wouldn't want someone's opinion counted twice. So we need a little bit of a workaround for independence. The workaround is that the population has to be large. So if we're sampling without replacement we have to check that the sample is no more than 10% of the population. So if we multiply our sample size by 10, the population has to be at least that big in order for us to say that the observations are pretty much independent of each other.
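If you'd like to see that 10 times check as a quick calculation, here's a minimal Python sketch. The function name and the sample and population sizes are made up for illustration; they don't come from any of the examples in this tutorial.

```python
def meets_ten_percent_condition(sample_size: int, population_size: int) -> bool:
    """Return True when the population is at least 10 times the sample size."""
    if sample_size <= 0 or population_size <= 0:
        raise ValueError("sizes must be positive")
    return population_size >= 10 * sample_size


# Hypothetical numbers: a sample of 20 drawn without replacement.
print(meets_ten_percent_condition(20, 150))   # population smaller than 10 * 20
print(meets_ten_percent_condition(20, 5000))  # population far larger than 10 * 20
```

The first call falls short because 150 is smaller than 200, and the second passes easily.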
And then, finally, is the sampling distribution approximately normal? The distribution of sample means-- the sampling distribution-- will be nearly normal in two cases. One is if the sample size is 30 or above. The central limit theorem says that the sampling distribution of sample means will be approximately normal when the sample size is large. For most distributions, "large" means a sample size of 30 or more. The other case is if the parent distribution-- the distribution of values from which we got our data-- is normal. Then the sampling distribution of sample means will also be normal, regardless of the sample size.
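To get a feel for what the central limit theorem is claiming, here's a small simulation sketch in Python. Everything in it is an assumption chosen for illustration: the parent distribution is a right-skewed exponential, the sample sizes are 2 and 30, and skewness is used as a rough measure of how far from symmetric the sample means are.

```python
import random
import statistics


def skewness(values):
    """Average cubed z-score; close to 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_sample_means(sample_size, num_samples=5000, seed=1):
    """Draw many samples from a skewed parent and return each sample's mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


for n in (2, 30):
    means = simulate_sample_means(n)
    print(f"n = {n:>2}: skewness of sample means = {skewness(means):.2f}")
```

The skewness for samples of size 30 should come out much closer to 0 than for samples of size 2, which is the sampling distribution becoming more symmetric as the sample size grows.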
So there are two ways to verify that the parent distribution is normal. One is, if we're lucky, it might be stated within the context of the problem. If you're actually doing this in real life, though, it would be hard to verify that for sure. If it isn't stated, then we actually have to look at our data. Graph the data in a histogram or a dot plot. And what we're looking for is approximate symmetry, a mound shape, and a lack of outliers. So we're looking for something that looks normal-ish when we graph the data.
So this is what you might have to do more often than not. And this is only necessary if the sample size is less than 30. If the sample size is 30 or more then the central limit theorem applies, and we actually don't have to graph the data.
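Here's a rough sketch of that "graph it and look" step in Python, using a text dot plot. The data values are invented for illustration and assumed to be whole numbers. The 1.5 times IQR rule for flagging outliers is a common convention that I'm assuming here; the tutorial itself only asks you to judge the graph by eye.

```python
import statistics
from collections import Counter


def text_dot_plot(values, bin_width):
    """Return one line per bin, with one '*' per observation in that bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    start = min(counts)
    while start <= max(counts):
        lines.append(f"{start:>6} | " + "*" * counts.get(start, 0))
        start += bin_width
    return lines


def iqr_outliers(values):
    """Flag values more than 1.5 IQRs outside the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


# Made-up sample of 12 whole-number observations.
hypothetical_data = [12, 14, 15, 15, 16, 16, 16, 17, 17, 18, 20, 35]
print("\n".join(text_dot_plot(hypothetical_data, bin_width=5)))
print("possible outliers:", iqr_outliers(hypothetical_data))
```

In this made-up sample the value 35 sits far from the mound, so it gets flagged, and you'd hesitate to treat the parent distribution as normal.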
So let's look at one example. Many customers pay attention to the nutritional contents on packaged foods. So it's important that the information be accurate. In this example, we're looking at the calorie contents of some frozen dinners. And the reported calorie content was 240. So check to see if the conditions for inference are met. Maybe we want to see-- do the data support or refute the idea that the calorie content is in fact 240. So let's see if the conditions are, in fact, met.
First, it does say that it was a random sample. So the randomness condition is met. Here are the three conditions, compacted. The condition for randomness. How were the data collected? Randomly. That is met. The independence condition. Is the population at least 10 times the sample size? Well, I would hope so. There have to be at least 120 frozen dinners that this company has made in their frozen dinner line. That seems reasonable to assume.
And then, lastly, is the sample size at least 30? In this problem it wasn't. Is the parent distribution normal? Well, it didn't say that in the problem either. So we have to work through the actual graphing of the data, and I've done that here. I've graphed the data in a histogram. As you can see, it's approximately symmetric, mound-shaped. So the idea that this could have come from a normal distribution is a reasonable assumption for us to be making. We don't know that the parent distribution is normal, but the parent distribution could have been normal. And so we'll go ahead with the problem, having made that assumption. So the three conditions have been checked and verified.
Let's look at one more example. Renee wants to know the average weight of women at her health club. She stands at the door and asks the first 20 women who enter if they'll step on the scale. And these are the weights of the women who said yes. Are the conditions for inference met here?
Well, in this case, no. The data were not collected randomly. Look back at how the data were collected. She stands at the door and asks the first 20 women who enter if they'll step on the scale. Maybe not all 20 actually did it, or maybe these are the first 20 women who said yes. Ultimately, the sample was a convenience sample, not a random sample, and it probably suffers from voluntary response bias, if you think about it.
Women who are heavier might be more self-conscious about stepping on the scale to give Renee their weight. So the sample may be biased in a way that underestimates the average weight of women in the health club. Ultimately, there's no rescuing this. We can't do a test of significance. We can't do a confidence interval. There's no rescuing poorly collected data. No fancy formula is going to help. So we actually don't need to check the remaining conditions, because inference will not be appropriate for the data, even if the other two conditions were met.
So to recap. The conditions for running a hypothesis test-- a z-test or a t-test-- are as follows. The randomness condition. How were the data collected? They should be collected in some kind of random way. If they weren't, you actually can't proceed. Independence. Is the population large in comparison to your sample, at least 10 times your sample size? And normality. Remember, there are three ways to do this. You should reference the central limit theorem if there are at least 30 observations in your sample. Or, there are two ways to verify that the parent distribution is normal. Either it will say so in the problem, if we get lucky, or we have to actually graph the data and look for a mound-shaped, approximately symmetric, single-peaked distribution of data with no outliers.
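The recap can be written down as a short checklist in Python. This is only a sketch: the function name, the argument names, and the sizes in the example calls are all hypothetical, and the "graph looks normal" input stands in for a judgment you make by eye.

```python
def check_conditions(collected_randomly, sample_size, population_size,
                     parent_stated_normal=False, graph_looks_normal=False):
    """Walk through the randomness, independence, and normality conditions."""
    if not collected_randomly:
        # Poorly collected data can't be rescued, so stop right here.
        return {"randomness": False, "independence": None,
                "normality": None, "proceed": False}

    # Independence: the population is at least 10 times the sample size.
    independence = population_size >= 10 * sample_size

    # Normality: large sample (central limit theorem), or a parent
    # distribution that is stated to be normal or looks normal when graphed.
    normality = (sample_size >= 30 or parent_stated_normal
                 or graph_looks_normal)

    return {"randomness": True, "independence": independence,
            "normality": normality,
            "proceed": independence and normality}


# Hypothetical cases: a non-random sample, then a small random sample
# whose graph looked roughly normal.
print(check_conditions(False, sample_size=20, population_size=1000))
print(check_conditions(True, sample_size=12, population_size=5000,
                       graph_looks_normal=True))
```

The early return mirrors the point that when the data weren't collected randomly, there's no reason to check the other two conditions.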
And so those are the z-test conditions and t-test conditions. Remember, there was one additional condition for the z-test that required that we know the population's standard deviation.
Good luck, and we'll see you next time.