Author:
Parmanand Jagnandan

This lesson will implement a z-test for a population proportion to help determine if there is a difference in M&M color ratios between actual store bought bags of M&Ms and what the Mars Co. claims are the actual ratios.

Tutorial

Source: Parmanand Jagnandan (Images from MS ClipArt)

So let's look at how to use z-tests to determine the validity of a population proportion. So here we're going to compare population proportions of brown N&Ns to the actual ratio of brown N&Ns from a bag bought in a store. As you might imagine, I'm just using N&Ns. You can use your imagination what I really mean. I'm trying to avoid copyright infringement here.

So why do we do this? Well, the first thing, of course don't trust anybody. I mean, we want to see if Venus Chocolate Company is actually being truthful. I mean, think about it. If you've seen some of those commercials with the talking candy, I haven't seen one yet, so I don't know how to believe them.

And in addition, and as you might imagine, this technique isn't just related to candy, but it can be applied to many fields, where you're interested in determining if the expected population proportion of one population is accurate or not. So this actually leads us into our first concept we always look at when performing a z-test for a population proportion, and that is the null and alternate hypotheses.

But before that, we need to think about, well, where do the null and alternate hypotheses come from? So we might be given some data from Venus Company that says this is the breakdown of their different colors of chocolate candy. And we see here that we're told that brown accounts for about 13% of the candy that they make. And we're looking at p0 here, which indicates that this is the proportion for the entire population of these candies.

Now, we might do a study ourselves and find out that after opening many, many bags of candy and subjecting myself to diabetes, we find that we're looking at a sample proportion-- or a sample, sorry. And we are looking at a sample proportion because we're just looking at what proportion of these colors make up our sample.

And so here I rounded the numbers off a little bit. And if you're keen, you might add these numbers up and notice that they don't actually add up exactly to 100%. And that's because I rounded these values here. If I hadn't rounded them, they would add up to 100%.

Now, what do I see from the data that I gathered or sampled? I'm seeing that brown here is showing up 17% of the time instead of the claimed 13. So I might have some reason to doubt Venus Company here.

So once we have some data that makes us think, we need to perform a hypothesis test. The first thing we need to do is state our null hypothesis. And so the null hypothesis states what we're testing against. And it assumes that things are working as intended. So there's no difference between the claimed and actual population proportions.

And for my null hypothesis in this case, I'm going to say H-naught. That's how we pronounce this thing here, and that's usually the symbol we use to represent the null hypothesis: that the actual population proportion of brown N&Ns bought in stores is the same as the one claimed by Venus Company. Or in other words, that the actual population proportion is equal to the claimed one. That is, p0 is equal to 0.13. Now, notice here that I wrote 0.13, or 13%. And when you're working with percents, always remember to convert them to decimals when you're trying to solve the problem.

So once I have the alternate hype-- or null hypothesis-- sorry-- I need to state the alternate hypothesis. Now the alternate hypothesis is saying that something isn't right with the null hypothesis. And it's basically saying that the claimed and actual population proportions are not the same in one way or another. So for my case, I'm going to say that the alternate hypothesis is that the actual and claimed population proportions are not the same for brown N&Ns. Or in other words, that p0 is not equal to 0.13.

Now, because I said p0 does not equal 0.13, it could be greater than 0.13, or it could be less than 0.13. And because I have both of these cases, I'm performing what's called a two-tailed test. So if you were to look at the z-distribution, you would be looking at both ends you'd have to consider. Now, if I just considered one of these, say, p0 greater than 0.13, I would be conducting a one-tailed test, where I might be looking over here. Or if I said p0 is less than 0.13, again I'd be performing a one-tailed test where I'd be looking over here.
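Written out in symbols, the two hypotheses might look like the following. Here p stands for the actual proportion of brown N&Ns in stores. That symbol is an assumption, since the lesson says p0 throughout, so treat this as one possible notation rather than the one on the slides.

```latex
H_0 : p = 0.13 \qquad\qquad H_a : p \neq 0.13 \quad \text{(two-tailed)}
```

The one-tailed versions would keep the same null hypothesis and replace the alternate with p > 0.13 or with p < 0.13.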

So before we actually start doing any kind of calculations, we need to do a few things to check to make sure our data can be modeled using a normal distribution, which will allow us to use a z-distribution. So the first thing we need to do is actually check that our sample was taken at random. So all of these samples here were all done at random. And we know that the total sample size was 4,800. And if you're keen and you add these numbers up that I got for my sample, you'll see that it doesn't exactly add to 100%. But again, if you were to-- if I didn't round these values like I did here, you would get 100%.

Another check that we need to do is to make sure that the 10% rule is satisfied. So here, we know that we have 800 brown N&Ns. And in the population of brown N&Ns, there are millions of brown N&Ns. So our 800 is definitely less than 10% of the population of brown N&Ns, so that condition is satisfied as well.

The next thing we need to consider is to make sure our sample can be modeled with this normal distribution. And to do that, we check that n times p0 is greater than or equal to 10. And so what does that mean? Well, n, remember, is our sample size. It's 4,800. And p0 is the population proportion of brown N&Ns. In this case for our problem, we're doing brown, which is 0.13.

Now, when you do that multiplication, you'll find that the result is 624, which is definitely greater than or equal to 10. So that condition checks out.

We also need to do it for n times 1 minus p0. And so 1 minus p0 would be 1 minus 0.13, or 0.87. So we have n times 0.87 or, filling in n, it would be 4,800 times 0.87. And so this would give us a value of 4,176, which is definitely greater than or equal to 10, which is the condition that we have to satisfy as well. So that checks out then that our sample size actually is large enough that we can use the central limit theorem to model our sample proportion with a normal distribution.
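The two sample-size checks can be sketched in a few lines of Python. This is just an illustration using the numbers from the lesson. The helper name large_enough is made up for this example.

```python
# Values taken from the lesson
n = 4800     # total sample size
p0 = 0.13    # claimed population proportion of brown candies

expected_brown = n * p0            # n times p0
expected_not_brown = n * (1 - p0)  # n times (1 minus p0)

# Hypothetical helper: both products must be at least 10 (the lesson's threshold)
def large_enough(n, p0, threshold=10):
    return n * p0 >= threshold and n * (1 - p0) >= threshold

print(round(expected_brown), round(expected_not_brown))  # 624 4176
print(large_enough(n, p0))                               # True
```

With a much smaller sample, say 50 candies, the first product would only be 6.5, and the check would fail.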

So another thing to keep in mind here, too, is that we're dealing with colors here. So we're dealing with categorical data or qualitative data, meaning that we're simply dealing with categories of colors. We're not measuring or able to measure how brown or how yellow something is.

So now let's look at some statistics and parameters for our sample and population. So we know the sample proportion is 800 out of 4,800, which is approximately 17%. The population proportion is 13%, which was given by Venus Company. We know the complement of the population proportion, which is 100 minus 13, or 87%. That's q0. And then we know the total size of the sample is 4,800.

And so now we can calculate the standard deviation of our sample proportion using the equation: the square root of p0 times q0 over n. So when we do that, we have the square root of p0, which is 0.13 (remember to convert your percents to decimals when you're working in the math), times our q0, which is 0.87, and that's all over 4,800. And so when you make that calculation, you should get a value of 0.004854.

Now, we can use that value. Notice that that's the value in the denominator here for our z-score. So that tells us then that to calculate our z-score, we can say z is equal to the sample proportion, which is 0.17, minus the population proportion of 0.13, all divided by 0.004854. And this tells us then that our z-score for our sample is going to be approximately-- and I'm just going to round to the hundredths place-- 8.24.
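To check the arithmetic, here is a minimal Python sketch of the same two calculations. The variable names are chosen for this example only, and the sample proportion is the rounded 0.17 used in the lesson.

```python
import math

# Values taken from the lesson
n = 4800        # total sample size
p0 = 0.13       # claimed population proportion of brown
q0 = 1 - p0     # complement of p0, 0.87
p_hat = 0.17    # sample proportion, rounded as in the lesson (800 / 4800)

# Standard deviation of the sample proportion: sqrt(p0 * q0 / n)
std_error = math.sqrt(p0 * q0 / n)

# z-score: (sample proportion - claimed proportion) / standard deviation
z = (p_hat - p0) / std_error

print(round(std_error, 6))  # 0.004854
print(round(z, 2))          # 8.24
```

If you plug in the unrounded 800 / 4800 instead of 0.17, the z-score comes out a bit lower, roughly 7.55, which is still far out in the tail.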

So what can we get out of this value? Well, the first thing we need to do before we go into that is select a significance level. So you can choose the significance level that you want. Typically it's 0.05 or 0.01, and that's the alpha value that you'll sometimes see. For this case, I'm going to choose 0.05.

Now, what does that mean? Well, if we look at the distribution, because I'm choosing 0.05 and I'm doing a two-tailed test, it means I'm looking at the tail ends of the distribution. Now that means then that 5% of the distribution is going to lie in the tail ends, 2.5% in each one. So this is the 2.5th percentile here, and this is the 97.5th percentile here. Or this is 0.025, and this is 0.975.

Now, the other thing to keep in mind is the z-chart that I have. Here I have a positive z-chart, meaning that I'm looking at the right side of the distribution table. So I'm going to then use this value, the 0.975, to help find the z-critical value.

So locating the 0.975, you see it's over here. And now if I go to the left, I get a value of 1.9. And if I go up, I get a value of 0.06, which corresponds to the value of 1.96. So my z-critical value then is 1.96.

So what does this tell me? Well, we knew that our z-score from our sample was 8.24. So we can see that our value for our z-score is much higher than the critical value. And when that happens, we would say then that we would reject the null hypothesis, or that we have good reason to believe then, based on our sample, that the population proportion of brown N&Ns is not what Venus Company claimed.
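If you don't have a z-chart handy, the same lookup and comparison can be sketched with Python's standard library. This is not how the lesson does it (the lesson reads a printed table). NormalDist comes from the statistics module in Python 3.8 and later, and the variable names are made up for this example.

```python
from statistics import NormalDist

alpha = 0.05      # significance level chosen in the lesson
z_sample = 8.24   # z-score calculated from the sample

# Two-tailed test: find the z-value with 0.975 of the area to its left
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Reject the null hypothesis if the sample z-score lies beyond either critical value
reject_null = abs(z_sample) > z_critical

print(round(z_critical, 2))  # 1.96
print(reject_null)           # True
```

Taking the absolute value covers both tails at once, so a z-score of -8.24 would lead to the same decision.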

Now, you can solve this problem another way by converting your z-values into probability values, or p-values. And it might not be so helpful for our z-value here, because it's 8.24, and our chart only goes up to 3.09. So our value is kind of off the chart.

But suppose we had calculated a z-value of 2.40. What would that tell us? Well, 2.40, we can find that on this chart here. So 2.4 and 0, so I get a value of-- or a p-value of 0.9918.

Now, this would correspond to-- remember, we're dealing with the right side here. So I'm going to have to subtract this from 1, which is going to give me a value of 0.0082. Now, because I'm doing a two-tailed test, I would then double this value. So I take the 0.0082, multiply that by 2, and 82 times 2 is 164. And then I have four decimal places here, so one, two, three, four. So I get a value of 0.0164 for my p-value.

So what does this tell me? So this tells me then that if the null hypothesis were true, the probability that I got a result at least as extreme as the one I did just by random chance would be 0.0164, or 1.64%. So it's a really small chance that I'd actually get the result that I did. In fact, at this point, we know that the 5% significance level, which is our alpha of 0.05, is actually larger than the 1.64% that we calculated here, meaning that we would reject the null hypothesis.
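The p-value steps for the supposed z-value of 2.40 can be sketched the same way. Again this uses Python's standard library in place of the printed z-chart, and the variable names are just for illustration.

```python
from statistics import NormalDist

alpha = 0.05   # significance level chosen in the lesson
z = 2.40       # the supposed z-value from the lesson

area_left = NormalDist().cdf(z)   # the value read off the positive z-chart
one_tail = 1 - area_left          # area in the right tail
p_value = 2 * one_tail            # doubled because the test is two-tailed

print(round(area_left, 4))  # 0.9918
print(round(one_tail, 4))   # 0.0082
print(round(p_value, 4))    # 0.0164
print(p_value < alpha)      # True, so we reject the null hypothesis
```

For a one-tailed test you would skip the doubling and compare the single tail area to alpha.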