Chi-Square Statistic

Author: Ryan Backman

Video Chapters

( 00:00 - 00:48 ) Introduction to Motivating Context

( 00:49 - 01:38 ) Discussion of Hypothesis Writing

( 01:39 - 02:04 ) Definition of Observed Frequency

( 02:05 - 02:28 ) Definition of Expected Frequency

( 02:29 - 04:20 ) Calculation and Discussion of Expected Frequencies

( 04:21 - 04:29 ) Definition of Chi-Square Statistic

( 04:30 - 04:42 ) Formula for Chi-Square Statistic

( 04:43 - 06:27 ) Calculation of the Chi-Square Statistic

( 06:28 - 07:59 ) Discussion of the Chi-Square Statistic and p-value

Video Transcription

Hi. This tutorial covers the chi-square statistic. This symbol here is the Greek letter chi. Because the statistic is written as chi squared, we call it the chi-square statistic.

All right, so let's take a look at an example here. So there are four flavors of candy in a bag, cherry, lemon, orange, and strawberry. The candy company claims the flavors are equally distributed in each bag.

I have a reason to disagree. After opening a bag of candy and sorting the flavors, the following counts were produced: 11 cherry, 15 lemon, 12 orange, and 12 strawberry. Lemon is by far my least favorite, and it always seems like there's more lemon in a bag than any other flavor, so I want to think about how I could test to see if they actually aren't equally distributed.

So if I'm thinking about equal distribution, it's helpful to think of the proportion of each flavor and then make some hypotheses based on those proportions. So what I'm going to do for my null hypothesis is assume that all of the flavor proportions are the same. I have four flavors, so I'm going to write my null hypothesis as p sub 1 equals p sub 2 equals p sub 3 equals p sub 4, where each of these p's is the population proportion for one of the four flavors. And then for my alternate, all I'm going to say is that the null is false.
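
Written out in symbols, the two hypotheses might look like the following. The value 0.25 is not stated at this point in the video; it follows from four proportions that are equal and sum to 1.

$$H_0:\ p_1 = p_2 = p_3 = p_4 = 0.25 \qquad\qquad H_a:\ H_0 \text{ is false (at least one } p_i \neq 0.25)$$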

So in order to start thinking about hypothesis tests for multiple proportions like we have here, we need to consider two things: what's called the observed frequency and the expected frequency. The observed frequency is the number of experimental occurrences of a certain category. So when I'm thinking about observed frequency, I'm thinking about the counts that I actually got. I had 11 cherry and 15 lemon, so the 11 and the 15 would be observed frequencies.

Now, we also have something called expected frequencies. Expected frequencies are the number of occurrences of a certain category if the null hypothesis is true. So now what I'm doing is thinking about my null hypothesis. If that's true, if the flavor proportions were actually equal, how many would I expect in each flavor category?

So let's actually write down our observed frequencies and then calculate our expected frequencies. My observed frequencies were 11 cherry, 15 lemon, 12 orange, and 12 strawberry. My expected frequencies are, again, what I would see if the null hypothesis were true, that is, if these four flavor categories were actually all evenly distributed.

So now, in order to do that, what I need to do is calculate n, the number of candies in the bag, so what I would do here is just add up those four numbers. And I get 50, so there are 50 total candies. Now, if each of these flavor categories were equally distributed among the 50 candies, what I would just need to do then is take 50 and multiply by 25% because we'd want 25% of the 50 to be each flavor.

So my expected count in each case would be 12.5. Even though you can't get 12 and 1/2 candies, we're going to assume that this number is still meaningful. It means we'll usually have between 12 and 13 candies of each flavor. So these are our observed values, and these are our expected values.
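
The expected-count arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation described in the video; the variable names are made up for the example.

```python
# Observed counts from the bag in the example.
observed = {"cherry": 11, "lemon": 15, "orange": 12, "strawberry": 12}

# n is the total number of candies in the bag.
n = sum(observed.values())

# Under the null hypothesis every flavor has the same share: 1/4 = 25%.
null_proportion = 1 / len(observed)

# Expected count for each flavor if the null hypothesis were true.
expected = {flavor: n * null_proportion for flavor in observed}

print(n)         # total candies
print(expected)  # expected count per flavor
```

With these counts, n comes out to 50 and each expected count is 12.5, the same numbers worked out above.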

Now, if these values are pretty close together, we would probably not have evidence to suggest that the null hypothesis is false. If they were really far apart, we might have evidence to suggest that the alternate hypothesis is true.

So now what we're going to do with these observed and expected values is actually calculate what's called the chi-square statistic, and the chi-square statistic is a statistic that allows for hypothesis tests to be conducted for categorical data. Now, here is the formula for the chi-squared statistic. It's the quantity O minus E, squared, divided by E, and then the sum of that over all of the categories.

So what I'm going to do now is actually calculate the chi-squared statistic for my observed values here. So chi squared is equal to the sum of O minus E squared over E. So what I'm going to do is I'm going to create a new category over here where it's O minus E squared over E, and then I'm going to calculate that for each of the four categories. And then I'm going to take the sum of all of those.

So just so you can see some of these calculations, I'm going to do 11 minus 12.5, squared, divided by E, which is 12.5, and I end up, in this case, with 0.18. Now I'm going to do it for the next one: 15 minus 12.5, quantity squared, divided by 12.5. So that's 0.5. The next one is 12 minus 12.5, squared, divided by 12.5, and you get 0.02. And then the last one is the same, so it will also be 0.02.

And then what I would do finally to calculate chi squared is take the sum of those four numbers. If I add those up in the calculator, 0.18 plus 0.5 plus 0.02 plus 0.02 is 0.72. So in this case, my chi-squared statistic ends up being 0.72.
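
As a cross-check on the hand calculation, the same sum can be sketched in Python. The list names are illustrative, and the expected counts of 12.5 are taken from the equal-proportions null hypothesis used in this example.

```python
# Observed and expected counts for cherry, lemon, orange, strawberry.
observed = [11, 15, 12, 12]
expected = [12.5, 12.5, 12.5, 12.5]

# One (O - E)^2 / E term per flavor category.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

# The chi-square statistic is the sum of those terms.
chi_square = sum(contributions)

print([round(c, 2) for c in contributions])
print(round(chi_square, 2))
```

Lemon contributes the largest term because its observed count is farthest from 12.5, yet the total is still small.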

Now, large values of this chi-squared statistic will allow us to reject the null hypothesis, and small values will cause us to fail to reject the null hypothesis. These decisions to reject or fail to reject the null hypothesis are based on a p-value, just like in other hypothesis tests. We can calculate a p-value from a chi-squared value using a calculator function called chi-squared CDF.

So what I'm going to do is type in my chi-squared value, 0.72, as the lower bound. Chi-squared tests are always upper-tailed tests, so for the upper bound we're going to put in a large number to represent infinity. The third argument for chi-squared CDF is the degrees of freedom. For this type of test, the degrees of freedom is always the number of categories minus 1, so in this case, it's 3. So I'm going to type that in, and my p-value ends up being about 0.868, which is a very large p-value, so the result is not significant.
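
Without a graphing calculator, the same upper-tail area can be approximated in Python. The sketch below is not the calculator's method: it uses a closed-form expression that is valid only for 3 degrees of freedom, and the function name is hypothetical.

```python
import math


def chi_square_upper_tail_df3(x: float) -> float:
    """Upper-tail probability P(X >= x) for a chi-square variable with 3 df.

    Closed form for 3 degrees of freedom only (an assumption of this sketch):
    erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


categories = 4
degrees_of_freedom = categories - 1  # number of categories minus 1

p_value = chi_square_upper_tail_df3(0.72)
print(degrees_of_freedom)
print(round(p_value, 3))  # about 0.868, as in the video
```

For other degrees of freedom a statistics library is the safer route; for instance, SciPy's `scipy.stats.chi2.sf(0.72, 3)` computes the same upper-tail probability.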

So in this case, my chi-squared value of only 0.72 does not allow me to reject the null hypothesis, so I do not have evidence to show that the candies are not equally distributed. All right, that's been your tutorial on the chi-square statistic. Thanks for watching.

Terms to Know

Chi-Square Statistic: The sum of the ratios of the squared differences between the expected and observed counts to the expected counts.
Expected Frequencies: The number of occurrences we would have expected within each of the categories in a qualitative distribution if the null hypothesis were true.
Observed Frequencies: The number of occurrences that were observed within each of the categories in a qualitative distribution.

Formulas to Know

Chi-Square Statistic: $\chi^2 = \sum \frac{(O - E)^2}{E}$