Source: Tables created by Jonathan Osters
In this tutorial, you're going to learn about the chi-square statistic and how it's calculated. We're not going to run any significance tests in this tutorial because the chi-square tests have many different versions, and each of them will have its own tutorial. This tutorial is going to focus on how the statistic is calculated, as it's calculated the same way regardless of the test you're running.
So let's take a look. Suppose I have a tin of colored beads. And I claim that the tin contains the colored beads in these proportions-- 35% blue, 35% green, 15% yellow, and 15% red. And I draw 10 beads from the tin. And I wonder, is what I drew consistent with the claim or not? Why or why not?
Well, it looks like I got 4 red, 3 blue, 1 green, and 2 yellow. The two yellow seem fairly consistent with the 15% claim. But the four red don't seem all that consistent with the 15% claim for red. How can we measure that discrepancy?
If the claim were true, we would have expected that out of 10, 3 and 1/2 of them would be blue, 3 and 1/2 green, 1 and 1/2 yellow, and 1 and 1/2 red. Now, you can't actually get 3 and 1/2 blue, because you can't have half of a bead be blue. So this is sort of an idealized scenario, this 0.35 times 10 and 0.15 times 10. So these don't have to be possible actual outcomes. These are what you would expect on average, in the long term, across many samples of 10.
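The expected counts above are just each claimed proportion times the sample size. A minimal sketch of that arithmetic, where the variable names are chosen here for illustration and are not from the tutorial:

```python
# Claimed proportions from the bead example in the tutorial.
claimed = {"blue": 0.35, "green": 0.35, "yellow": 0.15, "red": 0.15}
sample_size = 10

# Expected count for each color = claimed proportion * sample size.
# These are long-run averages, so they need not be whole numbers.
expected = {color: p * sample_size for color, p in claimed.items()}

for color, count in expected.items():
    print(f"{color}: {count:.1f}")
```

Note that the expected counts add up to the sample size, just as the observed counts do.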
Whereas, in our one sample of 10, what we actually ended up getting was 3 blue, 1 green, 2 yellow, and 4 red. The question is, how can we measure the discrepancy between what we observed and what we expected? It appears that two of these were pretty close. Blue and yellow were pretty close to what we expected. Whereas, green and red were pretty far off.
The statistic that we use to measure discrepancy from what we expect is called chi-square. And every time we use it, it's calculated this way-- take the observed values, the ones that we got, subtract the expected values, the ones that we thought we would get (those are the 1.5's and the 3.5's from the table), square that difference, divide by the expected value, and then add up all of those fractions.
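Written as a formula, that recipe looks like the following. The symbols are notation chosen here, not taken from the tutorial: O_i is the observed count in category i, E_i is the expected count in category i, and k is the number of categories (4 in the bead example).

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
```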
We'll show you what that looks like here. The expecteds were the 3 and 1/2, 3 and 1/2, 1 and 1/2, and 1 and 1/2. The observed were the 3, 1, 2, and 4.
What I would do in each of these is do the 3 minus 3 and 1/2, which gives you negative 1/2, square it to get 1/4, and then divide, in this case, by 3 and 1/2. I would do the same thing for each of the categories and then add them all up together. In this case, the chi-square statistic value is 6.1905.
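For anyone who wants to check the table by hand, here is a small sketch of the same calculation. The function name `chi_square_statistic` is hypothetical, made up for this illustration; the counts are the ones from the bead example.

```python
# Observed and expected counts from the bead example.
observed = {"blue": 3, "green": 1, "yellow": 2, "red": 4}
expected = {"blue": 3.5, "green": 3.5, "yellow": 1.5, "red": 1.5}


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum(
        (observed[category] - expected[category]) ** 2 / expected[category]
        for category in expected
    )


# Each category's contribution, to show where the discrepancy comes from.
for category in expected:
    contribution = (observed[category] - expected[category]) ** 2 / expected[category]
    print(f"{category}: {contribution:.4f}")

chi_square = chi_square_statistic(observed, expected)
print(f"chi-square: {chi_square:.4f}")  # chi-square: 6.1905
```

Green and red contribute almost all of the total, which matches the earlier observation that those two colors were the ones far from what we expected.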
Now, it's worth noting that in this case, the conditions for inference with a chi-square test are not met. This is only meant to illustrate how a chi-square statistic would be calculated, although we can't do any real chi-square inference on this because the sample size isn't large enough.
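The tutorial doesn't say exactly which condition fails. One commonly taught rule of thumb, assumed here purely for illustration, is that every expected count should be at least 5. Under that assumption, a rough sketch of the check might look like this (the threshold and the function name are assumptions, not from the tutorial):

```python
import math

# Claimed proportions and sample size from the bead example.
claimed = {"blue": 0.35, "green": 0.35, "yellow": 0.15, "red": 0.15}
sample_size = 10

# ASSUMPTION: rule-of-thumb threshold; the tutorial does not state a number.
MIN_EXPECTED = 5


def meets_expected_count_rule(expected, threshold=MIN_EXPECTED):
    """Return True if every expected count is at least the threshold."""
    return all(count >= threshold for count in expected.values())


expected = {color: p * sample_size for color, p in claimed.items()}
print(meets_expected_count_rule(expected))  # False

# Smallest sample size for which the rarest category would reach the threshold.
smallest_n = math.ceil(MIN_EXPECTED / min(claimed.values()))
print(smallest_n)  # 34
```

With only 10 beads, no category reaches an expected count of 5, which is consistent with the point that the sample is too small for real inference.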
And so to recap. The chi-square statistic is a measure of discrepancy across categories from what we would have expected in categorical data. We can only use it for data that appear in categories, that is, qualitative data. The expected values may not be whole numbers since the expected values are long-term average values.
And we can use a table to calculate the chi-square statistic or we can use technology. A lot of the time, the statistic is calculated using technology. In our example, because it was so simple and there weren't that many categories, just the four of them, we used a table to calculate it.
So we learned how to calculate the chi-square statistic. Good luck. And we'll see you next time.
Chi-square statistic: The sum of the ratios of the squared differences between the expected and observed counts to the expected counts.
Observed frequencies: The frequencies within each of the categories in a qualitative distribution.
Expected frequencies: The frequencies we would have expected within each of the categories in a qualitative distribution if the null hypothesis were true.