Source: Tables created by Jonathan Osters
In this tutorial, we're going to look at a chi-square test for goodness of fit.
So the chi-square distribution is something we should look at first. It's a skewed-right distribution, and the chi-square statistic measures, generally, how far a sample of categorical data is from what we would expect if we had an idea of what the population should look like in those categories. A low value of chi-square indicates a small discrepancy, whereas a larger chi-square value indicates a large discrepancy. The p-value is always the area in the chi-square distribution to the right of the particular chi-square statistic that we end up calculating. And, again, low values of chi-square are likely to happen by chance, and high values of chi-square are unlikely to happen by chance.
The degrees of freedom for the chi-square distribution is the number of categories minus 1. Just like the t-distribution, the chi-square distribution is actually a family of curves. The shape changes a little bit based on the degrees of freedom, but it's always skewed to the right.
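As a small illustration of how the shape depends on the degrees of freedom, here is a minimal sketch. It relies on a standard fact that the tutorial itself does not state: the skewness of a chi-square distribution is the square root of 8 divided by the degrees of freedom. The function names are made up for this example.

```python
import math


def degrees_of_freedom(num_categories):
    """Degrees of freedom for a goodness-of-fit test: categories minus 1."""
    return num_categories - 1


def chi_square_skewness(df):
    """Skewness of a chi-square distribution (standard result, assumed here)."""
    return math.sqrt(8 / df)


# The skewness stays positive (skewed right) but shrinks as df grows.
for num_categories in (3, 6, 12):
    df = degrees_of_freedom(num_categories)
    print(num_categories, df, round(chi_square_skewness(df), 3))
```

With 12 categories, as in a month-by-month breakdown, this gives 11 degrees of freedom.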
The conditions for using the chi-square distribution are that the data represent a simple random sample from the population; that the observations are sampled independently from the population, which is the condition that the population is at least 10 times the sample size (the 10% of the population condition that we've been using many times over); and that the sample size is large. This is similar to the conditions that we came up with when we were checking normality in other hypothesis tests, although here we're not trying to use the normal distribution. We're trying to use the chi-square distribution.
Now, the question is, how large is a large sample? In this case, all the expected counts have to be at least 5.
So let's take a look at an example. In the book Outliers, Malcolm Gladwell outlines a trend that he finds in professional hockey, related to birth month. If you see the following distribution here of number of hockey players born in a particular month, is this what you would have expected, given that the population falls out this way?
So let's take a look. It certainly appears that the earlier months of the year have larger numbers of NHL players born in them. That's not very consistent with the nearly uniform distribution of the population. What we would have expected is that, of those 512 professional hockey players, 8% of them would have been born in January, 7% of them would have been born in February, et cetera. So we can find a whole list of expected values, one for each month, by multiplying 512 by that month's percentage.
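The expected-count arithmetic can be sketched as follows. Only the January and February percentages quoted in this passage are used, since the full table of population percentages is not reproduced here. The variable and function names are made up for this example.

```python
SAMPLE_SIZE = 512  # number of hockey players in the sample

# Population shares quoted in the tutorial (the other months are omitted here).
population_share = {"January": 0.08, "February": 0.07}


def expected_count(sample_size, share):
    """Expected count for a category: sample size times hypothesized share."""
    return sample_size * share


expected = {
    month: expected_count(SAMPLE_SIZE, share)
    for month, share in population_share.items()
}

for month, count in expected.items():
    print(f"{month}: {count:.2f}")
```

Notice that the expected counts come out as decimals, not whole numbers of players.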
So it looks like this. We would have expected 9% of the players to have been born in each of July, August, September, October, and December, so all of those expected values are 46.08. What actually ended up happening, though, was that just 30 were born in December. So when we perform a chi-square goodness-of-fit test, the null hypothesis is that the distribution of birth months for hockey players follows the specified distribution, the same as the distribution for everyone who was born in, in this case, Canada, because it's Canadian hockey players. The alternative hypothesis is that the distribution of birth months for hockey players differs from the distribution of birth months for the general populace.
We'll state that our significance level will be 5%. If we get a p-value below 0.05, we'll reject the null hypothesis. So let's look at the conditions. First, was this a simple random sample? We can treat it as such. This was a sample of hockey players born between 1980 and 1990. We can't imagine that that's going to be particularly different or unrepresentative. So we can treat this as a random sample of players who have played or will play pro hockey.
What about the independence piece? We're going to have to assume there are at least 10 times as many players who have ever played pro hockey as there were in our sample, such that we can get that independence piece. So we have to assume that there are at least 5,120 players who have ever played pro hockey.
And, finally, are all the expected counts at least 5? Well, the smallest expected count occurred here in February: 35.84. So, yes. When you look at the entire row of expected values, all of them are at least 5.
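The two numeric conditions can be checked with a short sketch like the one below. The function names are made up for this example, and the list of expected counts holds only the values quoted in the tutorial, not the full row.

```python
def population_large_enough(population_size, sample_size):
    """10% condition: the population is at least 10 times the sample size."""
    return population_size >= 10 * sample_size


def expected_counts_large_enough(expected_counts, minimum=5):
    """Large-sample condition: every expected count is at least `minimum`."""
    return all(count >= minimum for count in expected_counts)


sample_size = 512
print(10 * sample_size)  # smallest population size that satisfies the 10% condition

# Expected counts quoted in the tutorial; 35.84 (February) is the smallest.
quoted_expected = [40.96, 35.84, 46.08]
print(expected_counts_large_enough(quoted_expected))
```

Since the smallest expected count is well over 5, checking it alone is enough to settle the large-sample condition.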
So what we're going to do is calculate our chi-square statistic. For each month, we take the observed count minus the expected count, square it, and divide by the expected count. That gives one component per month: this is the component from January, this is the component from February, on and on, through the component from December.
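The calculation just described can be sketched in a few lines. The function names are made up for this example. The only full observed and expected pair quoted in the tutorial is December (30 observed, 46.08 expected), so that single component is worked out here instead of the whole table.

```python
def chi_square_component(observed, expected):
    """One category's contribution: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected


def chi_square_statistic(observed_counts, expected_counts):
    """Sum of the components over all categories."""
    if len(observed_counts) != len(expected_counts):
        raise ValueError("observed and expected must have the same length")
    return sum(
        chi_square_component(o, e)
        for o, e in zip(observed_counts, expected_counts)
    )


# December: 30 players observed, 46.08 expected.
december = chi_square_component(30, 46.08)
print(round(december, 2))  # about 5.61
```

Passing all 12 observed counts and all 12 expected counts to `chi_square_statistic` would add up the twelve monthly components.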
When you add all of those components together, you get a chi-square value of 34.21. In this case, it's also a good idea to state that there were 11 degrees of freedom. Now, remember: there were 512 hockey players, but there were 12 categories, and the degrees of freedom is the number of categories minus 1. The p-value obtained from technology (most of the time we calculate the test statistic and the p-value using technology) is 0.00033. That's a low p-value, less than 0.05. We had expected the hockey players to follow the population's birth month distribution, and since our p-value is low, we're not willing to attribute such a strong deviation from that to chance. We reject the null hypothesis in favor of the alternative, and conclude that the distribution of birth months for professional hockey players differs significantly from the birth month distribution for the general populace.
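The tutorial does not say which technology produced the p-value, so the following is only one possible sketch, using the Python standard library. It computes the right-tail area for whole-number degrees of freedom by starting from a known tail (the complementary error function for odd degrees of freedom, an exponential for even) and adding one term per step of a standard incomplete-gamma recurrence. The function name is made up for this example; in practice a statistics library such as SciPy (`scipy.stats.chi2.sf`) would normally be used instead.

```python
import math


def chi_square_p_value(statistic, df):
    """Right-tail area of the chi-square distribution (positive integer df)."""
    if statistic < 0 or df < 1 or int(df) != df:
        raise ValueError("need statistic >= 0 and a positive integer df")
    z = statistic / 2
    if df % 2 == 0:
        a = 1.0
        tail = math.exp(-z)              # tail for 2 degrees of freedom
    else:
        a = 0.5
        tail = math.erfc(math.sqrt(z))   # tail for 1 degree of freedom
    # Each pass raises the degrees of freedom by 2 until a reaches df / 2.
    while a < df / 2:
        tail += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return tail


alpha = 0.05
p_value = chi_square_p_value(34.21, 11)
print(f"{p_value:.5f}")  # close to the 0.00033 reported above
print("reject the null" if p_value < alpha else "fail to reject the null")
```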
And so, to recap. The chi-square statistic is a measure of discrepancy across categories from what we would have expected in our categorical data if it followed a particular distribution. Remember, the expected values, like the ones we got here, might not be whole numbers, since each expected value is a long-term average. In this tutorial we talked about the chi-square distribution being a skewed-right distribution, with chi-square statistics near zero being more common if the null hypothesis is true. We also performed a goodness-of-fit test, which is used to see if the distribution of data across categories fits a hypothesized distribution across categories. We had some distribution in mind that we thought the data should fit. Do they fit? A chi-square goodness-of-fit test will tell us.
Good luck, and we'll see you next time.
Chi-square goodness-of-fit test: A hypothesis test where we test whether or not our sample distribution of frequencies across categories fits with hypothesized probabilities for each category.