Chi-Square Test for Goodness-of-Fit

This lesson will explain the chi-square test for goodness-of-fit.


Tutorial

What's Covered

This tutorial will cover the chi-square test for goodness-of-fit. You’ll learn about:

  1. The Chi-Square Distribution
  2. Conditions
  3. The Chi-Square Test for Goodness-of-Fit

1. The Chi-Square Distribution

The chi-square distribution is a good place to start.


The chi-square distribution is skewed to the right. The chi-square statistic measures, in general, how far a sample of categorical data departs from what you would expect to see if the population were distributed across the categories the way you hypothesized.

  • A low value of chi-square indicates a small discrepancy between the sample and what you expected.
  • A large value of chi-square indicates a large discrepancy.

The p-value is always the area under the chi-square distribution to the right of the chi-square statistic that you calculate. If the hypothesized distribution is correct, low values of chi-square are likely to happen by chance and high values are unlikely to happen by chance, so a large statistic leaves only a small area to its right.

For a goodness-of-fit test, the degrees of freedom for the chi-square distribution is the number of categories minus 1. Just like the t-distribution, the chi-square distribution is actually a family of curves. The shape changes a little bit based on the degrees of freedom, but it's always skewed to the right.
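To make "area to the right" concrete, here is a minimal sketch that approximates that area by adding up thin slices under the chi-square density. The function names, the upper cutoff of 200, and the number of slices are illustrative assumptions, not part of the lesson; in practice this number usually comes from a calculator or statistics package.

```python
import math

def chi2_pdf(x, df):
    """Height of the chi-square density with df degrees of freedom at x > 0."""
    half = df / 2.0
    return x ** (half - 1.0) * math.exp(-x / 2.0) / (2.0 ** half * math.gamma(half))

def right_tail_area(statistic, df, upper=200.0, steps=20000):
    """Approximate the area to the right of a positive statistic (trapezoid rule).

    The curve is cut off at `upper`; the area beyond that point is negligible
    for the small degrees of freedom used here.
    """
    width = (upper - statistic) / steps
    total = 0.5 * (chi2_pdf(statistic, df) + chi2_pdf(upper, df))
    for i in range(1, steps):
        total += chi2_pdf(statistic + i * width, df)
    return total * width

# With 3 degrees of freedom, a small statistic leaves a large area to its right
# and a large statistic leaves a small one.
print(right_tail_area(2.0, 3))
print(right_tail_area(10.0, 3))
```

Changing the second argument shows how the same statistic gives a different area for a different member of the family of curves.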


2. Conditions

The conditions for using that chi-square distribution are:

  1. The data represent a simple random sample from the population.
  2. The observations should be sampled independently from the population. When sampling without replacement, the population should be at least 10 times the sample size, which is called the 10% of the population condition.
  3. The sample size is large. This is similar to the conditions for checking normality in other hypothesis tests.

So how large is considered a large sample? In this case, all the expected counts have to be at least 5.
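The two numeric conditions can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical helper named check_conditions and uses made-up sample sizes and probabilities purely for illustration. Whether the data are a simple random sample cannot be checked this way; that depends on how the data were collected.

```python
def check_conditions(sample_size, population_size, hypothesized_probs, min_expected=5):
    """Return (ten_percent_ok, large_sample_ok, expected_counts)."""
    # Expected count in each category = sample size times hypothesized probability.
    expected = [sample_size * p for p in hypothesized_probs]
    # 10% condition: the population is at least 10 times the sample size.
    ten_percent_ok = population_size >= 10 * sample_size
    # Large-sample condition: every expected count is at least 5.
    large_sample_ok = all(count >= min_expected for count in expected)
    return ten_percent_ok, large_sample_ok, expected

# Hypothetical example: 60 observations, three categories.
print(check_conditions(60, 5000, [0.5, 0.3, 0.2]))

# Hypothetical example with too small a sample: one expected count falls below 5.
print(check_conditions(20, 5000, [0.5, 0.3, 0.2]))
```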


3. The Chi-Square Test for Goodness-of-Fit

When should you perform a Chi-Square Goodness-Of-Fit Test?

Term to Know

  • Chi-Square Test for Goodness-of-Fit
  • A hypothesis test where we test whether or not our sample distribution of frequencies across categories fits with hypothesized probabilities for each category.

It's easiest to practice in an example.

Think About It

In the book Outliers, Malcolm Gladwell outlines a trend that he finds in professional hockey, related to birth month.

Here is a distribution of the number of hockey players born in a particular month.

Is this what you would expect, given the general population?

It certainly appears that the earlier months of the year have larger numbers of NHL players born in them, which is not very consistent with the nearly uniform distribution of the population. What you would have expected is that, of those 512 professional hockey players, 8% of them would have been born in January, 7% of them would have been born in February, and so on.

So it looks like this:

You would have expected 9% of the players to have been born in each of July, August, September, October, and December, so each of those expected values is 46.08. However, apparently just 30 were born in December. The expected percentages are the birth-month distribution for everyone who was born in, in this case, Canada, because these are Canadian hockey players.

  • The null hypothesis is that the distribution of birth months for the population of hockey players follows the specified distribution, that of the general populace.
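The expected values quoted here are just the sample size multiplied by each hypothesized percentage. A short sketch, using only the percentages stated in the text (the remaining months are not listed, so they are left out); the variable names are illustrative:

```python
n_players = 512  # sample size given in the text

# Only the percentages stated in the text; other months are not filled in.
stated_shares = {
    "January": 0.08,
    "February": 0.07,
    "July": 0.09,
    "August": 0.09,
    "September": 0.09,
    "October": 0.09,
    "December": 0.09,
}

# Expected count = sample size times hypothesized share.
expected = {month: n_players * share for month, share in stated_shares.items()}

for month, count in expected.items():
    print(f"{month}: {count:.2f}")
```

Note that the expected counts are not whole numbers, which is fine: each one is a long-run average, not a count you could actually observe.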
  • The alternative hypothesis is the distribution of birth months for hockey players differs from the distribution of birth months for the general populace.

Choose a significance level of 5%. If you get a p-value below 0.05, you'll reject the null hypothesis. Take a look at the conditions:

  1. First, was this a simple random sample? You can treat it as such. This was a sample of hockey players born between 1980 and 1990. There's no reason to imagine that that's going to be particularly different or unrepresentative. So you can treat this as a random sample of players who have played or will play pro hockey.
  2. What about independence? You have to assume that there are at least 10 times as many players who have ever played pro hockey as there were in our sample, such that we can get that independence piece. That would mean that you have to assume that there are at least 5,120 players who have ever played pro hockey.
  3. Are all the expected counts at least 5? The smallest expected count is February's, at 35.84. So, yes, when you look at the entire row of expected values, all of them are over 5.

So what you're going to do is to calculate your chi-square statistic. The chi-square statistic is going to be the observed minus the expected for each month squared, divided by the expected for each month.
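Written out, the statistic is the sum over all categories of (observed − expected)² / expected. Here is a minimal sketch of that calculation. The function name and the die-roll counts are hypothetical and are not the hockey data, which are not reproduced in this text.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of categories")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical example: 60 rolls of a die, with 10 expected on each face.
observed_rolls = [8, 12, 10, 9, 11, 10]
expected_rolls = [10] * 6

# Components: 0.4 + 0.4 + 0 + 0.1 + 0.1 + 0, about 1.0 in total.
print(chi_square_statistic(observed_rolls, expected_rolls))
```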

When you add all of those components together, you get a chi-square value of 34.21. It's also a good idea to state the degrees of freedom, which here is 11.

There were 512 hockey players, but there were 12 categories. The degrees of freedom is the number of categories minus 1.

The p-value obtained from technology (most of the time, you calculate the test statistic and the p-value using technology) is 0.00033. That's a low p-value, less than 0.05. Since the p-value is low, a discrepancy this large would be very unlikely to happen by chance alone if the null hypothesis were true.

This means that you reject the null hypothesis in favor of the alternative and conclude that the distribution of birth months for professional hockey players differs significantly from the birth month distribution for the general populace.
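As one illustration of what "technology" does here, the sketch below computes the right-tail area for a whole-number degrees of freedom using a standard recurrence for the incomplete gamma function. The function name chi2_right_tail is an assumption made for this example; a statistics package (for instance SciPy's scipy.stats.chi2.sf) would normally be used instead.

```python
import math

def chi2_right_tail(statistic, df):
    """Area to the right of `statistic` for a positive whole-number df.

    Uses Q(a + 1, y) = Q(a, y) + y^a * e^(-y) / Gamma(a + 1) with y = statistic / 2,
    starting from Q(1, y) = e^(-y) for even df or Q(1/2, y) = erfc(sqrt(y)) for odd df.
    """
    y = statistic / 2.0
    if df % 2 == 0:
        a = 1.0
        tail = math.exp(-y)
    else:
        a = 0.5
        tail = math.erfc(math.sqrt(y))
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail

# Values from the hockey example in the text.
p_value = chi2_right_tail(34.21, 11)
alpha = 0.05
reject_null = p_value < alpha

print(f"p-value: {p_value:.5f}")
print(f"reject the null hypothesis at the 5% level: {reject_null}")
```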


Summary

The chi-square statistic is a measure of discrepancy across categories from what we would have expected in our categorical data. The expected values might not be whole numbers, since each expected value is a long-term average. The chi-square distribution is a skewed right distribution, and chi-square statistics near zero are more common if the null hypothesis is true. The Goodness-of-Fit Test is used to see if the distribution of data across categories fits a hypothesized distribution across those categories.

Thank you and good luck!

Source: THIS WORK IS ADAPTED FROM SOPHIA AUTHOR JONATHAN OSTERS
