Author:
Anthony Varela

This lesson will help explain which inference test should be used based upon the data set.

Tutorial

There are a variety of inference tests that you can perform on your data. But you need to know what kind of data you're dealing with to choose the best or the appropriate inference test. So in this video, we are going to interpret this decision tree that goes through a series of questions to help determine what inference test you should use for your data.

So the first question that you need to ask is what kind of data am I dealing with. Is it categorical (qualitative), or is it quantitative? The big difference here is whether your data fall into descriptive categories, such as size or color, or whether it is something that you can measure numerically, something you can count. That would be quantitative.

If you're dealing with categorical or qualitative data, your next question is how many population proportions am I dealing with? One or two or more? If you're dealing with just one, you will perform a one-proportion z-test using a normal distribution. So that's the inference test you would use for one population proportion within categorical data.

If you're dealing with two or more, you'll perform a chi-squared test. There are different kinds of chi-squared tests, and we'll talk about them in more detail in a few moments: goodness of fit, homogeneity, and association/independence. So that sums up the categorical/qualitative side of things.

Let's jump over to quantitative. This is only a partial list of the inference tests you can use on the quantitative side. Really, I'm only listing tests for one population mean because that's what we're going to focus on. If you're dealing with two or more, you would perform a type of ANOVA test, which we're not going to get into.

So on quantitative data, if you're dealing with one population mean, the big question here is do we know what the standard deviation for the population is. If we don't know what it is, which is typically the case, we use a one-sample t-test using a t-distribution. If we happen to know what the population standard deviation is, which is rare, we perform a one-sample z-test using a normal distribution.
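The decision tree just described can be sketched in a few lines of Python. This is a minimal sketch and not part of the original lesson; the function name `choose_inference_test` and its parameters are hypothetical, chosen only for illustration.

```python
def choose_inference_test(data_type, groups=1, sigma_known=False):
    """Return the test suggested by the decision tree in this lesson.

    data_type:   "categorical" or "quantitative"
    groups:      number of population proportions (or means) involved
    sigma_known: whether the population standard deviation is known
    """
    if data_type == "categorical":
        if groups == 1:
            return "one-proportion z-test (normal distribution)"
        return "chi-squared test"
    if data_type == "quantitative":
        if groups == 1:
            if sigma_known:
                return "one-sample z-test (normal distribution)"
            return "one-sample t-test (t-distribution)"
        return "ANOVA (not covered in this lesson)"
    raise ValueError("data_type must be 'categorical' or 'quantitative'")


# One population mean, population standard deviation unknown
print(choose_inference_test("quantitative", groups=1, sigma_known=False))
```

The chi-squared branch still needs a second decision (goodness of fit, homogeneity, or association/independence), which depends on the question being asked rather than on the data type alone.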

So now that we've gone through this decision tree, let's walk through some examples and we'll decide what the appropriate inference test is. So my first example deals with blue Skittles. And I want to know if there's a difference in distribution of blue Skittles in a bag that I get from a store versus the claim for the distribution of blue Skittles in the entire population. So let's walk through that decision tree.

Well, I know that I'm dealing with categorical or qualitative data. We're talking about color. And I'm dealing with just one population proportion, a single color. We're talking about blue Skittles. So I know, then, that I'm going to perform a one-proportion z-test using a normal distribution.
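As a rough illustration of what a one-proportion z-test computes, here is a minimal sketch using only the Python standard library. The counts and the claimed proportion are hypothetical numbers invented for the example, not real Skittles data, and the function name is an assumption.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, claimed_p):
    """Two-sided one-proportion z-test against a claimed proportion."""
    p_hat = successes / n                       # sample proportion
    se = sqrt(claimed_p * (1 - claimed_p) / n)  # standard error under the claim
    z = (p_hat - claimed_p) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical: 30 blue candies in a bag of 100, claimed proportion 20%
z, p_value = one_proportion_z_test(30, 100, 0.20)
print(z, p_value)
```

A small p-value would suggest that the proportion in the bag differs from the claimed proportion.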

Now example two is still dealing with Skittles, but it's a bit different here. Is there a difference in distribution of Skittle colors in a bag versus the claim for the distribution of Skittle colors in the entire population? So I'm not dealing with just one type of Skittle. Not one color, but all of them. So walking through our decision tree, my data is still categorical or qualitative. We're still dealing with colors. But I'm dealing with two or more population proportions. I'm dealing with all colors, not just one. So I'm going to be performing a chi-squared test. But which kind?

Well, I want to know how well the distribution fits the claim. So I'm going to be performing a goodness-of-fit chi-squared test. And in this type of test, just a note about degrees of freedom. Degrees of freedom equals k minus 1, where k is the number of categories, not the sample size. So if my data has five color categories, my degrees of freedom would be 5 minus 1, or 4.
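To make the goodness-of-fit idea concrete, here is a minimal sketch of the chi-squared statistic, which sums (observed minus expected) squared over expected across the categories. The observed counts and claimed proportions are hypothetical, and the function name is an assumption. For a goodness-of-fit test, the degrees of freedom are the number of categories minus 1.

```python
def chi_squared_goodness_of_fit(observed, claimed_proportions):
    """Return the chi-squared statistic and degrees of freedom."""
    total = sum(observed)
    # Expected count per category if the claimed distribution is true
    expected = [total * p for p in claimed_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # number of categories minus 1
    return statistic, df


# Hypothetical bag of 100 candies in five colors, each claimed to be 20%
observed = [18, 22, 20, 25, 15]
claimed = [0.2, 0.2, 0.2, 0.2, 0.2]
statistic, df = chi_squared_goodness_of_fit(observed, claimed)
print(statistic, df)
```

The statistic would then be compared against a chi-squared distribution with `df` degrees of freedom, using a table or statistical software.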

Now in our next example, we're talking about activity by age group. And I want to know if there's a difference in distribution of activity level, such as non-active, active, or highly active, by age group. Are you in your 20s, your 30s, your 40s, et cetera. So let's walk through that decision tree.

We're dealing with categorical data. We've categorized activity level. And we have also put age into categories as well. I'm dealing with two or more population proportions. So I'll be performing a chi-squared test. But which kind?

Well, I want to know what the association is between activity level and age group. So that leads me, then, to perform a chi-squared test for association or independence. And the degrees of freedom for this type of test depends on the table. You're going to count the number of rows and columns in your table of data, subtract 1 from each of those numbers, and multiply them together. And we'll actually look at an example of that in a minute here.
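For a table of counts like this, the chi-squared statistic compares each observed cell with the count expected if the two variables were independent, which is the row total times the column total divided by the grand total. Here is a minimal sketch; the counts are hypothetical and the function name is an assumption.

```python
def chi_squared_statistic(table):
    """Chi-squared statistic for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are age groups (20s, 30s, 40s),
# columns are activity levels (non-active, active, highly active)
activity_by_age = [
    [20, 30, 50],
    [30, 40, 30],
    [40, 30, 30],
]
print(chi_squared_statistic(activity_by_age))
```

The same calculation applies to a chi-squared test for homogeneity; what differs is how the data were collected and the question being asked.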

Now let's look at AP exam scores. I want to know if there is a difference in distribution of passing exam scores at a particular high school between the years 2005 and 2007. So going through that decision tree again, we're dealing with categorical or qualitative data. We've categorized the scores. And we're dealing with two or more population proportions. This would be 2005, 2006, 2007, so each year. So I know that I'll be performing a chi-squared test again. But which kind?

I want to know if there's a significant difference across the population proportions from year to year to year. So I'm going to be performing a chi-squared test for homogeneity. And the degrees of freedom is similar to our last test where you're counting the number of rows and columns in your data table, subtracting 1 from each, and then multiplying those two numbers together. So let's look at a more concrete example to illustrate that point real quick.

So this is my table of data for the AP exam scores from 2005 to 2007. And as we can see, my table has three rows. So here's one row, two rows, three rows. And it also has three columns, one, two, and three. Notice I'm looking at just this area right here. So there are three rows and three columns. So my degrees of freedom is (3 minus 1) times (3 minus 1), or 2 times 2, which is 4.
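The degrees-of-freedom rule for a table is short enough to write as a one-line helper. This is just a sketch, and the function name is hypothetical.

```python
def table_degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a chi-squared test on a two-way table."""
    return (n_rows - 1) * (n_cols - 1)


# Three rows and three columns, as in the AP exam table
print(table_degrees_of_freedom(3, 3))  # 4
```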

Now that was the categorical or qualitative side of things. We're going to go through one more example to talk about quantitative data. And this is about average household income. So is the average household income in a particular neighborhood above the national average?

So let's go through our decision tree. We're dealing with quantitative data this time. We're dealing with income. That's something that can be counted and measured in dollars. We're dealing with one population mean.

And so now our big question is do we know the population standard deviation or not. If we don't know the population standard deviation, which is typically the case, we perform a one-sample t-test using a t-distribution. And our degrees of freedom is n minus 1, where n equals your sample size. Now if we happen to know what the population standard deviation is, which is rare, we perform a one-sample z-test using a normal distribution.
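Both one-sample tests use the same kind of statistic: the sample mean minus the claimed mean, divided by a standard error. The only difference is whether the standard error uses the sample standard deviation (t-test) or the known population standard deviation (z-test). Here is a minimal sketch using the Python standard library; the incomes, the national average, and the population standard deviation are hypothetical values, and the function names are assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sample_t(sample, claimed_mean):
    """t statistic and degrees of freedom (population SD unknown)."""
    n = len(sample)
    t = (mean(sample) - claimed_mean) / (stdev(sample) / sqrt(n))
    return t, n - 1


def one_sample_z(sample, claimed_mean, population_sd):
    """z statistic and one-sided p-value (population SD known)."""
    n = len(sample)
    z = (mean(sample) - claimed_mean) / (population_sd / sqrt(n))
    p_above = 1 - NormalDist().cdf(z)  # tests whether the mean is above the claim
    return z, p_above


# Hypothetical household incomes in thousands of dollars
incomes = [52, 61, 58, 70, 64, 55, 67, 73, 60]
national_average = 60   # hypothetical
print(one_sample_t(incomes, national_average))
print(one_sample_z(incomes, national_average, population_sd=8))  # hypothetical SD
```

The standard library has no t-distribution, so the t statistic here would be compared against a t table or statistical software with n minus 1 degrees of freedom.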

So I hope that this video tutorial helps you walk through the questions you need to answer to decide what type of inference test is appropriate for your data.