Source: Graphs, charts, and tables created by Jonathan Ostrens
In this tutorial, we're going to explain the difference between sample statistics and population parameters. These are fairly analogous things. Statistics, though, come from samples, and parameters are fixed values for the population.
So when we take a sample, we do our best to try and obtain values that are accurate, and that represent the true values for the population. So for instance, in election season, suppose we took a simple random sample of 500 people from a town of 10,000, and found that in this particular poll, 285 of those 500 plan to vote for Candidate Y. So that would mean that our best guess for the proportion of the town that will vote for Candidate Y, when the election actually does happen, is 285 out of 500, or 57%. That's our best guess. This is what we got from our sample. This 57% is a sample statistic.
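The arithmetic behind that 57% can be written as a minimal Python sketch. The variable names here are just illustrative; the numbers are the ones from the poll example.

```python
# Poll example: 285 of the 500 people sampled plan to vote for Candidate Y.
sample_size = 500
votes_for_y = 285

# The sample proportion (p hat) is the count of interest over the sample size.
p_hat = votes_for_y / sample_size

print(f"Sample proportion: {p_hat:.0%}")  # Sample proportion: 57%
```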
We don't know what the real proportion is of people who will vote for Candidate Y. We'll only know that after election day. But for now, this is our best guess as to the proportion that will vote for Candidate Y. So we are using the results of our sample to make an estimate for the value in the population. The true percentage in the population is the number that we don't know, and we're using this 57% to estimate it.
So a statistic is a measurement from a sample, and a parameter is the corresponding measurement for the population. A statistic is something that we can find from a sample. The only way to figure out a parameter exactly is to take a census. So in our example, the sample proportion was 57%. The population proportion, we don't know. We won't know it until election day.
In general, these are the notations for the sample statistics that we generate most often. The sample proportion is shown as p hat. The sample mean is shown as x bar. And a sample standard deviation is shown as s. Now for populations: a population proportion is p, so this is without the hat. A population mean is denoted by the Greek letter mu, and a population standard deviation is shown with the Greek letter sigma. So real quick recap, a sampling distribution is the distribution of all the values that a particular statistic, like a sample mean or a sample proportion, could take when all possible samples of a given size are taken.
So here's our handy dandy spinner with three 1s, a 2, two 3s, and two 4s, so 8 equally sized sectors. If you spun it four times to obtain a sample mean, the sample mean might be 2.5, as a result of a sample where the first spin was a 2, the second spin was a 4, the third spin was a 3, and the fourth spin was a 1.
Or the sample might look like this. In which case, its mean would be 2.25. Or it might look like this. In which case, the sample mean is 3.5. Or the mean might be 2 or 1.5 or 1.25. There are lots and lots and lots of possible samples that could be taken of size 4 for this spinner. And there are lots and lots of possible means that could arise from those samples.
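To make the spinner concrete, here's a small Python sketch. The list `SPINNER` and the helper `sample_mean` are names made up for this illustration; the sectors are the ones described above. It checks the mean of the sample 2, 4, 3, 1 and then draws one more random sample of size 4.

```python
import random
from fractions import Fraction

# The eight equally sized sectors: three 1s, a 2, two 3s, and two 4s.
SPINNER = [1, 1, 1, 2, 3, 3, 4, 4]

def sample_mean(spins):
    """Return the mean of a list of spins as an exact fraction."""
    return Fraction(sum(spins), len(spins))

# The sample from the example: spins of 2, 4, 3, and 1.
print(float(sample_mean([2, 4, 3, 1])))  # 2.5

# One more sample of size 4, drawn at random (seeded so it can be repeated).
rng = random.Random(0)
random_sample = [rng.choice(SPINNER) for _ in range(4)]
print(random_sample, float(sample_mean(random_sample)))
```

Each run with a different seed gives a different sample, and so possibly a different sample mean.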
What we're going to do to create a sampling distribution is we're going to take those means, those sample means, and place them here on the x-axis of a graph. And we're going to place them one at a time over and over. And if we went to the extreme and took every possible sample of size 4 for that spinner, the graph would look something like this. So consider every possible set of four outcomes and every possible mean that could arise.
If we took every possible scenario and plotted its mean, we would create a sampling distribution of sample means. You might notice that an average of 4, a sample mean of 4, happens only occasionally, because that requires that you get a 4 every time. A sample mean of 1 occurs sometimes. But it seems like the most common values for the sample means were between 2 and 3.
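Taking every possible sample of size 4 is something a computer can do directly. The sketch below is one way to do it, not the method used to make the graphs: it treats each of the 8 sectors as equally likely, lists all 8 × 8 × 8 × 8 = 4,096 ordered sets of four spins, and counts how often each sample mean comes up. The names `SPINNER` and `counts` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The eight equally sized sectors of the spinner.
SPINNER = [1, 1, 1, 2, 3, 3, 4, 4]
n = 4  # sample size

# Every ordered sequence of n sectors is equally likely, so counting
# sequences gives the exact sampling distribution of the sample mean.
counts = Counter(Fraction(sum(spins), n) for spins in product(SPINNER, repeat=n))
total = sum(counts.values())

for mean in sorted(counts):
    print(f"mean {float(mean):.2f}: {counts[mean]} of {total}")
```

In this count, a sample mean of 4 shows up in 16 of the 4,096 sequences (either of the two 4 sectors on every spin), and a sample mean of 1 shows up in 81 (any of the three 1 sectors on every spin).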
Now, this is that same graph converted into a histogram. So you can notice it's bell-shaped. By comparison, this would be the sampling distribution if the sample size was 1. This means that 1 occurs 3/8 of the time, 2 occurs 1/8 of the time, 3 occurs a fourth of the time, and 4 occurs a fourth of the time.
You'll notice that the shape of the distribution of sample means when the sample size is 4 is significantly different from when the sample size was 1. And then look at what happens at 9 and 20. These are the averages from samples of size 9. And these are the averages from samples of size 20.
And so there should be some things that you can recognize here about all four of these. There are some similarities and some differences. One similarity is their centers. All of them are centered at two and three-eighths. They're all centered at the same place.
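Those fractions for a sample of size 1 come straight from counting sectors. A quick sketch, with `SPINNER` and `probabilities` as illustrative names:

```python
from collections import Counter
from fractions import Fraction

# The eight equally sized sectors of the spinner.
SPINNER = [1, 1, 1, 2, 3, 3, 4, 4]

# With one spin, each value's probability is its share of the sectors.
sector_counts = Counter(SPINNER)
probabilities = {
    value: Fraction(count, len(SPINNER))
    for value, count in sorted(sector_counts.items())
}

for value, prob in probabilities.items():
    print(value, prob)
# 1 3/8
# 2 1/8
# 3 1/4
# 4 1/4
```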
You'll notice that some of these are more tightly packed around that number, like the samples of size 20 are more tightly packed around that number than the samples of size 1, for instance. But they all are centered at that very same number. What we can see here is that the mean of the sampling distribution of sample means is the same as the mean for the population. In this case, it was two and three-eighths.
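We can check that claim about the center exactly. The sketch below builds the distribution of the sample mean one spin at a time (listing every sample of size 20 would take far too long), then computes the mean of each sampling distribution. The function names are made up for this illustration, and the probabilities are the sector shares of the spinner.

```python
from fractions import Fraction

# Probability of each value on one spin (its share of the 8 sectors).
POPULATION = {1: Fraction(3, 8), 2: Fraction(1, 8), 3: Fraction(2, 8), 4: Fraction(2, 8)}

def sampling_distribution(n):
    """Exact distribution of the mean of n spins, as {mean: probability}."""
    sums = {0: Fraction(1)}  # distribution of the running total, before any spins
    for _ in range(n):
        next_sums = {}
        for total, p in sums.items():
            for value, q in POPULATION.items():
                next_sums[total + value] = next_sums.get(total + value, Fraction(0)) + p * q
        sums = next_sums
    return {Fraction(total, n): p for total, p in sums.items()}

def mean_of(distribution):
    """Mean of a distribution given as {value: probability}."""
    return sum(value * prob for value, prob in distribution.items())

for n in (1, 4, 9, 20):
    print(n, mean_of(sampling_distribution(n)))  # 19/8 each time
```

19/8 is two and three-eighths, the same as the population mean, no matter which sample size we use.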
How about spread? The arrows on each of these indicate the standard deviation of each distribution. Notice the arrows on the first distribution are very wide, and they seem to diminish in size with each successive distribution. By the time we get to the lowest distribution here, where the sample size was 20, its spread is much, much less.
So the rule that's being followed is that the standard deviation for the sampling distribution of sample means is the standard deviation of the population divided by the square root of sample size. What that indicates is that when the sample size is 4, the standard deviation of that sampling distribution of sample means is going to be half as large as it was when the sample size was one. When the sample size is 9, it's going to be a third the size of the original standard deviation. And when n is 20, it's going to be the original standard deviation divided by the square root of 20.
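Here's a sketch of that rule applied to the spinner. It computes the population standard deviation from the eight sectors, divides by the square root of each sample size, and, as a check, compares against the variance of all 4,096 possible sample means for samples of size 4. The helper `variance` and the other names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# The eight equally sized sectors of the spinner.
SPINNER = [1, 1, 1, 2, 3, 3, 4, 4]

def variance(values):
    """Population variance of equally likely values, as an exact fraction."""
    values = list(values)
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / len(values)

pop_var = variance(SPINNER)  # sigma squared for a single spin
sigma = sqrt(pop_var)        # population standard deviation

# Standard deviation of the sampling distribution: sigma / sqrt(n).
for n in (1, 4, 9, 20):
    print(n, round(sigma / sqrt(n), 4))

# Check for n = 4: the variance of every possible sample mean
# equals the population variance divided by 4.
means_n4 = [Fraction(sum(spins), 4) for spins in product(SPINNER, repeat=4)]
print(variance(means_n4) == pop_var / 4)  # True
```

Dividing the variance by 4 is the same as dividing the standard deviation by 2, which is the "half as large" in the rule above.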
And then finally, we've measured center, we've measured spread, so let's describe the shape of these distributions. You'll notice that the shape is becoming more and more like the normal distribution as the sample size increases.
There's a theorem that we have that describes that. And it's called the central limit theorem. It says that when the sample size is large, large being at least 30 for most distributions with a finite standard deviation, the sampling distribution of the sample means is approximately normal. That means we can use the normal distribution to calculate probabilities for sample means, which is nice because normal calculations are easy to do.
So it's going to be normal, or approximately normal, with a mean the same as that of the population, and a standard deviation equal to the standard deviation of the population divided by the square root of the sample size.
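As an illustration of the kind of normal calculation this allows, here's a sketch using Python's `statistics.NormalDist`. The question it answers is made up for this example: if we spin the spinner 30 times, roughly how likely is a sample mean above 2.5? The sample size of 30 and the cutoff of 2.5 are assumptions chosen for illustration; the mean of 19/8 and the standard deviation of sqrt(95)/8 are computed from the spinner's eight sectors. The result is an approximation, since the sampling distribution is only approximately normal.

```python
from math import sqrt
from statistics import NormalDist

# Population values for the spinner (three 1s, a 2, two 3s, two 4s).
mu = 19 / 8            # population mean, two and three-eighths
sigma = sqrt(95) / 8   # population standard deviation

n = 30                 # assumed sample size, chosen for illustration
cutoff = 2.5           # assumed cutoff, chosen for illustration

# Central limit theorem: sample means are approximately normal with
# mean mu and standard deviation sigma / sqrt(n).
sampling_dist = NormalDist(mu=mu, sigma=sigma / sqrt(n))

# Approximate probability that the sample mean exceeds the cutoff.
prob_above = 1 - sampling_dist.cdf(cutoff)
print(round(prob_above, 3))
```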
So to recap, statistics are sample measures that we can use to estimate parameters, which are the corresponding population measures. And that only works when the sampling is carried out well. For instance, if there's bias, then the statistics wouldn't accurately reflect the population measures. Repeated sampling gives rise to sampling distributions. The sampling distribution of sample means, in particular, has key features regarding its shape, center, and spread. Good luck and we'll see you next time.