Working with Data from Multiple Samples

Source: All graphs created by Dan Laub, Image of graduation cap, PD, http://bit.ly/1OUwJgu; Image of school, PD, http://bit.ly/1kctUPr

[MUSIC PLAYING] Hi, Dan Laub here. And in this lesson we're going to be working with data from multiple samples. But before we do so, let's talk about the key objective for this lesson. By the end of this lesson, you should be able to understand that sample means will better approximate population means as the sample sizes get larger. So let's get started.

Recall that a sample is a selection of observations from a larger group, which is known as a population. The idea behind taking a sample is that it is substantially easier to obtain a small number of observations than it is to obtain all observations. The sample should be randomly drawn so as to improve the probability of the sample observations being representative of the population. Even so, because the results of a random sample are subject to chance, the sample will not provide the observer with an exact estimate of what the population looks like.

Also recall that we use the mean or average of a group of observations as a method of determining the center of a data set. Regardless of whether this data is a sample or a population, the mean is still a useful measure of center. However, we cannot expect the mean of a random sample and the mean of the population to be the same, since even though the sample may be randomly selected, it is still only a portion of the population.

For example, suppose that we were to select 100 random people on a college campus and ask them their age. Is it possible that the mean age of our sample would equal the mean age of everyone on campus? Probably not. But why?

Well, maybe many of the 100 people we asked happened to be freshmen, so the sample mean was likely to be less than that of the overall population of the college. Or maybe we randomly sampled a lot of faculty members, so the mean of the sample was actually higher than that of the campus population.
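To see this gap between a sample mean and a population mean for ourselves, here is a minimal Python sketch. The campus population, the age ranges, and the split between students and faculty are all made up for illustration; none of these numbers come from the lesson.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical campus of 10,000 people (illustrative numbers only):
# mostly students aged 18-24, plus a smaller group of faculty aged 30-65.
students = [random.randint(18, 24) for _ in range(9000)]
faculty = [random.randint(30, 65) for _ in range(1000)]
campus_ages = students + faculty

population_mean = statistics.mean(campus_ages)

# Draw one random sample of 100 people, without replacement.
sample = random.sample(campus_ages, 100)
sample_mean = statistics.mean(sample)

print(f"Population mean age: {population_mean:.2f}")
print(f"Sample mean age:     {sample_mean:.2f}")
print(f"Difference:          {sample_mean - population_mean:+.2f}")
```

Changing the seed draws a different group of 100 people, and the difference will usually change with it. That is the chance variation described above.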

Typically, the mean of one sample is not itself an accurate representation of the entire population. Because of this, we need to take multiple samples and find the mean of each sample that we take. This approach will generally provide a more accurate estimate than simply taking one sample.

When we collect a lot of random samples we will get a lot of sample means, one for each sample. However, keep in mind that there is likely to be a difference in how the sample means that we collect are distributed relative to the population mean. It is still possible, even if we collect multiple samples, that none of the sample means will be close to the population mean. Generally, though, if we collect more samples and look at the mean of all the sample means, our estimate of the population mean will tend to be more accurate.
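The idea of taking many samples and averaging their means can be sketched in a few lines of Python. The population here is hypothetical, and the helper name `sample_means` is just something chosen for this example.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 ages (illustrative only).
population = [random.randint(18, 65) for _ in range(10_000)]
population_mean = statistics.mean(population)


def sample_means(population, sample_size, num_samples):
    """Return the mean of each of num_samples random samples."""
    return [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(num_samples)
    ]


means = sample_means(population, sample_size=100, num_samples=50)
mean_of_means = statistics.mean(means)

# Compare the combined estimate with the single sample that missed by the most.
worst_single_error = max(abs(m - population_mean) for m in means)
combined_error = abs(mean_of_means - population_mean)

print(f"Population mean:            {population_mean:.2f}")
print(f"Mean of the 50 sample means: {mean_of_means:.2f}")
print(f"Worst single-sample error:  {worst_single_error:.2f}")
print(f"Error of the mean of means: {combined_error:.2f}")
```

The mean of the sample means can never miss by more than the worst individual sample does, and in practice it tends to land much closer to the population mean.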

It is important to realize what the shape of the distribution of the sample means looks like. It is approximately normal, meaning that the sample means follow a bell-shaped curve whose center is the population mean. More precisely, if the samples we randomly draw are large enough, the distribution of the sample means will approach a normal distribution that is centered on the population mean.

Remember that the standard deviation is a measurement of how much a data set varies from the mean or how spread out the data is. A relatively high standard deviation indicates that the data is more spread out than if the standard deviation were lower. In the event that the sample sizes are small, the sample means will typically be more spread out from the center of the distribution, or possibly not even normally distributed at all. If the sample sizes are large, on the other hand, the sample means will tend to be clustered close to the center of the distribution. So when we need the distribution of the sample means to be close to a normal distribution, we need to draw larger samples.
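Here is a rough simulation of how the spread of the sample means shrinks as the sample size grows. The skewed population is invented for the example, and the comparison against the population standard deviation divided by the square root of the sample size is the standard textbook rule rather than something stated in this lesson.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical skewed (non-normal) population: exponential values, mean near 10.
population = [random.expovariate(1 / 10) for _ in range(20_000)]
population_sd = statistics.pstdev(population)


def spread_of_sample_means(sample_size, num_samples=1000):
    """Standard deviation of the means of num_samples random samples."""
    means = [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(num_samples)
    ]
    return statistics.stdev(means)


for n in (4, 25, 100):
    observed = spread_of_sample_means(n)
    expected = population_sd / math.sqrt(n)  # textbook sigma / sqrt(n) rule
    print(f"n = {n:>3}: spread of sample means = {observed:.2f}, "
          f"sigma/sqrt(n) = {expected:.2f}")
```

With samples of 4 the means are scattered widely, while with samples of 100 they cluster tightly around the population mean, even though the population itself is far from bell shaped.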

Just like we can determine a z-score for individual values, we can also determine the z-score of a mean. Using a z-distribution graph, we can see that the means whose z-scores are closer to zero are more likely to occur in a sample, and the means that have z-scores farther from zero are less likely to occur in a sample.
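The lesson does not spell out a formula, but the standard way to write the z-score of a sample mean, assuming the observations are drawn independently at random, is shown here. The symbols are the usual ones: the sample mean $\bar{x}$, the population mean $\mu$, the population standard deviation $\sigma$, and the sample size $n$.

$$
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
$$

The denominator gets smaller as $n$ grows, so the same gap between $\bar{x}$ and $\mu$ produces a z-score farther from zero when the sample is larger.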

For example, suppose that we know the mean number of children per classroom in American elementary schools is 27 with a standard deviation of 3. Additionally, let's assume that the population distribution in this scenario is not normally distributed. As you can see in this graph, a sample of 100 schools indicates that the standard deviation of the sample is 5.

As we continue to look at more samples, we can see that as the sample size changes, the shape of the sample distribution changes as well. As you can see on the graph, the sample means of four of the samples are 25.3 students per class, 28.9 students per class, 27.9 students per class, and 26.8 students per class. Notice in particular that smaller samples result in a wider distribution, while larger samples result in a narrower distribution.
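We can also apply the z-score of a mean to the four sample means just listed. The population mean of 27 and standard deviation of 3 come from the lesson, but the sample sizes behind those four means are not given, so the sample size of 36 used here is purely an assumption made to keep the arithmetic simple. The function name `z_score_of_mean` is likewise just a label for this sketch.

```python
import math

POPULATION_MEAN = 27       # mean students per classroom, from the lesson
POPULATION_SD = 3          # population standard deviation, from the lesson
ASSUMED_SAMPLE_SIZE = 36   # hypothetical: the lesson does not state sample sizes


def z_score_of_mean(sample_mean, population_mean, population_sd, sample_size):
    """Distance of a sample mean from the population mean, measured in
    standard errors (population_sd / sqrt(sample_size))."""
    standard_error = population_sd / math.sqrt(sample_size)
    return (sample_mean - population_mean) / standard_error


sample_means = [25.3, 28.9, 27.9, 26.8]  # the four means quoted in the lesson
for m in sample_means:
    z = z_score_of_mean(m, POPULATION_MEAN, POPULATION_SD, ASSUMED_SAMPLE_SIZE)
    print(f"sample mean {m}: z = {z:+.1f}")
```

Under that assumed sample size, 26.8 has the z-score closest to zero and 28.9 the one farthest from zero, so 26.8 would be the more typical result and 28.9 the more unusual one. A different sample size would change the z-scores but not that ordering.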

So let's go back to our objective just to make sure we covered what we said we would. We wanted to be able to understand that sample means better approximate population means as the sample sizes get larger, and we did that. So again, my name is Dan Laub, and I hope you got some value from this lesson.