This tutorial is going to cover sampling, both with and without replacement.
With replacement means that you put each observation back once you've selected it, so it can be selected again. Without replacement means that each observation is not put back once it's selected. Once it's selected, it's out. It can't be selected again.
Typically, one big requirement for statistical inference is that the individuals, the values from the sample, are independent. One doesn't affect any of the others. Ideally, this would mean sampling with replacement.
Here is an example using cards.
The probability that you select a spade on the first draw is one fourth, since 13 of the 52 cards are spades.
Now suppose that you draw it and you don't put it back. Take the 10 of spades, and pull it out.
Now there are only 12 spades left out of 51 cards, so the probability of a spade on the second draw is 12/51, or about 0.235. That is not one fourth.
The first draw and the second draw are dependent. The probability of a spade changed after knowing that you got a spade on the first draw.
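The arithmetic behind this example can be checked with a short Python sketch. It assumes a standard 52-card deck with 13 spades; the variable names are illustrative and not part of the original tutorial.

```python
from fractions import Fraction

# Assumption: one standard deck, 52 cards, 13 of them spades.
spades, cards = 13, 52

# First draw: 13 spades out of 52 cards.
p_first = Fraction(spades, cards)

# Second draw, after a spade was removed and NOT put back:
# 12 spades remain among 51 cards.
p_second_given_spade = Fraction(spades - 1, cards - 1)

print(p_first, float(p_first))                                      # 1/4 0.25
print(p_second_given_spade, round(float(p_second_given_spade), 4))  # 4/17 0.2353
```

Because the second probability depends on what happened on the first draw, the two draws are dependent.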
What if the card got replaced? The probability of a spade on the first draw is one fourth. Then you pull the 10 of spades. Then you put it back.
Now what's the probability of a spade on the second draw?
It's one fourth again. It's the same 52 cards. So you have the same likelihood of selecting a spade.
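To see the difference between the two schemes empirically, here is a small simulation sketch. It estimates the chance of a spade on the second draw given a spade on the first, once with replacement and once without. The function name, trial count, and seed are illustrative choices, and simulated rates will only approximate the exact values of 0.25 and 12/51.

```python
import random

SUITS = ["spades", "hearts", "diamonds", "clubs"]
# Assumption: a standard deck modeled as (rank, suit) pairs, 13 ranks per suit.
DECK = [(rank, suit) for suit in SUITS for rank in range(1, 14)]

def second_spade_rate(replace, trials=100_000, seed=0):
    """Estimate P(spade on draw 2 | spade on draw 1) by simulation."""
    rng = random.Random(seed)
    first_spade = 0
    both_spade = 0
    for _ in range(trials):
        if replace:
            # With replacement: both draws come from the full 52 cards.
            first, second = rng.choice(DECK), rng.choice(DECK)
        else:
            # Without replacement: two distinct cards from the deck.
            first, second = rng.sample(DECK, 2)
        if first[1] == "spades":
            first_spade += 1
            if second[1] == "spades":
                both_spade += 1
    return both_spade / first_spade

with_rate = second_spade_rate(replace=True)
without_rate = second_spade_rate(replace=False)
print(f"with replacement:    {with_rate:.3f}")
print(f"without replacement: {without_rate:.3f}")
```

With replacement, the estimate should land close to one fourth. Without replacement, it should land close to 12/51.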
In practice, it's not often that you sample with replacement. This matters because sampling with replacement leads to independence, which is a requirement for a lot of statistical analysis. But you wouldn't call the same person twice for their opinion in a poll, so you don't put someone back into the population where they could be sampled again. It doesn't make sense to do in real life.
You need a workaround. Even though the sampling done in real life doesn't technically fit the definition of independent observations, there is a condition under which it comes close enough.
Suppose that your population was very large. Suppose you had four decks of cards, totaling 208 different cards.
Suppose the worst case scenario happened in terms of independence and every card picked was the same suit. Take five diamonds from the group.
On the first draw, there are 52 diamonds out of 208 cards, so the probability of a diamond is one fourth, the same as if there were one deck. The larger population has an effect on the later draws. Look at the probability of a diamond on the last draw. It's 48 remaining diamonds out of 204 remaining cards.
The probability is about 0.24, which is different from 0.25, but not hugely. This is even after five draws. The probability of a diamond didn't change much from the first draw to the last.
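The worst-case numbers can be reproduced with a short sketch. The helper `worst_case_probs` is a hypothetical name introduced here for illustration. It lists the probability of the chosen suit on each draw, assuming every earlier draw was that same suit and nothing was put back. The one-deck comparison is added only for contrast.

```python
from fractions import Fraction

def worst_case_probs(suit_count, total, draws):
    """Probability of the suit on each draw, assuming every earlier
    draw was that same suit and no card was replaced."""
    return [Fraction(suit_count - k, total - k) for k in range(draws)]

# Four decks: 52 diamonds among 208 cards, five draws.
four_decks = worst_case_probs(52, 208, 5)
# One deck for contrast: 13 diamonds among 52 cards, five draws.
one_deck = worst_case_probs(13, 52, 5)

for draw, (big, small) in enumerate(zip(four_decks, one_deck), start=1):
    print(f"draw {draw}: four decks {float(big):.4f}, one deck {float(small):.4f}")
```

On the fifth draw the four-deck probability is 48/204, about 0.2353, while the one-deck probability has fallen to 9/48, or 0.1875. The same five draws shift the probabilities far less when the population is larger.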
This is the catch for independence. When you sample without replacement, if the population is large enough, then the probabilities don't shift very much as you sample. The sampling without replacement becomes almost independent because the probabilities don't change very much.
The question is, when is the population large enough? How large is considered a large population? You're going to institute a rule. A large population is going to be at least 10 times as large as the sample. That is, the population size is greater than or equal to 10 times n, where n is the sample size.
If that's the case, then you're going to say that the probabilities don't shift very much when you sample several items, n items from the population. Therefore, you can treat the sampling as being almost independent.
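The rule can be written as a one-line check. The function name `nearly_independent` and the poll numbers below are hypothetical, chosen only to illustrate the comparison of population size against 10 times n.

```python
def nearly_independent(population_size, sample_size):
    """Apply the 10-times rule: the population must be at least
    10 times as large as the sample."""
    return population_size >= 10 * sample_size

# Five cards drawn from four decks (208 cards): 208 >= 50.
print(nearly_independent(208, 5))        # True
# Ten cards drawn from one deck (52 cards): 52 < 100.
print(nearly_independent(52, 10))        # False
# Hypothetical poll: 1,000 people sampled from a city of 300,000.
print(nearly_independent(300_000, 1_000))  # True
```

When the check returns True, sampling without replacement can be treated as almost independent under this rule.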
Sampling with replacement is kind of the gold standard. It always creates independent trials. The probability of particular events doesn't change at all from trial to trial. However, in real life, when you sample without replacement, the probabilities do necessarily change. Your workaround is that if the population from which you're sampling is at least 10 times larger than the sample that you're drawing, the trials can be considered nearly independent.
Source: This work adapted from Sophia Author Jonathan Osters.
Sampling With Replacement: A sampling plan where each observation that is sampled is replaced after each time it is sampled, resulting in an observation being able to be selected more than once.
Sampling Without Replacement: A sampling plan where each observation that is sampled is kept out of subsequent selections, resulting in a sample where each observation can be selected no more than one time.