Hi. This tutorial covers the mean. A lot of times, when people see the word mean, they instantly think of the word average. But average can actually be represented by multiple different measures, so we're going to define mean as a precise term.
The mean is defined as a measure obtained by adding all of the values in a data set and dividing by the number of values. The mean can be used to describe the center, or sometimes we call it the central tendency, of a data distribution. So the population mean is often unknown and is denoted with the Greek letter mu. So this is our Greek letter mu. A lot of times, I'll draw mu like that.
Since it's coming from the population, it's used to denote a population mean. The reason it is often unknown is that very seldom will you have every element of a population. Populations are generally large, so you won't have the entire population. So a lot of times, we'll not be able to calculate mu.
Now, the mean of a sample generally can be calculated. And the sample mean is denoted with the symbol x bar, x with a bar over it, so we call it x bar. So we're going to kind of concentrate more on the sample mean.
So if we put the sample mean into a formula, our formula is going to look something like this. So it's x bar equals. Now we have the Greek letter sigma here, which represents the sum. We'll talk about that in a minute. And then it has some index value. So i equals 1, and then at the top, it has n. So what that means is that it's going to sum up all of the x values starting with x sub 1 and ending at x sub n.
So if you have 100 values, it's going to add up x1 plus x2 plus x3, all the way up to x sub 100. And then we would divide by n, where n is the sample size. And the Greek letter sigma, again, represents summation notation. Just to provide you with the definition, summation notation is a concise way to represent the sum of similar terms, which are expressed following the sigma. So it's basically anything that's after that sigma, that's what we're going to be adding up.
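Written out, the formula described above looks like this, where x sub i is the i-th data value and n is the sample size:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}
```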
So now let's actually find the sample mean of a data set. So pretty simple data set. So what we're going to do first is find the sum of all of these. So I'm just going to do this quickly in the calculator. So I just add all those up. And if I hit Enter here, I get a sum of 105. And then I'm going to take that sum, and I'm going to divide by the sample size, which in this case is nine. 1, 2, 3, 4, 5, 6, 7, 8, 9, so 9 values.
Hit Enter there, and I get 11.667. So to kind of summarize what we just did there, x bar equals the sum 2 plus 2 plus 4 plus 8 plus 8 plus 9 plus 10 plus 10 plus 52. And let's just make sure I have all the values here. 1, 2, 3, 4, 5, 6, 7, 8, 9, and I do. And we divided that by 9. We got a sum of 105. And if we divide that by 9, we end up with a sample mean approximately equal to 11.667. And that's rounded to the nearest thousandth.
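If you'd rather check that calculator work in code, here is a minimal Python sketch of the same steps. The function name sample_mean is just an illustrative choice, not something from the tutorial.

```python
# The nine values from the worked example above.
data = [2, 2, 4, 8, 8, 9, 10, 10, 52]


def sample_mean(values):
    """Add up all the values and divide by how many there are."""
    return sum(values) / len(values)


total = sum(data)          # the sum of the values
n = len(data)              # the sample size
x_bar = sample_mean(data)  # the sample mean, 105 / 9

print(total, n, round(x_bar, 3))  # 105 9 11.667
```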
Now, if we kind of interpret that mean, 11.667, and kind of place that within our data set, that would be in between 10 and 52. So eight out of the nine values are lower than your mean, and you only have one value that's above the mean. So I would say that this mean right here doesn't do a real great job of summarizing this data set. And the reason is because of this 52. And remember, we call that 52 an outlier.
So that outlier is really distorting this sample mean. We can see that without that outlier, you're going to get a mean much more toward kind of the middle of those values. But with that large outlier, you are going to get a mean that seems a little bit higher than maybe you would think would summarize that data set.
So when you have that outlier that distorts it, a lot of times what we'll say is that the mean is sensitive to outliers. The mean is really influenced by those really high or really low values. So one thing to consider when you're calculating a mean, if you have outliers, you might want to use a different measure of center other than the mean.
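To see how much that one outlier pulls the mean, here is a small Python sketch that computes the mean of the same nine values with and without the 52. Dropping the 52 is only for comparison here; it isn't a recommendation to delete outliers from real data.

```python
# Same data set as in the example above.
data = [2, 2, 4, 8, 8, 9, 10, 10, 52]

# Copy of the data with the outlier (52) left out, for comparison only.
without_outlier = [x for x in data if x != 52]

mean_with = sum(data) / len(data)
mean_without = sum(without_outlier) / len(without_outlier)

print(round(mean_with, 3))  # 11.667
print(mean_without)         # 6.625
```

Without the 52, the mean lands at 6.625, which sits in the middle of the other eight values.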
So we have another type of mean called the weighted mean. Weighted mean is a type of mean where some data values contribute more than others. So let's take a look at an example. Suppose you're interested in determining the average amount of money a typical shopper spent on a trip to a sporting goods store. It is known that men spend $22 on a typical trip to the sports store, and women spend $28. It is also known that about 55% of shoppers at this store are men.
All right, so let's check it out. Let's just start by notating our two means. A lot of times, what I'll do is use subscripts to distinguish between the men's and women's means. So this is x bar with a subscript of m. So I'll say that as x bar sub m, and this one as x bar sub w. So the mean for men was $22. The mean for women was $28.
Now, to calculate the weighted mean, the weighted mean is going to equal each of these individual means multiplied by their weights. So I'm going to take this $22, and I'm going to multiply it by the weight for men. And the weight is 55%, but I want that written as a decimal, so 0.55. So it's 22 times 0.55 plus 28 times 0.45.
So once I add those, that will give me the weighted mean. So let's go ahead and do that. 22 times 0.55 plus 28 times 0.45. That ends up giving me 24.7. So this is in terms of money. So I'm going to write that as $24.70. So if we are just to find the average of these two numbers, so if they were weighted equally, we would end up with $25.
But since the men are weighted more-- this 22 is weighted more than the 28-- we're going to get a number a little bit below 25. So we end up with $24.70. So that is how to calculate a weighted mean. So that has been the tutorial on the mean. Thanks for watching.
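The weighted mean calculation from the sporting goods example can be sketched in Python like this. The names weighted_mean, means, and weights are illustrative. Dividing by the total weight is an extra step that doesn't matter here, since 0.55 and 0.45 already add up to 1, but it keeps the function usable for weights that don't.

```python
# Group means and weights from the sporting goods example.
means = {"men": 22.0, "women": 28.0}
weights = {"men": 0.55, "women": 0.45}  # 55% men, so 45% women


def weighted_mean(values, weights):
    """Multiply each value by its weight, add them up, divide by total weight."""
    total_weight = sum(weights.values())
    return sum(values[k] * weights[k] for k in values) / total_weight


wm = weighted_mean(means, weights)
unweighted = sum(means.values()) / len(means)  # both groups weighted equally

print(f"${wm:.2f}")          # $24.70
print(f"${unweighted:.2f}")  # $25.00
```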
Source: IMAGE OF TABLE CREATED BY RYAN BACKMAN
Hi. This tutorial covers the median. So the median is the middle value of an ordered data set. A median can be used to describe the center or the central tendency of a data distribution. Remember, when you're describing a distribution, you want to mention the center, the spread, the shape, and any unusual features. So the median is one way of measuring the center, so quantifying where that center really is.
All right, so now if n-- so there's a distinction now based on your sample size. Because if you have an even number of numbers, you won't have one middle value. But if n is odd, the median is the middle number of the ordered list. If n is even, the median is the mean of the two middle numbers of an ordered list. So one thing that's really big here is that you have to have an ordered list.
OK, so let's take a look at an example. Consider the following data set collected in 2002 from 10 midsize luxury cars, each involved in a five mile per hour test crash. The data shows the cost in dollars of repairing the car. So notice that a couple cars here must not have been damaged at all, so that was zero dollars. One car cost $900. One car cost $1,254. This one cost $4,194.
I think the reason for this one is that, as the article where this data came from described, all the airbags deployed in this car, so they had to replace all the airbags, which was pretty costly. So let's determine the median repair cost for the sample of cars. So to find the median, the first thing we need to do is put the data in order. So I'm going to go least to greatest.
So 0, 0. Sometimes it's helpful to cross them off. 0, 0, 234. The next largest is 670. Then we have 707, then 769, then 900, then 979, then 1,254, and 4,194. OK, so make sure that you have the same number of numbers in your ordered list as you did when you started. So I'm just going to count to make sure I have 10. So 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. OK, good.
Now, for this median, first of all, we have an even number of numbers. So we know that we're going to have to add two values together and divide by 2 to figure out what the actual median is. One way that people do it is they'll cross off the same number of values from each end until they get to the median value. I think it's easier, though, just to kind of think about where it's going to be.
So if we have 10 values, we know that the median will be between the fifth and the sixth value. That will give me five values on each side of the median. So really, if I just count in, 1, 2, 3, 4, 5, I land on 707, and the next one is 769. So 707 would be my fifth, and 769 would be my sixth. So I know my median is going to be right in the middle here.
So a lot of times, I'll just separate into the lower five and the upper five that way. And then I'm going to calculate my median by taking the average of 707 and 769. Add those together and divide by 2. OK, so to do that, 707 plus 769 divided by 2. We end up with 738. So my median repair cost is $738.
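Here is a minimal Python sketch of the odd and even rule applied to the repair costs. The function name median is an illustrative choice. The costs are typed in the ordered form read out above, but the function sorts them anyway, since the rule only works on an ordered list.

```python
def median(values):
    """Middle value of the ordered list, or the mean of the two middle values."""
    ordered = sorted(values)  # the list has to be in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: there is a single middle value.
        return ordered[mid]
    # Even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Repair costs in dollars for the 10 cars.
repair_costs = [0, 0, 234, 670, 707, 769, 900, 979, 1254, 4194]

print(median(repair_costs))  # 738.0
```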
So if you notice, this value does seem to represent the data pretty well. We do have a couple in the 700s, one in the 600s, a couple in the 900s. So I would say 738 is a pretty good way of measuring the center of the distribution. Notice that this really high number, the airbag one, is an outlier. But that outlier didn't really significantly affect this median. Even if this value were 10,000, the median is going to be the same.
Or even if this number weren't an outlier, let's say it was 1,260, our median wouldn't change. So what we say is that the median is resistant to outliers. If you've studied the mean, you know that the mean is sensitive to outliers. The median is resistant to outliers. So it's kind of nice that if you have a data set with outliers, the median is sometimes a better measure of center than the mean.
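As a quick check of that resistance, this Python sketch swaps the largest repair cost for 10,000 and then for 1,260 and recomputes both the median and the mean each time. It uses the median function from Python's standard statistics module. The variable names are illustrative.

```python
import statistics

# The nine smaller repair costs; only the largest value is varied.
base = [0, 0, 234, 670, 707, 769, 900, 979, 1254]

results = {}
for top in (4194, 10000, 1260):
    costs = base + [top]
    results[top] = (statistics.median(costs), sum(costs) / len(costs))

for top, (med, mean) in results.items():
    print(top, med, round(mean, 1))
```

The median comes out as 738 in all three cases, while the mean moves from about 970.7 to 1551.3 to 677.3.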
All right, let's take a look at another example. We're also going to calculate the median, but now we're going to look at what happens if the data is summarized in a frequency distribution instead of just a big list. So what this distribution represents is the ages of US presidents on the day of inauguration. President Obama is the 44th president, so in this case, n is 44. And I want the median.
So the median is going to be in between the 22nd and the 23rd data value. So notice here, we have two presidents that were between 40 and less than 45 at their inauguration. There were seven between 45 and less than 50. So what I can do here is see where the 22nd and the 23rd data values fall.
So what I'm going to do is add up the frequencies, one by one, and try to see where the 22nd and 23rd data values are. So I know that up to here, through the 50 to less than 55 class, there were 22. And then down to here, through the 55 to less than 60 class, there were 34. So what that means is that the 22nd value is the last one below 55, and the 23rd value is the first one in the 55 to less than 60 class. So my median, I know, will lie right on the boundary between these two classes.
So that means that our median is going to be that boundary. The boundary is 55. So I know that my median here would be 55, because it was right on that boundary. So that's one way to calculate it.
So let's look at the same example. We're just going to change it slightly. So now suppose we remove President Obama from the list. So he was 47 when he was inaugurated. So notice that seven now went down to a six. We want to know now what is the new median.
So now, up to here, there are only 21 values. And with 43 values, the median is the 22nd value, so that now means that the median is in this next interval. Because we don't actually have the raw data, we won't be able to give the median a number. So what I say here is that instead of just having one number as the median, we're now going to have a median class.
So, in this case, the median class is 55 to less than 60. Since we have no way of knowing what number is our actual median, 55 to less than 60 is my median class. And then, just to define that, the median class is the interval in which the middle data value lies. So the median class of the example we just did is 55 to less than 60. All right, that has been your tutorial on the median. Thanks for watching.
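The cumulative counting from the presidents example can be sketched in Python as below. The table here is rebuilt from the numbers mentioned in the tutorial: 2 and 7 are stated directly, 13 and 12 come from the running totals of 22 and 34, and the last row lumps the remaining 10 presidents into one open-ended class of 60 and over, which is an assumption made only because the tutorial doesn't read out those rows. The function name middle_classes is illustrative.

```python
def middle_classes(table):
    """Return the class or classes that hold the middle position(s).

    table is a list of (lower, upper, frequency) rows in increasing order.
    """
    n = sum(freq for _, _, freq in table)
    # 1-based middle positions: one if n is odd, two if n is even.
    positions = [(n + 1) // 2] if n % 2 == 1 else [n // 2, n // 2 + 1]
    found = []
    for pos in positions:
        cumulative = 0
        for lower, upper, freq in table:
            cumulative += freq  # running total of the frequencies
            if pos <= cumulative:
                if (lower, upper) not in found:
                    found.append((lower, upper))
                break
    return found


# Ages at inauguration, all 44 presidents. None marks the open-ended
# last class (assumed grouping, see the note above).
with_obama = [(40, 45, 2), (45, 50, 7), (50, 55, 13), (55, 60, 12), (60, None, 10)]

# Same table with President Obama (age 47) removed: the 7 becomes a 6.
without_obama = [(40, 45, 2), (45, 50, 6), (50, 55, 13), (55, 60, 12), (60, None, 10)]

print(middle_classes(with_obama))     # [(50, 55), (55, 60)]
print(middle_classes(without_obama))  # [(55, 60)]
```

When two neighboring classes come back, the two middle values sit on either side of a shared boundary, which is the case where the median was taken to be 55. When one class comes back, that class is the median class.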
Hi. This tutorial covers the mode. So the mode is defined as the most frequently occurring data value. So the mode can be used to describe the center or the central tendency of a data distribution. Remember, when you're describing a data distribution, you want to comment on the center, the spread, and the shape of the distribution, as well as any other unusual features.
And one way of measuring the center is to calculate the mode. The other measures of center are the mean and the median. One thing I'll note about the mode is that sometimes you won't have one. If you don't have a most frequently occurring value, let's say all of your data values only show up once, then there is no mode. So I would like to point out that sometimes your data distributions will not have a mode, all right?
But a lot of times they do. And it's easier to calculate the mode if the data is ordered or summarized in a table or a graph. All right, so let's take a look at an example. The following data show the number of runs scored by the Minnesota Twins baseball team in the final 10 games of the 2011 season. So we have 4, 4, 3, 5, 2, 6, 6, 3, 7, 1. So in the very last game of the season, they only scored one run. Second-to-last game, they scored seven runs.
So let's go ahead and determine the mode number of runs scored over this time period. So, again, I think it's easier to do this if the data's in order, so I'm going to order it. So 1 is my smallest, then 2. We have two 3's, two 4's, a 5, two 6's, and a 7. Again, I like to count to make sure I have all of them-- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. I do. So I do have all 10 of my values.
Now, the mode is the most frequently occurring number. Now, in this case, I don't just have one value that represents the mode. So notice, 1 had a frequency of one. 2 had a frequency of one. 3 had a frequency of two. 4 had a frequency of two. 5, one. 6, two. 7, one. So, actually, here, I really have three modes.
This won't always be like this. Sometimes you'll just have one mode. But your three modes are 3, 4, and 6, OK? So it is OK to have more than one mode. Generally, it's not as useful if you have multiple modes. Maybe if we went back another game and they scored six runs, if six happened three times, that's going to be a little more meaningful if you just have one mode. But it is possible to have multiple modes.
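Here is a short Python sketch of finding the mode or modes by counting frequencies, using Counter from the standard collections module. The function name modes is illustrative, and returning an empty list when every value shows up only once is a convention chosen here to match the idea that such a data set has no mode.

```python
from collections import Counter

# Runs scored by the Twins in the final 10 games of 2011.
runs = [4, 4, 3, 5, 2, 6, 6, 3, 7, 1]


def modes(values):
    """Return every value that ties for the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []  # empty data: nothing to report
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(runs))        # [3, 4, 6]
print(modes(runs + [6]))  # [6]  (one more six-run game gives a single mode)
print(modes([1, 2, 3]))   # []
```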
All right, so let's take a look at another example, also about the Minnesota Twins. The following frequency table shows the number of runs scored by the Twins for all 162 games in 2011. OK, so now notice, instead of just 10 games, we have all 162. So if you were to add up all these frequencies, these will add up to 162.
And then notice here your runs scored. All of the numbers are accounted for. It looks like they never scored 12 runs in a game, so it jumps from 11 to 13. So now the mode here is really easy to determine, because now you just look for the highest frequency. And if I look down, 28 is the highest frequency. 26 is close, but 28 is the highest. And 28 matches up with one run. So, in this case, my mode is equal to one.
That's actually maybe a good reason why they didn't win that many games in 2011: in 28 games, they only scored one run. One run will not win you many baseball games. So it's pretty easy to determine the mode when you have a frequency table like this. Again, you can have multiple modes if you're looking at a frequency table, also. So let's say that the frequency for three runs was also 28. Then my modes would be both 1 and 3.
OK, so that's kind of how to determine the mode using a frequency table. All right, and the last thing we'll look at is that for larger data sets, there can sometimes be local high points that are also called modes. So if we look in this example, we said that one run was the mode, because that was the most frequently occurring value. But we can also see that 3 is another very high point here. So sometimes we'll consider both 1 and 3 to be modes.
Now, if we look at this distribution, we can see that there are some local high points here, also. So we have a local high point here around the 140s, a local high point here around 180. So since both of these are high points, we could call both of those two modes. So since this distribution has two modes, we can call it a bimodal distribution, bimodal meaning two modes.
So, again, this is a pretty clear picture of a bimodal distribution, because you do have those two local peaks. All right, that has been the tutorial on the mode. Thanks for watching.