Data Analysis

Author: Ryan Backman

Video Chapters

( 00:00 - 00:43 ) Definition of Data Analysis

( 00:44 - 01:55 ) Comparison of Distributions with Different Centers

( 01:56 - 02:21 ) Definition of Center

( 02:22 - 03:26 ) Comparison of Distributions with Different Spread

( 03:27 - 04:46 ) Definition of Spread

( 04:47 - 06:19 ) Comparison of Distributions with Different Shapes

( 06:20 - 07:04 ) Definition of Shape

( 07:05 - 07:49 ) Analysis of a Distribution with Outliers

( 07:50 - 08:42 ) Definition of Outlier

( 08:43 - 09:39 ) Recap

Video Transcription

Hi. This tutorial covers Data Analysis. So let's start by defining that term, data analysis. Data analysis is the process of describing a data distribution by analyzing the center, spread, shape, and any unusual features, especially things like outliers.

All right. So let's look at the distribution of exam score data. The type of graph is going to be a dot plot, for several different high school math classes. OK? All the dot plots you'll see will show different math classes, and all of the exams were out of 100 possible points. OK?

So we're going to start with three classes to begin with, class A, class B, and class C. Each dot on this dot plot represents one student, and it represents the score that they got on the test. I think there are a hundred students represented in each class.
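To make the "one dot per student" idea concrete, here is a minimal sketch of a text dot plot. The scores are made up for illustration (they are not the data from the video), and the `dot_plot` helper and its 5-point bin width are assumptions chosen just for this example.

```python
from collections import Counter


def dot_plot(scores, bin_width=5):
    """Return a text dot plot: one row per score bin, one dot per student."""
    # Round each score down to the start of its bin, then count per bin.
    counts = Counter((s // bin_width) * bin_width for s in scores)
    lines = []
    for start in range(min(counts), max(counts) + bin_width, bin_width):
        lines.append(f"{start:3d} | " + "." * counts.get(start, 0))
    return "\n".join(lines)


# Hypothetical exam scores, invented for illustration only.
scores = [62, 64, 65, 68, 70, 71, 71, 73, 75, 78]
print(dot_plot(scores))
```

Each row is a range of scores, and the number of dots in the row is the number of students who scored in that range.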

So the first question is, what is the distinctive difference between the distributions of scores for classes A, B, and C? OK. So take a minute to just look at each distribution. The main difference in those distributions was where they were centered. OK? So if we're thinking about where they're centered, class A is centered at a much larger value than, say, class C. OK? In class A, I'd say a lot of people were clustered in this 75 to 85 range. Whereas, down here in class C, most of the students were clustered around maybe 55 to 65. OK?

So this word center is a way to describe the middle of the data distribution. So the distributions of dots are centered at different values. A typical student in class A would do better than a typical student in classes B or C. So if I had a child in one of these classes, I would certainly want my child to be in class A rather than class C.
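As a rough sketch of how center can be measured numerically, the mean and the median are two common choices. The tutorial does not name a specific measure here, so using these two is an assumption, and the scores below are hypothetical values picked to resemble a higher-centered class and a lower-centered class.

```python
from statistics import mean, median

# Hypothetical exam scores, invented for illustration only.
class_a = [75, 78, 80, 80, 82, 85, 88]
class_c = [52, 55, 58, 60, 62, 65, 68]

for name, scores in [("A", class_a), ("C", class_c)]:
    print(f"Class {name}: mean = {mean(scores):.1f}, median = {median(scores)}")
```

Both measures come out about 20 points higher for the hypothetical class A, which matches the idea that a typical student there does better.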

All right. So let's take a look at three new classes. These classes are D, E, and F. So again, let's think about the differences between these three classes. So what is the distinctive difference among the distributions of scores for classes D, E, and F? OK? So just take a minute to look at those three distributions.

Now, what I would say is the major difference here is how spread out the data values are. OK? We can see, in class D, just about everybody scored between 60 and 80. Nobody above 80, nobody below 60. OK?

Now, if we go to E, most of the data is still clustered in this region, but now we can see that there's a few over here in the 80s. There's also a few down here in the 50s. OK? And then class F, again, a lot of students in this 60 to 80 range, but now we can see that they're really getting spread out. So we've got some students down here in the 40s, also some students here in the 90s.

So I would say the center of these three distributions is about the same, but now the distinctive difference is the spread. So the spread is a way of describing how close or far the data are from the middle, or the center. Spread is also known as variation. So those words can be used interchangeably. So the spread of exam scores among students in class F is much greater than the spread of exam scores in class D.
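One way to put a number on spread, sketched under assumptions: the range (highest minus lowest score) and the standard deviation. The tutorial does not say which measure to use, so both are my choice here, and the two score lists are hypothetical, built to share the same mean while differing in spread.

```python
from statistics import mean, pstdev

# Hypothetical exam scores, invented for illustration only.
class_d = [62, 65, 68, 70, 72, 75, 78]  # tightly packed
class_f = [42, 55, 65, 70, 75, 85, 98]  # widely spread


def score_range(scores):
    """Distance between the highest and lowest score."""
    return max(scores) - min(scores)


for name, scores in [("D", class_d), ("F", class_f)]:
    print(
        f"Class {name}: mean = {mean(scores)}, "
        f"range = {score_range(scores)}, "
        f"std dev = {pstdev(scores):.1f}"
    )
```

The means match, so the centers are the same, but the range and standard deviation are much larger for the hypothetical class F.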

So again, if we take a look at the three classes, one advantage of your student being in class D is that you know that the student's going to pass. They're going to get in the 60 to 80 range. That's just a given here based on the distribution. A disadvantage of being in class D is there's no way you can get in the 80s or the 90s, so you can't really get an A or B on the test. OK?

Jumping down to class F, a disadvantage is that the scores are really pretty unpredictable. Hopefully, your student wouldn't be in the 40s, but it's a possibility based on this class. One good thing is that your student could be one of those in the 90s. So when you have high variability, or high spread, the scores are a lot less predictable, but in this case, there is the possibility of getting a really good score.

OK. Let's look at three more classes. OK? So now, we're looking at G, H, and I. OK? So what is the distinctive difference between the distributions of scores for classes G, H, and I? OK?

So now, let's take a look at these. OK? So I would still say that, in general, they seem to be centered still around the same values. Maybe, somewhere around 50 would be a good place for the distribution to be centered at.

I would say the spread is about the same. This is about 20 to 80. This is going about 30 to 90. This is going about 15 to about 70. So the spreads aren't drastically different.

What's different here is the shapes of the distributions. So if we think about where the major cluster of the data is compared to the rest of the values, we can see for class G, the cluster seems to be right about in the middle, and then really it tails off to both sides. OK? If we look at class H, we have a big cluster in the lower values, and the tail of the distribution tails off to the right. OK? And then the last one, we have a cluster of data near the larger end-- or the higher end of the distribution, and the tail goes off to the left. OK? So what I would say here is the shapes of these distributions are a lot different. OK?

So shape is a way of describing how the data looks when graphed. OK? So generally, you're going to need a graph, not just a table, to look at the shape. OK? So the largest cluster of students in class G scored in the middle of the class spectrum. The largest cluster of students in class H scored in the lower end of the spectrum, and the largest cluster of students in class I scored in the higher end of the spectrum. OK? So generally, if I had a student in one of those classes, I would probably want them in class I, because chances are, they'd be in that cluster towards the larger values.
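A graph is still the best way to judge shape, but the direction of the tail can also be hinted at with a number. The sketch below uses a moment-based skewness coefficient, which is not something the tutorial introduces; it is an assumption added for illustration. A value near zero suggests tails on both sides, a positive value a tail to the right, and a negative value a tail to the left. The scores are hypothetical.

```python
from statistics import mean, pstdev


def skewness(scores):
    """Average cubed deviation from the mean, scaled by the std dev cubed."""
    m = mean(scores)
    s = pstdev(scores)
    return sum((x - m) ** 3 for x in scores) / (len(scores) * s ** 3)


# Hypothetical exam scores, invented for illustration only.
class_g = [30, 40, 45, 50, 50, 55, 60, 70]  # cluster in the middle
class_h = [30, 32, 35, 35, 40, 45, 60, 85]  # cluster low, tail to the right
class_i = [100 - x for x in class_h]        # mirror image: tail to the left

for name, scores in [("G", class_g), ("H", class_h), ("I", class_i)]:
    print(f"Class {name}: skewness = {skewness(scores):+.2f}")
```

The sign of the result follows the side the tail is on, which is the same thing the dot plots show visually.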

OK. One last dot plot, now we're not comparing this one to anything. We're just taking a look at just the distribution on its own. So it says, what's the distinguishing feature of the distribution of scores for class K?

Now, what I would say is, if we think about center, spread, and shape, the center is somewhere between 65 and 75. There's not a lot of spread. It's between about 60 and 80, similar to some of the other classes.

I would say the distinguishing feature is that student and that student. OK? This student scored way, way above the rest of the class. This student scored way below the rest of the class. Those two values are called outliers.

So an outlier is a data value that is significantly higher or lower than the majority of the other data values. So there must be a gap between the outlier and the bulk of the other values. OK? So if we had a bunch of students that scored in the 80s and 90s, this guy might not be considered an outlier.

The reason he's an outlier isn't necessarily because he scored really well. It's because he scored so much better than the rest. So because we had this gap, that made it an outlier. OK? Because there was this gap here, that made this guy an outlier. OK?

So you need to have a gap in order to have an outlier. OK? So class K has two outliers. Two students had scores that were significantly different from those of their classmates. One was significantly higher. One was significantly lower.
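The "you need a gap" idea can be sketched in code. This is only an illustration, not a standard outlier rule: the `gap_outliers` helper and its 10-point `min_gap` threshold are assumptions, and the scores are hypothetical. The sketch sorts the scores, splits them wherever two neighbors are at least `min_gap` apart, treats the largest group as the bulk of the class, and flags everything else. It assumes a non-empty list.

```python
def gap_outliers(scores, min_gap=10):
    """Flag values separated from the bulk of the data by a gap >= min_gap."""
    ordered = sorted(scores)
    groups = [[ordered[0]]]
    for value in ordered[1:]:
        if value - groups[-1][-1] >= min_gap:
            groups.append([value])    # big gap: start a new group
        else:
            groups[-1].append(value)  # close to the previous value: same group
    bulk = max(groups, key=len)       # the largest group is the bulk of the class
    return [v for g in groups if g is not bulk for v in g]


# Hypothetical exam scores, invented for illustration only.
class_k = [35, 62, 65, 66, 68, 70, 70, 72, 75, 78, 98]
print(gap_outliers(class_k))  # [35, 98]

# With more students in the 80s and 90s, the gap above the bulk closes,
# so 98 is no longer flagged.
print(gap_outliers(class_k + [82, 85, 90, 93]))  # [35]
```

The second call shows the point about the gap: the same score of 98 stops being an outlier once other scores fill in the space below it.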

All right. So to summarize, when describing distributions, always make sure to comment on center, spread, shape, and any outliers, if there are any present. Generally, when you're describing center and spread, oftentimes, it's useful to do some computations, to calculate some numerical summaries, to actually measure the center and spread. The best way to analyze the shape of the distribution is to produce some sort of graph, like the dot plots we saw here.

For outliers, there are ways of doing calculations and looking at the graphs to determine them. But again, it's important to always comment on center, spread, shape, and outliers, if they are present. So this has been your tutorial on Data Analysis. Thanks for watching.

Terms to Know

Center: The "middle" of the data set. There are many measures of center.
Data Analysis: The understanding of the key features of a set of data - shape, center, spread, and outliers.
Outliers: Points in a data set that are so high or so low as to be unusual, given the rest of the values.
Shape: The qualitative description of the clustering of data points in a certain location when the data are graphed.
Spread: The numerical description of how close the numbers are to the center.