Data Analysis

Author: Jonathan Osters

Source: Images of graphs created by Jonathan Osters

Video Chapters

( 00:00 - 00:43 ) Components of Data Analysis

( 00:44 - 01:31 ) Shape

( 01:32 - 02:07 ) Center

( 02:08 - 02:44 ) Spread

( 02:45 - 03:33 ) Outliers

( 03:34 - 04:20 ) Recap

Video Transcription

This tutorial is going to teach you about data analysis. Data analysis is what we do once we've collected our data: we take a look at it and see if we can identify trends or key features.

Now, the key features that we're looking for come in four components. We're going to identify the shape of the distribution, the center of that distribution, the spread, which can be measured a couple of different ways, and any outliers. We'll go through all of these in greater detail.

Shape first. Shape is sort of a qualitative notion. It tells us where most of the points lie in the distribution. So, for instance, with this shape you would say that most of the data points are up here in this hump, and there aren't a whole lot of data points over here in what we'd call the tail. Or it could look sort of the opposite way. This distribution is called skewed to the right: it has a hump on the left and a tail on the right. This distribution is called skewed to the left: it has a tail on the left and a hump on the right.
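To make the hump and the tail concrete, here is a minimal sketch that draws a text histogram. The data values are hypothetical and chosen only for illustration; they do not come from the graphs in the video.

```python
from collections import Counter

# Hypothetical data (not from the tutorial): most values are small,
# with a few large values stretching out to the right.
values = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 7, 9]

counts = Counter(values)

# One row per integer value; the number of stars is how often it occurs.
rows = [f"{v:>2} | {'*' * counts[v]}" for v in range(min(values), max(values) + 1)]
print("\n".join(rows))
```

In the printed rows, the long runs of stars at the small values form the hump, and the sparse rows at the large values form the tail, so this made-up data set would be described as skewed to the right.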

Center is essentially what it sounds like. It's wherever the middle is. However, there are a couple of different ways to measure center. If you look at this arrow, it's right below the peak. If you look at this one, it's sort of a little further off to the right, but it looks like if you drew a line here, about half the area would be to the left of it and about half the area would be to the right of it. And then there's yet another one over here. Which one is the correct measure of center? Well, they're all different measures, and they're all correct for different situations.
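The transcript does not name the three arrows, but they plausibly correspond to the mode (below the peak), the median (half the area on each side), and the mean. As a rough sketch under that assumption, the three measures can be compared on a small hypothetical data set using Python's standard `statistics` module:

```python
from statistics import mean, median, mode

# Hypothetical right-skewed data set (not from the tutorial).
values = [2, 3, 3, 3, 4, 5, 6, 8, 11, 15]

center_mode = mode(values)      # most common value; sits under the peak
center_median = median(values)  # half the values fall on each side
center_mean = mean(values)      # arithmetic average; pulled toward the tail

print(center_mode, center_median, center_mean)
```

For this made-up data the three measures land in three different places, which matches the picture of three separate arrows on one distribution.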

And then spread gives a numerical value describing how spread out the data points are. And again, there are several different measures. For instance, maybe I'm just interested in where most of the data points lie, which would be here below the hump. Or maybe I'm interested in the full range of data points, from the lowest all the way to the highest. And that would be a different measurement.
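The two ideas of spread described here can be sketched numerically. The transcript does not name specific measures, so this example assumes the range for "lowest to highest" and the interquartile range (the width of the middle half of the data) for "where most of the data points lie". The data set is hypothetical.

```python
from statistics import quantiles

# Hypothetical data set (not from the tutorial).
values = [2, 3, 3, 3, 4, 5, 6, 8, 11, 15]

# Full range: distance from the lowest value to the highest.
full_range = max(values) - min(values)

# Interquartile range: distance between the first and third quartiles.
# quantiles(..., n=4) returns the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

print(full_range, iqr)
```

The interquartile range ignores the extreme values, so it comes out smaller than the full range, which is why the two count as different measurements of the same data.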

And finally, outliers are important to look for. They are data points so high or low that they would be considered unusual. Now, they're not just the highest or lowest numbers; they're really far above the next highest number in the data set or really far below the next lowest number in the data set.

So for instance, suppose that a small class took an exam, and the scores were as follows: 90, 98-- so some people were doing really well. Almost everyone scored in the 80s or the 90s, except for this one person with the 46. That 46 would be considered an outlier. It's so much lower than the rest of the pack.
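The tutorial does not give a formal rule for deciding what counts as "so much lower". One common convention, used here as an assumption rather than as the tutorial's method, flags any value more than 1.5 interquartile ranges outside the quartiles. The full list of exam scores is not in the transcript, so the scores below are hypothetical apart from the 46, 90, and 98 that are mentioned.

```python
from statistics import quantiles

# Hypothetical exam scores: only 46, 90, and 98 appear in the transcript.
scores = [46, 82, 85, 87, 88, 90, 91, 93, 95, 98]

# Quartiles and interquartile range of the scores.
q1, _, q3 = quantiles(scores, n=4)
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in scores if s < lower_fence or s > upper_fence]
print(outliers)
```

With these made-up scores, the 46 is the only value flagged. The 98 is the highest score but is not flagged, because being the highest or lowest number is not enough on its own.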

And so, to recap, data analysis consists of clearly describing the four key elements: shape, center, spread, and, if there are any, outliers. There are some standard descriptions that we can use for shape, like skewed to the left and skewed to the right. There are several different measures for center and spread, and those are typically numbers. And finally, outliers are values so far above or below the rest of the data set that they would be considered unusual, and we should at least mention what they are. Good luck. And we'll see you next time.

Terms to Know

Center: The "middle" of the data set. There are many measures of center.
Data Analysis: The understanding of the key features of a set of data - shape, center, spread, and outliers.
Outliers: Points in a data set that are so high or so low as to be unusual, given the rest of the values.
Shape: The qualitative description of the clustering of data points in a certain location when the data are graphed.
Spread: The numerical description of how close the numbers are to the center.