Measures of Central Tendency (Or Why You Shouldn’t Trust Averages)

Today I want to cover what are called measures of central tendency. The measure used most often is average. In my hobbyist opinion, average is abused and misused too often, leading to a misunderstanding of what it really means.


Averages

Average is one of the worst facts you can offer to somebody. The feeling I get when somebody offers a pointless average is the same feeling I get when I ask for Dr. Pepper and I’m offered a Mr. Pibb.

Average is calculated by adding up all of the values of a dataset and then dividing that sum by the number of values. So the average of {0,3,7} is 10 divided by 3, or about 3.33.
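For anyone who would rather see that as code, here is a minimal Python sketch of the same arithmetic. The variable names are mine; `statistics.mean` comes from Python's standard library.

```python
from statistics import mean

values = [0, 3, 7]

# Add everything up, then divide by how many values there are.
by_hand = sum(values) / len(values)

print(round(by_hand, 2))       # 3.33
print(round(mean(values), 2))  # same answer from the standard library
```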

Not all averages are bad. In fact, average is often used scientifically. However, my issues with averages can be summed up in two facts.

  1. Median is more robust.
  2. Average does not mean what you think it means.

To illustrate point one, I’ll let /u/conmanau explain.

The two data sets in the comment are nearly identical. Changing one value, albeit a little dramatically, changes the mean by a lot. While average and median have different uses, it’s important to remember how susceptible averages are to outliers.

“But wait! The comment says that the mean is a useful measure!” I am not arguing against averages, just against how they are used. The end of the comment brings up an important point. If you can’t see the data set, be wary. Oftentimes you

On to point 2. Too many people think that average is “typical” or the halfway point between the lowest values and the highest. The measures of central tendency that I list here are all attempts to describe what is typical; in really large, as well as very small, data sets, typical can mean three very different things depending on how you look at it. As for the halfway point, the value that splits the sorted data in half is the median. (The point halfway between the lowest and highest values is called the midrange, and it is neither the average nor the median.)

One of the most blatant misuses of average that I can easily point out is found in a previous blog post of mine.

According to the National Association of Colleges and Employers, the average starting salary for a college graduate in 2015 was $50,651. I’m not a fan of averages but finding median salary for college graduates is tough.

So what NACE is saying is, on average, the recent college graduate starts out with a salary of $50,651.

Does that mean that university seniors should expect that much when they graduate in a month? No.

Why? Pretend the table below represents starting salaries for new graduates (in thousands). If each data set had its own news story, which one do you think would receive more views?

Data Set             Median  Average
{40,45,50,55,60}     50      50
{40,45,50,55,120}    50      62

The average of the second data set is 24% higher than that of the first, even though only one value changed and the median did not move at all. However, the 120 is more likely an outlier than it is a reliable result.
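The two rows of that table can be checked with a few lines of Python. This is just a sketch using the pretend salaries above (in thousands); the names `typical` and `with_outlier` are labels I made up for the two data sets.

```python
from statistics import mean, median

typical = [40, 45, 50, 55, 60]
with_outlier = [40, 45, 50, 55, 120]  # only the last value differs

for name, data in (("typical", typical), ("with outlier", with_outlier)):
    print(name, "median:", median(data), "mean:", mean(data))

# How much higher is the second mean, as a percentage of the first?
increase = (mean(with_outlier) - mean(typical)) / mean(typical) * 100
print(f"mean rises by {increase:.0f}%")  # mean rises by 24%
```

The median stays at 50 in both cases, which is what "more robust" means in practice.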

Saying that the average income of a recent graduate is $50,000 is an empty fact. If you’re trying to present what the “typical” graduate is likely to receive, you’re better off discussing median.


Median

Enough beating up on average. Average is flaky but what can we do about it?

Your first response is probably median. Which isn’t a bad idea. It isn’t perfect, but it solves the problem of robustness.

Honestly, the best thing to do is probably to show average and median. If they differ by a large percent, something is wonky.

Median is found by sorting all of the numbers in ascending order (smallest to largest) and then finding the middle-most value. If the data set has an even number of values, average the two in the middle. The median is the 50% mark of the data set. Half of the data is above the median. The other half is below.
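Here is a rough Python sketch of that recipe, covering both the odd and the even case. The function name `median_by_hand` is my own; in real code you would probably just call `statistics.median`, which does the same thing.

```python
def median_by_hand(values):
    """Sort the values, then pick the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: there is exactly one middle value.
        return ordered[mid]
    # Even count: average the two values in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_hand([7, 0, 3]))      # 3
print(median_by_hand([7, 0, 3, 10]))  # 5.0
```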

However, if you read this essay, you’ll find that median has its own problems. The author, Stephen Jay Gould, writes that when he learned he had cancer he found out that the median patient would live another eight months, which on its face means he had a 50% chance of being dead in eight months. Before he resigned himself to his new reality, he noticed that the distribution in the graph was skewed to the right.

Basically, he was looking at a data set similar to data set A in the image to the right. Click the image on the right to learn more about what skewed data means.

“The graph was skewed to the right?” From the essay:

In a symmetrical distribution, the profile of variation to the left of the central tendency is a mirror image of variation to the right. In skewed distributions, variation to one side of the central tendency is more stretched out – left skewed if extended to the left, right skewed if stretched out to the right.

What does this mean? Well, half of the people with this diagnosis would live past eight months. And because the distribution was skewed to the right, the average was higher than the median: the people in that upper half could live far longer than eight months, some of them for many years.

Why does this happen? As the essay explains, the absolute minimum for survival time was zero; as in, they discovered the cancer during the autopsy. The upper end, however, stretches out for many years.
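To see how a floor at zero and a long right tail pull the average above the median, here is a small Python sketch. The survival times are numbers I invented purely for illustration; they are not from Gould's essay or from any real study.

```python
from statistics import mean, median

# Hypothetical survival times in months (made up for illustration only).
# Nothing can go below zero, but a few values stretch far to the right.
months = [1, 2, 3, 4, 6, 8, 8, 10, 14, 30, 60, 120]

print("median:", median(months))        # median: 8.0
print("mean:", round(mean(months), 1))  # mean: 22.2

# Half of the values sit at or above the median, and some are far above it.
upper_half = [m for m in months if m >= median(months)]
print("longest in upper half:", max(upper_half))  # 120
```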

Stephen Gould lived another 20 years after his diagnosis and the cause of death wasn’t even the mesothelioma that he was diagnosed with.

The point isn’t that median is superior to average. The point is that neither is perfect. While one is a better representation of data in some cases, I still think it is important to list both and explain why they differ.


Mode

There is also mode. This one is simple and not often used but it does have its place. Somewhere. Usually you would check the mode to see where your data piles up and whether the distribution is skewed.

Mode is the value that occurs most often. So the data represented to the right (found here) would have a mode of 8, along with being slightly skewed to the left.
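Finding the mode is just counting. The sketch below uses a small data set I made up so that 8 comes out on top; it is not the data from the linked plot. `Counter` and `multimode` are both in Python's standard library (`multimode` needs Python 3.8 or newer).

```python
from collections import Counter
from statistics import multimode

values = [5, 6, 7, 7, 8, 8, 8, 9]  # made-up data, not the plotted data set

# Count how often each value appears, then keep the most frequent one(s).
counts = Counter(values)
highest = max(counts.values())
modes = [value for value, count in counts.items() if count == highest]

print(modes)              # [8]
print(multimode(values))  # [8]
```

Both approaches return a list because a data set can have more than one mode when there is a tie.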


Trimmed Mean

Then there’s this guy. Not used too often, it’s the average of the data set after dropping the top 5% and the bottom 5% of values (5% is a common cutoff, but others are used). It’s a way to get an average while also accounting for outliers. It presents the same problems as standard averages do, just on a smaller scale.
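A trimmed mean is easy to sketch in Python. The `trimmed_mean` function and the sample data below are my own illustration, and I am assuming the simplest rule for how many values to drop (round down); statistics packages may handle that detail differently.

```python
from statistics import mean


def trimmed_mean(values, proportion=0.05):
    """Average the values after dropping `proportion` of them from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be at least 0 and less than 0.5")
    ordered = sorted(values)
    # Number of values to drop from each end, rounded down (an assumption).
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


# Made-up data: 19 ordinary values (41 through 59) plus one huge outlier.
data = list(range(41, 60)) + [500]

print(mean(data))          # 72.5
print(trimmed_mean(data))  # 50.5
```

With 20 values, a 5% trim drops exactly one value from each end, so the 500 disappears along with the 41. With fewer than 20 values, this version trims nothing at all.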


Conclusion

Okay, so what’s the point?

  1. You can tell a lot about a data set by knowing its distribution.
  2. Don’t trust averages, or medians, or modes on their own.
  3. Consider multiple measures of central tendency or the data set itself.
  4. If a study (or source, or news story, or Facebook friend) gives you one measure without the distribution, they have an agenda.

A simple guide for when to use what measure is below, pulled from here. To understand what the different variables mean, read here. This table is a one-stop shop for the best measure. As I just explained, you can’t rely on one simple method, but you can start there. If you see a news story that uses “average age” then you should know from the following table that the story is fishy. Age in years is technically a ratio variable, but age data is often skewed (and it becomes ordinal when reported in brackets like 18-24), so either way it should be represented with a median.

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
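If it helps, the table can be written as a tiny Python lookup. This is only a sketch of the rule of thumb above; the function name `best_measure` and its arguments are my own invention, and deciding whether your data counts as skewed is still up to you.

```python
def best_measure(variable_type, skewed=False):
    """Return the measure the table above suggests as a starting point."""
    kind = variable_type.strip().lower()
    if kind == "nominal":
        return "mode"
    if kind == "ordinal":
        return "median"
    if kind in ("interval", "ratio"):
        # Skewed interval/ratio data is better summarized by the median.
        return "median" if skewed else "mean"
    raise ValueError(f"unknown variable type: {variable_type!r}")


print(best_measure("nominal"))             # mode
print(best_measure("ratio"))               # mean
print(best_measure("ratio", skewed=True))  # median
```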

If you want to hear from experts, ask the people over at SurveyMonkey. They offer another insight into median versus mean. Or you can ask the USDA, which offers a skewed graph and discusses these differences.

I tried to keep this entire post as non-technical as possible. Many of the links go deeper into this subject than I wanted to cover. If you have questions or need something clarified, feel free to ask.
