Correlation versus Causation

I really, really wish this wasn’t necessary. If you have ever argued on social media, somebody has told you that correlation does not imply causation. Yet it seems like people think it only applies to the other side of a debate and never their side.

So I’m going to give you examples of why a simple correlation is not enough. Then I’ll try to explain what you need in order to show causation.

Correlation Does Not Imply Causation

Below is a chart I created with data from the Treasury Department and NASA so you know you can trust the data quality. It shows the correlation between the US government debt and Earth’s global temperature anomaly (in degrees Celsius). Since the independent variable is the anomaly and the dependent variable is the US debt, clearly you can see that as global temperatures rise, so does the US debt.

So if you want to trim down the debt because you don’t understand it, maybe tackling climate change will help.
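If you want to see why two rising lines will almost always "correlate," here is a rough sketch. The numbers are made up (this is not the Treasury or NASA data) and the names `debt` and `temp` are just labels I picked. Both series drift upward over 50 years for completely unrelated reasons. The levels line up nicely, but the year-over-year changes should show little or no relationship.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def yearly_changes(series):
    """Difference between each value and the one before it."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(42)
years = range(50)

# Simulated, hypothetical series: each is a straight upward trend plus
# its own independent noise. Neither one influences the other.
debt = [1.0 * t + rng.gauss(0, 2.0) for t in years]
temp = [0.02 * t + rng.gauss(0, 0.05) for t in years]

levels_r = pearson(debt, temp)
changes_r = pearson(yearly_changes(debt), yearly_changes(temp))

print(f"correlation of levels:  {levels_r:.2f}")
print(f"correlation of changes: {changes_r:.2f}")
```

The shared upward trend does all the work. Once you look at changes instead of levels, the trend is gone and the "relationship" mostly goes with it.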


Another glaring issue is the age of Miss America by year. This is a major cause for concern.

Data from Wikipedia and CDC, graph from Tyler Vigen (click to see more graphs like this)

Other than those pesky outliers, you can see a very clear connection between the age of Miss America and murders by steam, hot vapours and hot objects. We should set a cap on the age for entrants so that we can keep these kinds of deaths low.

Never mind that the left axis counts up in steps of 1.25 years. Let’s ignore that the chart tops out at 8 murders by this method (which, while sad, represent an extremely small portion of all murders). We see a correlation!

The real cause of global warming:

Graph from Wikipedia article on the Flying Spaghetti Monster

More pirates means a lower average temperature! Other than the out-of-order, not-to-scale bottom axis, of course.

Finally, the best way to decrease highway fatalities is by importing more lemons from Mexico.

From the Journal of Chemical Information and Modeling

Want to cook up some correlations? Google has you covered! There’s even a semi-active subreddit for this.


Okay, fine, how do we know if there’s a relationship in the data?

You can’t, really. You need to start with the scientific method.

We know that more carbon in the atmosphere leads to higher temperatures. We also know that humans have drastically increased their carbon output over the last two centuries. Are humans to blame for the increase in global temperatures? This is what most of the debate over climate change is about: causality. We can’t exactly put a sensor in a piece of carbon and see if it contributes to temperature increases.

The answer lies somewhere in a well-designed randomized experiment. At its most basic level, this is an experiment where you have a control group and an experimental group. The control group is the placebo, or unaffected, or “normal” group. The experimental group is the one being studied or tested. Ideally, neither group knows they’re in that group (if they’re humans).

In fact, when setting up the experiment, it’s best to have a double-blind procedure in place. What this means is even the scientists and experimenters don’t know who is in what group.
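To make the control group, experimental group and blinding idea concrete, here is a minimal sketch. The function names `assign_groups` and `blind` and the participant names are my own inventions for illustration, not a standard procedure. The idea: shuffle people into two groups at random, then hand the experimenters only opaque codes while the code-to-group key stays sealed until the study ends.

```python
import random

def assign_groups(participants, seed=None):
    """Randomly split participants into control and experimental halves."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "experimental": shuffled[half:]}

def blind(groups, seed=None):
    """Replace group membership with opaque codes.

    Returns (labels, sealed_key):
      labels      maps each participant to a code such as "P003"
      sealed_key  maps each code to its group; kept away from the
                  experimenters until the data is collected
    """
    rng = random.Random(seed)
    everyone = [(person, group)
                for group, members in groups.items()
                for person in members]
    # Shuffle again so the code numbers do not reveal the group.
    rng.shuffle(everyone)
    labels, sealed_key = {}, {}
    for i, (person, group) in enumerate(everyone):
        code = f"P{i:03d}"
        labels[person] = code
        sealed_key[code] = group
    return labels, sealed_key

# Hypothetical participants, purely for illustration.
participants = [f"person_{n}" for n in range(10)]
groups = assign_groups(participants, seed=1)
labels, sealed_key = blind(groups, seed=2)

print(labels)  # what the experimenters see: codes only
```

The random shuffle is the important part. If people (or the experimenter) choose who goes in which group, whatever drove that choice can end up explaining the result.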

A quick note: The above explanation is very surface-level. It assumes we’re talking about people and the way they interact with the world, but the same idea applies to nearly all experiments. For example, with climate change you can choose ten regions across the world. Five are in a control group of forests, oceans, deserts and other places with few human inhabitants. The experimental group of five regions can be cities and trade routes. You measure the temperature and the amount of carbon in a double-blind study and you’ve got yourself a scientific study (probably not a good one, I’m not a scientist).

Another note: I don’t like surveys or polls. Self-reported studies are the worst. But I’ll get into that another time.

I should clarify something. I’m not saying you must have a randomized double-blind experiment before you come to a conclusion. Most people will never need to go through one of these in their lives.

What I am saying is that if somebody claims that millennials are lazier than their parents, simple observation isn’t going to cut it. (Unless they fail to be outraged, in which case I’ll be convinced.) It’s important to know when a basic correlation isn’t enough. If somebody is making claims about a demographic or health or any other factor that varies greatly from person to person, it’s best to have one of these scientific studies behind the claim. If you’re trying to learn about a team of ten salespeople, you can skate by.

In my job, I spend a lot of time looking for causation. It doesn’t need to be scientific, but I do try to offer the average and the median, as well as a graph. And if you are going to be using correlation to find causation, always consider extraneous or lurking variables.
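Here is a rough sketch of what a lurking variable does to a correlation. Everything in it is simulated and the names (`lurking`, `x`, `y`) are placeholders I made up. Two measurements never touch each other, but both depend on a hidden third factor. The raw correlation looks strong. Once you account for the hidden factor with a partial correlation, it should mostly vanish.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5

rng = random.Random(7)

# Simulated data: x and y each depend on the lurking variable plus
# their own independent noise. x never affects y, and y never affects x.
lurking = [rng.gauss(0, 1) for _ in range(500)]
x = [z + rng.gauss(0, 0.5) for z in lurking]
y = [z + rng.gauss(0, 0.5) for z in lurking]

raw_r = pearson(x, y)
adjusted_r = partial_corr(x, y, lurking)

print(f"raw correlation of x and y:      {raw_r:.2f}")
print(f"after controlling for lurking:   {adjusted_r:.2f}")
```

The catch in real life is that you have to think of the lurking variable and measure it before you can control for it, which is exactly why randomized experiments are so useful.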

The next time somebody tries to draw a shady causation, I’m going to send them here. Be wary of tricky graphs, unrelated data sets (that lack an explanation for the connection), and things of that nature. Actually, just be wary of all graphs until you can check that the data was collected and presented in an unbiased way. Again, I’m not saying you can’t trust anything. I’m saying don’t take anything as gospel or fact just because it fits your agenda.

Hopefully you can have this conversation from XKCD now.


