# Lies, Damned Lies, and Statistics (17): The Correlation-Causation Problem and Omitted Variable Bias, aka “Jumping to Conclusions”


Here is some more detailed information following my casual remark on the correlation-causation problem. Here’s a fictitious example of what is meant by “Omitted Variable Bias”, a type of statistical bias that illustrates this problem. Suppose we see from Department of Defense data that male U.S. soldiers are more likely to be killed in action than female soldiers. Or, more precisely and in order to avoid another statistical error, that the percentage of male soldiers killed in action is larger than the percentage of female soldiers killed in action. So there is a correlation between the gender of soldiers and the likelihood of being killed in action.

One could, and one often does, conclude from such a finding that there is a causation of some kind: the gender of soldiers affects the chances of being killed in action. Again more precisely: one could conclude that some aspect of gender, e.g. a male propensity for risk taking, leads to higher mortality.

However, it’s here that the Omitted Variable Bias pops up. The real cause of the discrepancy between male and female combat mortality may not be gender or a gender-related thing, but a third element, an “omitted variable” which doesn’t show up in the correlation. In our fictional example, it may be the type of deployment: it may be that male soldiers are more commonly deployed in dangerous combat operations, whereas female soldiers may be more active in support operations away from the front line.
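To make the fictional example concrete, here is a minimal sketch in Python. All counts are invented for illustration and are not real Department of Defense figures. The table is built so that combat and support deployments each have the same death rate for men and women, while men are assigned to combat far more often.

```python
# Invented counts, for illustration only:
# (gender, deployment) -> (number of soldiers, number killed in action)
counts = {
    ("male", "combat"): (8000, 400),
    ("male", "support"): (2000, 10),
    ("female", "combat"): (2000, 100),
    ("female", "support"): (8000, 40),
}


def rate(gender, deployment=None):
    """Share of soldiers killed, optionally restricted to one deployment type."""
    cells = [
        value
        for (g, d), value in counts.items()
        if g == gender and (deployment is None or d == deployment)
    ]
    soldiers = sum(n for n, _ in cells)
    killed = sum(k for _, k in cells)
    return killed / soldiers


for gender in ("male", "female"):
    print(
        gender,
        "overall:", rate(gender),
        "combat:", rate(gender, "combat"),
        "support:", rate(gender, "support"),
    )
```

In this made-up table the overall rate is about three times higher for men, yet within each deployment type the rates are identical. The gap comes entirely from the omitted variable, the type of deployment.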


OK, time for a real example. It has to do with home-schooling. In the U.S., many parents decide to keep their children away from school and teach them at home. They do so for different reasons: ideological ones, reasons that have to do with their children’s special needs, etc. The reasons are not important here. What is important is that many people think that home-schooled children are somehow less well educated (parents, after all, aren’t trained teachers). Proponents of home-schooling, however, point to a study that found that these children score above average in tests. But this is a correlation, not necessarily a causal link. It doesn’t prove that home-schooling is superior to traditional schooling. Parents who teach their children at home are, by definition, heavily involved in their children’s education. The children of such parents do above average in normal schooling as well. The omitted variable here is parents’ involvement. It’s not the fact that the children are schooled at home that explains their above average scores. It’s the type of parents. Instead of comparing home-schooled children to all other children, one should compare them to children from similar families in the traditional system.
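The difference between the two comparisons can be sketched with a small simulation. Everything in it is an assumption made for illustration: the parameter names, the shares, the score scale, and above all the premise that schooling type has no effect at all on scores. It says nothing about the real study, only about how the comparison group changes the answer.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# Invented parameters for a toy model.
N = 20_000
P_INVOLVED = 0.3          # share of families with heavily involved parents
P_HOME_IF_INVOLVED = 0.2  # share of those families that home-school
INVOLVEMENT_BONUS = 10.0  # score points attributed to parental involvement
# Assumption: schooling type itself adds nothing to the score.

children = []
for _ in range(N):
    involved = random.random() < P_INVOLVED
    # Only involved parents home-school in this toy model.
    home = involved and random.random() < P_HOME_IF_INVOLVED
    score = 50.0 + INVOLVEMENT_BONUS * involved + random.gauss(0, 10)
    children.append((involved, home, score))


def mean_score(rows):
    """Average test score of a group of (involved, home, score) rows."""
    return sum(score for _, _, score in rows) / len(rows)


home_schooled = [c for c in children if c[1]]
all_others = [c for c in children if not c[1]]
similar_families = [c for c in children if c[0] and not c[1]]

# Naive comparison: home-schooled children versus everyone else.
naive_gap = mean_score(home_schooled) - mean_score(all_others)
# Matched comparison: home-schooled versus involved families in regular school.
matched_gap = mean_score(home_schooled) - mean_score(similar_families)

print("naive gap:", round(naive_gap, 2))
print("matched gap:", round(matched_gap, 2))
```

Under these assumptions the naive gap is large even though home-schooling does nothing in the model, while the matched gap stays close to zero apart from sampling noise.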


Greg Mankiw believes he has found another example of Omitted Variable Bias in this graph plotting test scores for U.S. students against their family income:

###### (source, the R-square for each test average/income range chart is about 0.95)

[T]he above graph … show[s] that kids from higher income families get higher average SAT scores. Of course! But so what? This fact tells us nothing about the causal impact of income on test scores. … This graph is a good example of omitted variable bias … The key omitted variable here is parents’ IQ. Smart parents make more money and pass those good genes on to their offspring. Suppose we were to graph average SAT scores by the number of bathrooms a student has in his or her family home. That curve would also likely slope upward. (After all, people with more money buy larger homes with more bathrooms.) But it would be a mistake to conclude that installing an extra toilet raises your kids’ SAT scores. … It would be interesting to see the above graph reproduced for adopted children only. I bet that the curve would be a lot flatter. Greg Mankiw (source)

In other words, Mankiw bets that adopted children, who usually don’t receive their genes from their new families, have roughly equal test scores whether they have been adopted by rich or by poor families. If that were so, it would in turn mean that the wealth of the family in which you are raised doesn’t influence your education level, test scores or intelligence.

However, in his typical hurry to discard all possible negative effects of poverty, Mankiw may have gone a bit too fast. While it’s not impossible that the correlation is fully explained by differences in parental IQ, other evidence points elsewhere. I’m always suspicious of theories that take one cause, exclude every other type of explanation and end up with a fully deterministic system, especially if the one cause that is selected is DNA. Life is more complex than that. Regarding this particular matter, take a look back at this post, which shows that education levels are to some extent determined by parental income. University enrollment is determined both by test scores and by parental income, even to the extent that people from high-income families with average test scores are slightly more likely to enroll in university than people from poor families with high test scores.
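A toy regression shows why finding an omitted variable does not by itself reduce the effect of the included variable to zero. In the sketch below, test scores depend on both family income and parental IQ, and IQ also raises income. Every coefficient and variable name is invented, and the choice of a nonzero income effect is an assumption made purely for illustration, not an estimate of anything real.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equally long lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def slope(y, x):
    """OLS slope from regressing y on x (with an intercept)."""
    return cov(x, y) / cov(x, x)


# Hypothetical data-generating process; all numbers are invented.
N = 50_000
BETA_INCOME = 2.0  # assumed direct effect of income on the score
BETA_IQ = 3.0      # assumed effect of parental IQ on the score

iq = [random.gauss(0, 1) for _ in range(N)]
income = [z + random.gauss(0, 1) for z in iq]  # higher IQ, higher income
score = [
    BETA_INCOME * x + BETA_IQ * z + random.gauss(0, 1)
    for x, z in zip(income, iq)
]

# Regression of score on income alone: parental IQ is the omitted variable.
naive = slope(score, income)

# Controlling for IQ: remove from income the part predicted by IQ, then
# regress the score on what is left (the Frisch-Waugh-Lovell approach).
b = slope(income, iq)
income_resid = [x - b * z for x, z in zip(income, iq)]
controlled = slope(score, income_resid)

print("income slope, IQ omitted:", round(naive, 2))
print("income slope, IQ controlled:", round(controlled, 2))
```

With these made-up coefficients the naive slope should land near 3.5 and the controlled slope near 2.0. Omitting IQ inflates the apparent effect of income, but controlling for it leaves a smaller effect that is still there. Whether the real effect is zero, as Mankiw bets, or positive is an empirical question that the graph alone cannot settle.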

What Mankiw did, in trying to avoid the Omitted Variable Bias, was in fact to commit another type of bias, one which we could call the Singular Variable Bias: assuming that a phenomenon has a singular cause. In honor of Professor Mankiw (who does some good work, see here for example), I propose that henceforth we call it the Mankiw Bias.