This isn’t the first time I’ve mentioned sample sizes as a common problem in statistics. Usually, the problem is one of survey design: samples of respondents that are too small produce unreliable survey results.
However, the same error – or fraud, when the error is willful – can occur in data interpretation. Take a look at this graph by John Taylor:
Taylor’s conclusion: The data on spending shares show that the most effective way to reduce unemployment is to raise investment as a share of GDP. But why begin the scatter plot in 1990? There’s no good reason. In fact, most folks typically download the entire history of available macro data. … The chart below goes back to 1948:
This is a form of cherry-picking data that allows you to “prove” a strong correlation where there’s actually none at all. You can find a correlation in almost any data set this way, as long as the subsample you pick is small enough. In this example, you can only limit the selection to the last two decades if you have a good argument for why the economy is different now compared to some decades ago, and why there’s a correlation now when there wasn’t before. However, that argument – which would be interesting – seems to be lacking. And if it’s lacking, there’s no excuse for cherry-picking the last two decades.
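To make the point concrete, here is a minimal sketch in Python. It does not use Taylor's data: the two series are pure random noise, and the names `investment` and `unemployment` are just labels I picked for illustration. The script searches every window of at least ten points for the strongest correlation and compares it with the correlation over the whole series.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def strongest_window(xs, ys, min_width):
    """Return (start, end, r) for the contiguous window with the largest |r|.

    The full series is one of the candidates, so the result can never
    look weaker than the honest, full-sample correlation.
    """
    best = (0, len(xs), pearson(xs, ys))
    for width in range(min_width, len(xs) + 1):
        for start in range(0, len(xs) - width + 1):
            r = pearson(xs[start:start + width], ys[start:start + width])
            if abs(r) > abs(best[2]):
                best = (start, start + width, r)
    return best


# Two unrelated series of random noise (NOT real macro data).
rng = random.Random(0)
investment = [rng.gauss(0, 1) for _ in range(60)]
unemployment = [rng.gauss(0, 1) for _ in range(60)]

full_r = pearson(investment, unemployment)
start, end, window_r = strongest_window(investment, unemployment, min_width=10)

print(f"full series:       r = {full_r:+.2f}")
print(f"cherry-picked {start}..{end}: r = {window_r:+.2f}")
```

Because the search is free to choose its own start and end points, the cherry-picked window will typically show a much stronger correlation than the full series, even though the two series are unrelated by construction.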
This is not about how important race actually is in the minds and behavior of many, but about how important it should be. On a more serious note, here are some data on how people really think about race.
And in case you think this is just the occasional slip of the mind:
Update: yet another statistics sin by Fox:
In case you don’t immediately notice what’s wrong here (and of course Fox News did everything to make it difficult for you to notice), here’s an annotated version:
An honest version of the graph would look like this:
I think it’s impossible that this is simply a mistake. I mean, how on earth could one make a mistake like this? Do they draw their graphs by hand?
More statistical jokes.
The answer is here.
Researcher Hans Rosling, the great statistics visualizer, uses his cool data tools to show how countries are pulling themselves out of poverty. He also mentions human rights in general (not just the right not to suffer poverty), fertility rates, life expectancy, child mortality, good governance and other stuff this blog cares about. Watch:
Sometimes, when you want to compare two time series that are far apart from each other in terms of numbers – such as, for example, the yearly average number of inhabitants of NY and their yearly average height (the former being in the millions, the latter in the single digits) – you have to plot one series on the left y-axis and the other on the right y-axis, each with a different scale. If you put both on the same y-axis (usually the left), you have to use the same scale for both. In my example, the line for the average height would then be a flat line at the bottom of the graph, coinciding more or less with the x-axis, because the numbers are too small compared with the numbers for population. If you put them on two different y-axes, you’ll be able to compare how they move over time.
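A rough sketch of the arithmetic, with made-up numbers for population and height (none of these figures are real data, and the 400-pixel chart height is an arbitrary assumption): it computes how high each height value would be drawn on a shared axis and on an axis of its own.

```python
def pixel_height(value, axis_min, axis_max, chart_px=400):
    """Vertical position of `value`, in pixels above the bottom of the chart."""
    return (value - axis_min) / (axis_max - axis_min) * chart_px


# Made-up figures for illustration only.
population = [7_900_000, 8_000_000, 8_100_000]  # inhabitants
height_m = [1.69, 1.70, 1.71]                   # average height in metres

# Shared axis: the scale has to run from 0 up to the largest population value.
shared = [pixel_height(h, 0, max(population)) for h in height_m]

# Separate (right-hand) axis: a scale chosen to fit the height series itself.
own = [pixel_height(h, 1.6, 1.8) for h in height_m]

print("shared axis:", [f"{px:.6f} px" for px in shared])
print("own axis:   ", [f"{px:.0f} px" for px in own])
```

On the shared axis, every height value lands a tiny fraction of a pixel above the x-axis. On its own axis, the same series spreads across the chart. Most plotting libraries offer a secondary axis for this purpose, for example `Axes.twinx()` in matplotlib.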
Here’s an example I discussed before. Proponents of the death penalty usually show the following famous graph in order to “prove” that capital punishment results in fewer homicides in the U.S., and is therefore a successful deterrent:
What’s wrong with this graph is that they tried to jam the two series – which are totally different in terms of magnitude – onto one y-axis. To do so, they recalculated the murder series. Rather than giving the numbers as they are, they give the numbers per 66,000 people. Why this strange number, 66,000? Why not the more obvious 100,000, or why not plot the two series on different y-axes? Because now they can give the impression that the recent rise in the number of executions is closely correlated with the recent drop in the number of homicides.
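Here is the arithmetic behind that trick as a small sketch. The murder count and population are hypothetical round numbers, not the figures behind the graph. A rate "per N people" is just count / population × N, so the choice of N slides the whole line up or down without changing its shape.

```python
def rate_per(count, population, base):
    """Number of events per `base` inhabitants."""
    return count / population * base


# Hypothetical round numbers, for illustration only.
murders = 20_000
population = 250_000_000

per_100k = rate_per(murders, population, 100_000)
per_66k = rate_per(murders, population, 66_000)

print(f"per 100,000 people: {per_100k:.2f}")
print(f"per  66,000 people: {per_66k:.2f}")
print(f"ratio: {per_66k / per_100k:.2f}")
```

Every point in the series is multiplied by the same factor of 0.66, so an unusual base like 66,000 adds no information. All it does is change where the murder line sits relative to the other series on the shared axis.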
Now compare this graph to this version, using the same data (but going back a bit further in time) and another graphical presentation:
The important differences:
- the second graph uses two y-axes
- it counts the number of executions per homicide, and not just the total number of executions; from the point of view of deterrence, this is obviously the better measure.
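To see why the second point matters, consider a toy calculation with invented numbers (the labels `year_a` and `year_b` and all counts are hypothetical). The total number of executions doubles, yet the number of executions per 1,000 homicides does not move at all.

```python
def executions_per_1000_homicides(executions, homicides):
    """Executions relative to the number of homicides, per 1,000 homicides."""
    return executions / homicides * 1000


# Invented figures, for illustration only.
year_a = {"executions": 50, "homicides": 10_000}
year_b = {"executions": 100, "homicides": 20_000}

ratio_a = executions_per_1000_homicides(**year_a)
ratio_b = executions_per_1000_homicides(**year_b)

print(f"year A: {year_a['executions']} executions, {ratio_a:.1f} per 1,000 homicides")
print(f"year B: {year_b['executions']} executions, {ratio_b:.1f} per 1,000 homicides")
```

A graph of raw totals would show a steep rise between the two years, while the risk of execution faced by any individual murderer, which is what deterrence is supposed to work through, stays the same.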
We can see from the second graph that the recent upswing in the number of executions is really quite small compared to earlier periods (there was a moratorium on executions in the U.S. in the early 1970s). Unless deterrence has somehow become much more effective than it was in the early parts of the 20th century – which is doubtful given the relatively low numbers of executions and the relatively humane methods – it’s unlikely that such a small increase in the number of executions during the last decades is the cause of the extraordinary decrease in the number of homicides during the same period.
When we look at the whole time series, going back in time long enough; when we use both y-axes; and when we avoid using strange measures such as murders per 66,000 people or executions tout court rather than executions per homicide, then there isn’t a clear correlation between executions and decreasing numbers of murders.
Of course, using a left and right y-axis can also be misleading. I’ll post an example when I come across one.
Another common manipulation of statistics: play a bit with the starting and ending values on the y-axis of your graphs. This can give astonishing results. I prepared a fictional example. Compare the two graphs:
The data are exactly the same, but the y-axis in the second graph starts at 3,500 instead of 0, giving the impression that government violation of freedom of speech in Dystopia rose sharply in 2008 compared to the year before, whereas in reality things are just as awful, more or less, as before.
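The size of the distortion is easy to compute. In the sketch below, the two yearly values are assumptions of mine (the fictional graphs only fix the 3,500 starting point). A bar or line is drawn at a height proportional to value minus axis minimum, so the apparent ratio between two years depends on where the axis starts.

```python
def apparent_ratio(old, new, axis_min):
    """How many times taller `new` is drawn than `old` when the axis starts at `axis_min`."""
    return (new - axis_min) / (old - axis_min)


# Assumed values for the fictional Dystopia example.
violations_2007 = 3_800
violations_2008 = 4_000

honest = apparent_ratio(violations_2007, violations_2008, axis_min=0)
truncated = apparent_ratio(violations_2007, violations_2008, axis_min=3_500)

print(f"axis from 0:     2008 looks {honest:.2f}x as high as 2007")
print(f"axis from 3,500: 2008 looks {truncated:.2f}x as high as 2007")
```

With these assumed values, a real increase of about 5% is drawn as an increase of about 67%.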
E.D. Kain of The League of Ordinary Gentlemen believes he has spotted a real-life example of this kind of manipulation. While it’s not difficult to find such examples, this isn’t one. On the contrary, Kain himself commits the mistake he accuses someone else of making. Let me explain. He points to this graph from Conor Clarke on Andrew Sullivan’s blog:
This graph, illustrating (or not, if you’re Kain) the drop in effective income tax rates for the top 1% of Americans from the Clinton to the Bush years, is used by many to argue that a small increase in taxation for the super-rich wouldn’t mean Armageddon. At first sight, the y-axis does indeed look like it has been manipulated in order to highlight a sharp decline in tax rates for the rich.
Hence, Kain goes to work and “corrects” the chart, making the y-axis start at 0% and end at 100%:
Just goes to show that manipulation can also mean using the apparently “neutral” starting and ending points of 0 and 100. Not only does he remove all useful information from the previous graph; he also assumes that taxes can somehow be close to 0% or 100%. One shouldn’t assume this, since it never happens in reality. Making the graph start at 0 and end at 100 means assuming it can happen, and is therefore disingenuous. An example: suppose I want to show that life expectancy hasn’t risen a lot over the last centuries (which isn’t true). So I include the extreme of 500 years as the end value in my y-axis. Nobody ever lives or will live till he or she is 500. Obviously, the graph will show no visible increase in life expectancy, even if people now live twice as long as a thousand years ago, on average (which is the case).
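The same arithmetic works in the other direction. The sketch below uses assumed round numbers for life expectancy (35 and 70 years, chosen only because one is double the other) and measures how much of the chart's height the change occupies for different axis ranges.

```python
def share_of_axis(low, high, axis_min, axis_max):
    """Fraction of the chart height covered by the move from `low` to `high`."""
    return (high - low) / (axis_max - axis_min)


# Assumed round numbers: life expectancy doubling from 35 to 70 years.
then, now = 35, 70

padded = share_of_axis(then, now, axis_min=0, axis_max=500)
realistic = share_of_axis(then, now, axis_min=0, axis_max=100)

print(f"axis 0-500: the doubling covers {padded:.0%} of the chart height")
print(f"axis 0-100: the doubling covers {realistic:.0%} of the chart height")
```

On the axis padded out to 500 years, a doubling of life expectancy is squeezed into 7% of the chart height, which is how a large change can be made to look like nothing.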
Lesson: the minimum and maximum values on the y-axis should be close to realistic real-life minimums and maximums. In that respect, the Clarke graph is better (although he could have used a longer period, avoiding another error).
Just to show that this type of lie occurs in real life: