This isn’t the first time I’ve mentioned sample sizes as a common problem in statistics. Usually, the problem is one of survey design: a survey with too few respondents produces unreliable results.
However, the same error – or fraud, when the error is willful – can occur in data interpretation. Take a look at this graph by John Taylor:
Taylor’s conclusion: “The data on spending shares show that the most effective way to reduce unemployment is to raise investment as a share of GDP.” But why begin the scatter plot in 1990? There’s no good reason. In fact, most people simply download the entire history of available macro data. … The chart below goes back to 1948:
This is a form of cherry-picking: by restricting the data, you can “prove” a strong correlation where there is actually none at all. Pick a sufficiently small subsample and you will find a correlation in almost any data set. In this example, limiting the selection to the last two decades is only defensible if you have a good argument for why the economy is different now than it was a few decades ago, and why a correlation exists now that didn’t exist before. That argument, which would be interesting in its own right, seems to be missing. And if it’s missing, there’s no excuse for cherry-picking the last two decades.
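To see how easy it is to manufacture a correlation this way, here is a minimal sketch in Python using NumPy. It uses purely synthetic, independent noise rather than the actual investment and unemployment series, and the window length is an arbitrary choice for illustration: over the full sample the two series are essentially uncorrelated, but scanning for the most “convincing” short window turns up a much stronger apparent relationship.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two series of independent noise: by construction there is no
# relationship between them. (Synthetic data, not the real macro series.)
n = 280
x = rng.normal(size=n)
y = rng.normal(size=n)

# Over the full sample, the correlation is close to zero, as expected.
print(f"full-sample correlation: {np.corrcoef(x, y)[0, 1]:+.2f}")

# Now cherry-pick: scan every short window and keep the strongest one.
window = 20
best = max(
    (np.corrcoef(x[i:i + window], y[i:i + window])[0, 1]
     for i in range(n - window + 1)),
    key=abs,
)
print(f"strongest {window}-point window correlation: {best:+.2f}")
```

On a typical run, the full-sample correlation hovers near zero while the best short window shows a correlation that looks far more substantial; with highly persistent series like macro aggregates, the effect is stronger still.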