Daniel Lemire's blog


Statistics is overrated: the rise of data science

With the industrial and scientific revolutions, we saw the rise of enormous bureaucracies collecting reliable numbers. For the first time in history, we could ask about the total production of silver in England and get a meaningful answer.

But data is rarely complete. We most often only have partial views. Thankfully, clever people observed that it is generally unnecessary to do the full computation. From a small representative sample, you can almost always tell what the whole population looks like. To know the average height of Americans, you do not need to collect the height of every single American… a few hundred or a few thousand are enough… as long as they are representative.

So we went from a pre-industrial world where people rarely quantified anything to a world where everything must be accounted for. When we can’t count, we sample and estimate with good margins of error.

So far so good. But what to do with all these numbers? Well, we must do something, anything, and if it sounds impressive and reputable, all the better!
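To make the sampling claim concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the "heights" are synthetic numbers drawn from a made-up normal distribution (mean 170 cm, standard deviation 10 cm), the function name `estimate_mean` is hypothetical, and the margin of error uses the usual normal approximation at roughly the 95% level.

```python
import math
import random
import statistics

def estimate_mean(population, sample_size, rng):
    """Estimate the population mean from a simple random sample.

    Returns (estimate, margin), where margin is an approximate 95%
    margin of error based on the normal approximation.
    """
    sample = rng.sample(population, sample_size)
    estimate = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, 1.96 * standard_error

rng = random.Random(42)
# Synthetic "heights" in cm; the parameters are invented for illustration.
population = [rng.gauss(170.0, 10.0) for _ in range(200_000)]
true_mean = statistics.fmean(population)
estimate, margin = estimate_mean(population, 1_000, rng)
print(f"true mean {true_mean:.2f}, estimate {estimate:.2f} +/- {margin:.2f}")
```

With a thousand values out of two hundred thousand, the margin of error should come out well under one centimeter. The catch is the word "representative": the sketch draws its sample uniformly at random, which is exactly what real surveys struggle to achieve.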

There is a deluge of research papers using fancy statistical tests. Among those are the p-value significance tests. Except that hardly anybody knows what a p-value actually means. And does any of it help anyone? Does steak cause cancer? Who knows? We have all these statistical “proofs” of contradictory results, all based on a glorious statistical analysis. Where is the evidence that it brings us closer to the truth?
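For readers who want the definition spelled out: a p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one you got. The sketch below computes it exactly for a toy case that I made up for illustration (60 heads in 100 flips of a supposedly fair coin); the function name `binomial_p_value` is hypothetical.

```python
import math

def binomial_p_value(heads, flips):
    """Two-sided p-value under the null hypothesis of a fair coin.

    Sums the probability of every outcome at least as far from
    flips / 2 as the observed number of heads.
    """
    observed_gap = abs(heads - flips / 2)
    favorable = 0
    for k in range(flips + 1):
        if abs(k - flips / 2) >= observed_gap:
            # Number of flip sequences with exactly k heads.
            favorable += math.comb(flips, k)
    return favorable / 2 ** flips

p = binomial_p_value(60, 100)
print(f"p-value: {p:.4f}")
```

Note what the number is not: it is not the probability that the coin is fair, and it says nothing about whether the difference matters.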

Even the American Statistical Association says that p-values cannot determine whether a hypothesis is true or whether results are important (Baker, 2016). And the famous statistician Andrew Gelman goes further: the problems are deeper, and the solution is not to reform p-values or to replace them with some other statistical summary or threshold.

Why do people go on? Is it because it brings an air of respectability to the whole process?

Meanwhile, silly computer scientists actually do separate spam from real emails. We really do defeat human beings at chess and Go. We really do figure out whether your credit card purchase was fraudulent or not.

The end-game for computer scientists is to match and surpass the human mind in its ability to process information. The end-game for medical researchers is to keep us all perfectly healthy. What is the end-game for statistics? Do statisticians bring us ever closer to the statistical truth or do they expect us to churn out an ever greater number of p-values each year?

Like librarians and journalists, statisticians are ripe for disruption. There is a new discipline called “data science”. Ironically, it was founded by statisticians in 2001, shortly before the Human Genome Project was completed (2003). If you look around, you will find many young (and not-so-young) people calling themselves data scientists.

They all exploit data, they all make it speak, they all try to bring out value from data. But how many of them do you think majored in statistics in college? Software ate libraries and newspapers, and it is now eating statistics.