How do we know something is true? For some problems in science, it is enough to introspect and use logical or mathematical reasoning. In many areas of science, however, the only way to get to the truth is by doing an experiment. In medicine, we carry out randomised controlled trials on drug efficacy and so on. In psychology, we conduct planned experiments to understand human behaviour, memory and cognitive processes. But interpreting such data requires statistical inference, and this is where experimental science has always struggled, especially in medicine and the humanities.
In 2015, a Science article caused an upheaval in the psychological sciences. A group of researchers attempted to replicate a hundred published studies. They found that about two thirds of the replication attempts could not reproduce the so-called “statistically significant” effects reported in the original studies; the published findings had failed a basic check. Cancer studies have faced similar problems with non-replicable findings – a stark reminder that this replication crisis can have real-world consequences.
What went wrong here? One major problem is what counts as news in science. Researchers are incentivised to publish results showing that X has an effect on Y. This bias towards finding effects is an unintended consequence of a statistical paradigm called null hypothesis significance testing. The idea (a strange amalgam of proposals made by the statisticians Ronald Fisher, Jerzy Neyman and Egon Pearson) is to try to reject a straw-man null hypothesis. This paradigm works just fine when the prior probability that an effect is really there is high. An example of a powerful and easily replicated experimental finding is the Stroop effect: naming the ink colour of the word “green” is harder when the word is printed in red than when it is printed in green.
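To see why the prior plausibility of an effect matters so much, here is a minimal sketch in Python. It is an illustration rather than part of the original argument: the function name and the numbers (prior, power, and the conventional 5% threshold) are assumptions chosen for the example. It computes what share of “statistically significant” results reflect a real effect.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Share of significant results that correspond to a real effect.

    prior: assumed probability that the tested effect really exists
    power: assumed probability of a significant result when it does exist
    alpha: probability of a significant result when it does not exist
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios: a robust, plausible effect versus a subtle,
# implausible one measured with a noisy design.
scenarios = [
    ("plausible effect, well-powered study", 0.8, 0.8),
    ("implausible effect, noisy study", 0.05, 0.2),
]

for label, prior, power in scenarios:
    ppv = positive_predictive_value(prior, power)
    print(f"{label}: {ppv:.0%} of significant results are real")
```

Under these assumed numbers, nearly all significant results in the first scenario are real, while in the second most of them are false alarms, even though the same 5% threshold is used in both.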
But when the effect being measured is very subtle, highly variable or just plain absurd, noise can often look like a signal. An example is the claim by J.A. Bargh and others in 1996 that exposing people to words related to old age makes them walk more slowly. For such experiments, which depend on inherently noisy measurements, the estimated effects will tend to fluctuate wildly from one study to the next. Researchers, conditioned to farm data for large effects that can be published in major journals, will tend to selectively report these overly large effects. And these effects may never replicate. Social psychology is littered with non-replicable findings such as the Bargh study.
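The way selective reporting inflates noisy effects can be shown with a small simulation. The sketch below is only an illustration under assumed values: a tiny true effect (0.1), noisy measurements (standard deviation 1), 20 participants per study, and a rough rule that counts a result as significant when the absolute t-value exceeds 2. None of these numbers comes from the Bargh study.

```python
import random
import statistics


def simulate_study(true_effect, sd, n, rng):
    """Simulate one experiment; return (estimated effect, significant?)."""
    sample = [rng.gauss(true_effect, sd) for _ in range(n)]
    estimate = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / n ** 0.5
    # Rough approximation of a two-sided test at the 5% level: |t| > 2
    return estimate, abs(estimate / standard_error) > 2.0


def exaggeration(true_effect=0.1, sd=1.0, n=20, n_studies=5000, seed=1):
    """Return (share of significant studies, average exaggeration among them)."""
    rng = random.Random(seed)
    results = [simulate_study(true_effect, sd, n, rng) for _ in range(n_studies)]
    # Keep only the "publishable" studies, as a journal filter would.
    significant = [abs(est) for est, is_sig in results if is_sig]
    share = len(significant) / n_studies
    ratio = statistics.fmean(significant) / true_effect
    return share, ratio


share, ratio = exaggeration()
print(f"share of studies reaching significance: {share:.2f}")
print(f"significant estimates overstate the true effect by about {ratio:.1f}x")
```

In this setup only a small share of studies reach significance, and those that do can only get there by overestimating the true effect several times over. A faithful replication would be expected to find something much closer to the small true value, which is why such findings tend to vanish.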
Without intending to, universities and funding agencies encourage such distortion of experimental outcomes by rewarding sensational findings. Research groups are encouraged to issue “press releases”, which often present overblown claims that lead to further distortion in the popular media, the latter being almost always on the lookout for clickbait titles. Department budgets are often decided by simply counting the number of publications produced by faculty without regard to what is in those papers. Funding agencies measure performance by metrics like the h-index and volume of publications. Obtaining third-party funding is sometimes a goal in itself for researchers, and this feeds into the cycle of pumping out more and more publications farmed for significance. The most fantastic recent example of this is the work of Cornell University food researcher Brian Wansink.
Apart from the distorted incentive structures in academia, a major cause of the replication crisis is scientists’ flawed understanding of statistical theory. In many experimentally oriented fields, statistical theory is taught in a very superficial manner. Students often learn cookbook methods from their advisors, mechanically following “recommendations” from self-styled experts. These students go on to become professors and editors-in-chief of journals, and perpetuate mistaken ideas from their new positions of authority. A widespread belief among many experimentalists in the psychological sciences is that one can answer a question definitively by running an experiment. However, the inherent uncertainty and ambiguity that is always present in a statistical analysis is not widely appreciated.
As the statistician Andrew Gelman of Columbia University keeps pointing out on his blog, the possibility of measurement error is the big, dirty secret of experimental science. Another statistician recently made the startling observation that if measurement error were taken into account in every study in the social sciences, nothing would ever get published.
There have been two diametrically opposed responses to the replication crisis. One group of researchers is facing the crisis head on, by instituting new measures that (they hope) will mitigate problems. These measures include pre-registered reports, whereby a planned analysis is peer-reviewed in the usual manner and accepted by a journal even before the experiment is run, so that there is no incentive any more to farm the results for significance. This way, no matter what the outcome of the study, the result will be published.
Another important move has been towards making statistical analysis code and data openly available. Even today, scientists often refuse to release their published data and code, making it impossible to retrace the steps that led to a published result or to check for potential confounding factors in the study. To counteract this tendency, movements are starting worldwide to implement research transparency.
A second group of scientists simply refuses to accept that there is a replication crisis. This group includes eminent professors such as Bargh (Yale University) and Daniel T. Gilbert (Harvard University); the latter has even gone on record to say that the replication rate of the Science article discussed above may be “statistically indistinguishable from 100%.” Others, such as Susan Fiske (Princeton University), have referred to critics of published work as “methodological terrorists”. Researchers in this group either have an incomplete understanding of the statistical issues behind non-replicability or are simply unwilling to accept that their field is facing a crisis.
What will it take for this crisis to end? Apart from obvious things, such as changing incentives at the institutional level, and the changes already happening (open science, pre-registered analyses, replications), a change in attitude is necessary. This has to come from individual scientists in leadership positions. First, researchers need to cultivate a learner’s mindset: they should be willing to admit that a particular result may not be true. Researchers often succumb to the instinct to stand their ground and to defend their position. Cultivating uncertainty about one’s own beliefs is hard – but there is no other way in science. Senior scientists should lead by example, by embracing uncertainty and trying to falsify their favourite theories. Second, the quality of statistical training provided to experimental scientists has to improve. We need to prepare the next generation to go beyond a point-and-click mentality.
Will these changes happen? At least in the psychological sciences, things are already better today than in 2015, when the Science study came out. More and more young scientists are becoming aware of problems and they are addressing them constructively through open data practices and other measures. Modern developments like MOOCs are also likely to help in disseminating statistical theory and practice. Researchers have also been posting their data on the Open Science Framework before their papers are published. It remains to be seen how much of an improvement this will lead to in the long run.
Shravan Vasishth is a professor of psycholinguistics and neurolinguistics at the Department of Linguistics, University of Potsdam, Germany.