Health

We Found Only One-Third of Published Psychology Research Is Reliable – Now What?

What does it mean if the majority of what’s published in journals can’t be reproduced? Credit: Maggie Villiger, CC BY-ND

The ability to repeat a study and find the same results twice is a prerequisite for building scientific knowledge. Replication allows us to ensure empirical findings are reliable and refines our understanding of when a finding occurs. It may surprise you to learn, then, that scientists do not often conduct – much less publish – attempted replications of existing studies.

Journals prefer to publish novel, cutting-edge research. And professional advancement is determined by making new discoveries, not painstakingly confirming claims that are already on the books. As one of our colleagues recently put it, “Running replications is fine for other people, but I have better ways to spend my precious time.”

Once a paper appears in a peer-reviewed journal, it acquires a kind of magical, unassailable authority. News outlets, and sometimes even scientists themselves, will cite these findings without a trace of skepticism. Such unquestioning confidence in new studies is likely undeserved, or at least premature.

A small but vocal contingent of researchers – addressing fields ranging from physics to medicine to economics – has maintained that many, perhaps most, published studies are wrong. But how bad is this problem, exactly? And what features make a study more or less likely to turn out to be true?

We are two of the 270 researchers who together have just published in the journal Science the first-ever large-scale effort to answer these questions: an attempt to reproduce 100 previously published psychological science findings.

Attempting to re-find psychology findings

The results are bound and shelved – but are they reproducible? Credit: Maggie Villiger, CC BY-ND

Publishing together as the Open Science Collaboration and coordinated by social psychologist Brian Nosek of the Center for Open Science, research teams from around the world each ran a replication of a study published in one of three top psychology journals – Psychological Science; Journal of Personality and Social Psychology; and Journal of Experimental Psychology: Learning, Memory, and Cognition. To make each replication as exact as possible, the research teams obtained study materials from the original authors and worked closely with those authors whenever they could.

Almost all of the original published studies (97%) had statistically significant results. This is as you’d expect – while many experiments fail to uncover meaningful results, scientists tend only to publish the ones that do.

When these 100 studies were rerun by other researchers, however, only 36% reached statistical significance. This number is alarmingly low. Put another way, only around one-third of the rerun studies came out with the same results that were found the first time around. That rate is especially low when you consider that, once published, findings tend to be held as gospel.

The bad news doesn’t end there. Even when the new study found evidence for the existence of the original finding, the magnitude of the effect was much smaller — half the size of the original, on average.

One caveat: just because something fails to replicate doesn’t mean it isn’t true. Some of these failures could be due to luck, poor execution, or an incomplete understanding of the circumstances needed to show the effect (scientists call these “moderators” or “boundary conditions”). For example, having someone practice a task repeatedly might improve their memory, but only if they didn’t know the task well to begin with. In a way, these replications (and failed replications) serve to highlight the inherent uncertainty of any single study – original or new.

More robust findings are more replicable

Given how low these numbers are, is there anything we can do to predict the studies that will replicate and those that won’t? The results from this Reproducibility Project offer some clues.

There are two major ways that researchers quantify the nature of their results. The first is a p-value, which estimates the probability that the result was arrived at purely by chance and is a false positive. (Technically, the p-value is the chance that the result, or a stronger result, would have occurred even when there was no real effect.) Generally, if a statistical test shows that the p-value is lower than 5%, the study’s results are considered “significant” – most likely due to actual effects.
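
To make that parenthetical definition concrete, here is a toy simulation (a sketch of our own in Python, not an analysis from the published study): if you repeatedly compare two groups drawn from the very same population – so no real effect exists – roughly 5% of those comparisons will still fall under the p < .05 line by chance.

```python
# Illustrative sketch only: what the p < .05 threshold means when no real effect exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_simulations = 10_000
false_positives = 0

for _ in range(n_simulations):
    # Two groups drawn from the SAME distribution, i.e. the null hypothesis is true.
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:  # the conventional significance threshold
        false_positives += 1

# Expect a false-positive rate of roughly 0.05.
print(f"False-positive rate: {false_positives / n_simulations:.3f}")
```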

Another way to quantify a result is with an effect size – not how reliable the difference is, but how big it is. Let’s say you find that people spend more money in a sad mood. Well, how much more money do they spend? This is the effect size.
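
For readers who like to see the arithmetic, here is a toy calculation in Python (the spending numbers are invented for illustration, not data from any real study) of both quantities for the sad-mood example: a p-value from a t-test and a standardized effect size, Cohen’s d.

```python
# Illustrative sketch only: "how reliable is the difference" (p-value)
# versus "how big is the difference" (effect size) for made-up spending data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sad_spending = rng.normal(loc=55.0, scale=15.0, size=40)      # dollars spent while sad
neutral_spending = rng.normal(loc=48.0, scale=15.0, size=40)  # dollars spent while neutral

# p-value: how surprising this difference would be if mood made no difference at all.
_, p_value = stats.ttest_ind(sad_spending, neutral_spending)

# Cohen's d: the size of the difference, in pooled standard-deviation units.
pooled_sd = np.sqrt((sad_spending.var(ddof=1) + neutral_spending.var(ddof=1)) / 2)
cohens_d = (sad_spending.mean() - neutral_spending.mean()) / pooled_sd

print(f"p-value: {p_value:.3f}, Cohen's d: {cohens_d:.2f}")
```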

We found that the smaller the original study’s p-value and the larger its effect size, the more likely it was to replicate. Strong initial statistical evidence was a good marker of whether a finding was reproducible.

Studies that were rated as more challenging to conduct were less likely to replicate, as were findings that were considered surprising. For instance, if a study shows that reading lowers IQs, or if it uses a very obscure and unfamiliar methodology, we would do well to be skeptical of such data. Scientists are often rewarded for delivering results that dazzle and defy expectation, but extraordinary claims require extraordinary evidence.

Although our replication effort is novel in its scope and level of transparency – the methods and data for all replicated studies are available online – its results are consistent with previous work from other fields. Cancer biologists, for instance, have reported replication rates as low as 11%.

We have a problem. What’s the solution?

Recruitment of volunteers for new studies is ongoing. What about revisiting past findings? Credit: Maggie Villiger, CC BY-ND

Recruitment of volunteers for new studies is ongoing. What about revisiting past findings? Credit: Maggie Villiger, CC BY-ND

Some conclusions seem warranted here.

We must stop treating single studies as unassailable authorities of the truth. Until a discovery has been thoroughly vetted and repeatedly observed, we should treat it with the measure of skepticism that scientific thinking requires. After all, the truly scientific mindset is critical, not credulous. There is a place for breakthrough findings and cutting-edge theories, but there is also merit in the slow, systematic checking and refining of those findings and theories.

Of course, adopting a skeptical attitude will take us only so far. We also need to provide incentives for reproducible science by rewarding those who conduct replications and who conduct replicable work. For instance, at least one top journal has begun to give special “badges” to articles that make their data and materials available, and the Berkeley Initiative for Transparency in the Social Sciences has established a prize for practicing more transparent social science.

Better research practices are also likely to ensure higher replication rates. There is already evidence that taking certain concrete steps – such as making hypotheses clear prior to data analysis, openly sharing materials and data, and following transparent reporting standards – decreases false positive rates in published studies. Some funding organizations are already demanding hypothesis registration and data sharing.

Although perfect replicability in published papers is an unrealistic goal, current replication rates are unacceptably low. The first step, as they say, is admitting you have a problem. What scientists and the public now choose to do with this information remains to be seen, but our collective response will guide the course of future scientific progress.

Elizabeth Gilbert is a PhD student in psychology at the University of Virginia, and Nina Strohminger is a postdoctoral fellow at the School of Management at Yale University.

This article was originally published on The Conversation.

  • HiroRoshi

    These days most ‘scientific’ findings published are pseudo-science at best. Last year I examined a few dozen articles from various science pubs and found that 75% were not presented with any indication of following the scientific method. Many were based on a single observation.

    An example is a study of warbler birds that were purported to avoid tornadoes before they occur. It was entirely based on one single observation. You can’t draw a line or even propose a direction with a single dot. It very well may be the case but there was not enough information to suggest this, especially for a science rag to publish it.

    Remember that faster than light neutrino claims a few years ago? General media and even science pubs were all over calling it ‘fact’ before it got peer review.

    There are too many ‘scientist’ wannabes, and they are desperate to get funding. They take shortcuts, establish half-researched findings as ‘scientific fact’ and hope someone listens. Unfortunately, too many do, as there is no such thing as journalistic integrity anymore, including with most science pubs. They just need clicks, and the more sensationalized the better.

  • monsoon23

    That’s crazy.

  • D.P.

    I think that this is true for all fields, to varying degrees. “Handle with care” is the only option available.

  • DUB

    Thanks to Elizabeth Gilbert and Nina Strohminger for this excellent article.

    The key point is really quite simple:
    “We must stop treating single studies as unassailable authorities of the truth. Until a discovery has been thoroughly vetted and repeatedly observed, we should treat it with the measure of skepticism that scientific thinking requires. After all, the truly scientific mindset is critical, not credulous. There is a place for breakthrough findings and cutting-edge theories, but there is also merit in the slow, systematic checking and refining of those findings and theories.”

    The scientific method demands that any proposed theory be repeatedly tested by the scientist who proposed it, then vetted and replicated many times over by other independent researchers. Anyone who publishes a scientific conclusion without sufficient self-testing and independent vetting and replication is not following the scientific method, is not doing science, and is not a reputable scientist. There is too much chasing of fame and fortune in our society today, even, sadly, in the scientific community.

  • http://www.realty.com/ Frank Lipsky

    Psychology is not a science when it does not contain good mathematical models of physical behaviour. It is simply junk science that not only cannot reach six sigma; its one-sigma levels are doubtful.
    What is not in doubt: big pharma will continue to exploit this fact with Paxil, Xanax, Adderall and 50-year-old tricyclics.
    An aside: if the SSRIs are so good, and my experience is they are, why are tricyclics still allowed on the market?

  • pegan

    This is an important article. I suspect that it will generate a lot of comments. But I would have liked to read examples of studies that defied replication, and how these studies have been taken as gospel but are now in question.

  • digitihead

    Why do you give the wrong definition of a p-value, then put its actual definition in parentheses? A p-value does not and cannot quantify whether the results are due to chance alone; it is calculated on the premise that the null hypothesis is true, and it quantifies how often we would see results as extreme or more extreme than what was observed under that premise. Basically, it is assumed that the results are by chance alone; the p-value does not tell you whether they actually happened by chance alone.

    Moreover, a p-value has the same interpretation whether it’s 0.9, 0.06, 0.05, or 0.001. The only difference is whether those numbers fall below an arbitrary line drawn by Ronald Fisher 80 years ago – a line that is scientifically indefensible today – and it’s that slavish adherence to it that is behind the mess you describe in this article.