Dawn of the Replications

16 September 2012 by Suzi Gage, posted in Uncategorized

A worrying wave of misconduct revelations is currently sweeping science (see this excellent article by Alok Jha for an overview, and Neuroskeptic’s 9 circles of scientific hell). Psychology seems to have borne the brunt of these thus far, but it’s unlikely to be a problem only there. A number of great suggestions as to how science can get its house in order and improve the quality of research have been made recently, in particular by Chris Chambers and Petroc Sumner, in this Guardian article. I’m going to focus on one of their suggestions: how it’s currently problematic, and how it can be improved.

Replication. The core of any scientific discovery, or it should be.

Replications: but each is slightly different

A small statistics lesson: p values represent the likelihood that an observed association is a chance finding (or more precisely the probability of obtaining the observed data if the null hypothesis you are testing is true). They’re useful as we can only take a sample of an underlying population to test, rather than testing everyone, so we need to take account of that when we assess our findings. Often (particularly in psychology) a cutoff of 0.05 is used (although there are issues here, but that’s another story, and indeed how the blog got its name). This means 5% of the time, results would suggest an association where one doesn’t truly exist in the underlying population.  If all scientists used this cutoff point, and every research paper was a unique single experiment, that means 5 in every 100 results where p=0.05 would be chance findings, or ‘false positives’.

This is a problem, but easily solvable. We do the experiment once, we find evidence of an association. Is this one of the 95, or one of the 5? Well, we do it again. New data, new analysis. And we do it again. And if the finding remains consistent over these replications, it’s much less likely to be chance.
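To put rough numbers on why repeated findings are reassuring, here is a minimal sketch in Python. It rests on assumptions that are not spelled out above: there is truly no effect, every study uses the same 0.05 cutoff, and the studies are exact, independent replications. The function name `prob_all_false_positive` is made up for illustration.

```python
def prob_all_false_positive(alpha: float, n_studies: int) -> float:
    """Chance that every one of n independent studies crosses the cutoff
    when the null hypothesis is true in all of them (an assumption)."""
    return alpha ** n_studies


# Original study alone, then with one and two independent replications.
for n_studies in (1, 2, 3):
    print(n_studies, prob_all_false_positive(0.05, n_studies))
```

Under those assumptions the chance drops from 1 in 20 for a single study to 1 in 400 for two and 1 in 8,000 for three. That only holds if each replication really is the same design on fresh data.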

Only this doesn’t happen enough, and when it does happen, it isn’t always accurate. Why not? I think it’s a problem with the culture of science: publication pressures and scarcity of funding conspiring to create an environment where replication is not rewarded. This is precisely the wrong way round. Replication is essential; in an ideal world, it would be replications of findings that were greeted with a media fanfare and published in the highest journals, rather than the original findings themselves. Now isn’t the time for pointing fingers, and I don’t know how it happened, but at present journals want novel findings, scientists want good publications and Universities want REF scores to secure funding. All this creates an environment where time-consuming replications are often ‘jazzed up’ in order to make them more likely to get published.

This is potentially very damaging. A replication that doesn’t find the original effect is very hard to get published. Sometimes people don’t even bother submitting such findings to journals, creating a publication bias; sometimes they aim very low and the results get ignored or missed. Worst of all, sometimes the original design will be tweaked, or many different variations on the original design all get tested. This can take the form of multiple or subgroup analyses (the more statistical tests you run, the more likely you are to find one that is publishable). Gene-environment interactions are a good example of multiple testing. One genetic locus can be analysed in a few ways (recessive, additive, dominant), the environmental exposure can often be categorised in a number of ways (e.g. presence/absence versus categorical scale of severity) and the outcome could be binary (with various thresholds), categorical or continuous. If you tested each combination, it’s easy to see how the number of statistical tests adds up, and selective reporting of the one analysis that finds a result adds to publication bias and gives misleading weight to a finding that may have been chance.
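The gene-environment example can be counted out in a short Python sketch. The three genetic models come from the paragraph above; the two exposure codings follow its example; the choice of exactly two binary thresholds for the outcome is my own assumption for illustration. The function `family_wise_error_rate` is a hypothetical helper, and it treats the tests as independent, which these overlapping analyses are not, so the real figure would differ.

```python
from itertools import product

genetic_models = ["recessive", "additive", "dominant"]
exposure_codings = ["presence/absence", "severity scale"]
# Two binary thresholds is an illustrative assumption, not a figure from the post.
outcome_codings = ["binary (threshold A)", "binary (threshold B)",
                   "categorical", "continuous"]

# Every combination of genetic model, exposure coding and outcome coding.
analyses = list(product(genetic_models, exposure_codings, outcome_codings))


def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one p < alpha across n_tests independent tests
    when no real effect exists in any of them."""
    return 1 - (1 - alpha) ** n_tests


n_tests = len(analyses)
print(n_tests)
print(round(family_wise_error_rate(0.05, n_tests), 3))
```

With these choices there are 24 possible analyses. If they were independent and there were no real effect, the chance of at least one crossing p=0.05 would be roughly 70%, which is why reporting only the analysis that 'worked' is so misleading.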

The combined forces of publication bias favouring ‘significant’ findings and the need for novel research to publish can lead to a lack of reporting of multiple statistical tests undertaken to find the ‘significant’ result, and the publication of the extremes of the distribution of findings, rather than a more representative spread. Publication bias is additionally problematic in that it can lead to a lot of wasted time. An initial finding might get a lot of attention, so a number of research groups may attempt to replicate. If they fail to find the same result, but don’t publish, more and more researchers may attempt to replicate in the future, wasting their time and important resources that could be used elsewhere.

If you replicate a finding, it should be the same design, with a new dataset. What can and sometimes does happen is that a slight tweak to the first replication leads to a further tweak for the next, and so on, until the study is barely recognisable from the original, and certainly not more evidence supporting the original finding. For example, the culture in genetics research supports replication. However, due to the pressures to publish novel findings, ‘replication’ can be loosely defined, so finding the same effect, but only in men with beards, can be published as a replication of an original study. This can be more damaging than there being no replication at all, as it gives more prominence to false positives.

We need a cultural shift in the importance of replications. Studies that attempt and fail to replicate need to be published with equal weight to those that do. Without the necessity to have a novel finding in order to publish, replications could get the journal space they need.

But perhaps even more importantly, a firm definition of ‘replication’ is needed, to prevent these dangerous partial replications from giving more weight to findings than there should be. Without a partial replication, a single finding is treated with caution, but ‘replication studies’ that cherry-pick from a number of sub-sample analyses and present only ‘significant’ findings as a replication are damaging. A true replication should use an identical study design, on a newly collected sample, and be from an independent lab (the problems caused by failure on this last point are illustrated well here).

Statistics are an incredibly useful tool to a scientist, but we need to use and interpret them correctly. We mustn’t be ‘toddlers with harpoons’ (yes, I watched the Thick of It just before writing this post); we need to wield our statistical arsenal with care.

11 Responses to “Dawn of the Replications”

  1. Mark Stokes

    Great post on a very important issue.

    As highlighted here, the core of the problem is the relatively lax statistical threshold that is required for publishing results in many fields. At p<.05, there is a one-in-twenty chance of a false positive, therefore it is vital the claims are validated by further research.

    An alternative (but closely related) solution is simply to agree on a more conservative statistical threshold for accepting an effect as significant in the first place (see my recent blog post: http://the-brain-box.blogspot.co.uk/2012/09/must-we-really-accept-1-in-20-false.html). Shifting the burden to individual studies would not depend on definitions of replication, or on increasing the academic reward for setting out to validate someone else's findings.

    Finally, a more substantial buffer between signal and noise would also probably discourage outright fraud, as a greater threshold needs to be crossed in the first place. Currently, the high chance of legitimate false positives presumably provides a fertile grey area for dubious practices to flourish.

  2. Suzi Gage

    Completely agree that p=0.05 is a wrong-headed way to assess 'statistical significance' (that's a whole other blog, as indeed you've written!), but regardless of where an arbitrary cutoff is put, partial replications are as damaging if not more so, as they propagate the false positives through the literature where a single study would be treated with more caution.

  3. Corneel

    Ooh yes. I love Blade Runner. That is Daryl Hannah, if memory serves me well.

  4. Martin Poulter

    Just want to suggest a refinement to the statistics lesson.

    You write "If all scientists used this cutoff point, and every research paper was a unique single experiment, that means 5 in every 100 results where p=0.05 would be chance findings, or ‘false positives’." This assumes that the null hypothesis is always true; that the scientists are looking for effects that are never there. If you drop this assumption, the maths gets a bit more complicated: it's no longer true that 5% of all results are false positives.

    Then in the next paragraph, "We do the experiment once, we find evidence of an association. Is this one of the 95, or one of the 5?" Well, if we keep the assumption that the null hypothesis is true, then this result is by definition one of the 5%. But you're not assuming that; you're quite rightly allowing the possibility there could be a real effect. So what do "the 95" and "the 5" refer to? They are not the percentages of veridical results and false positive results.

    It's not the case that 5% of positive results are going to be false positives. The probability of a false result, given a positive result, is not the same thing as the probability of a positive result given the null hypothesis: how to relate these two things is called the reverse inference problem, and it depends on additional information. If scientists are looking for effects that are never real, then 100% of the positive findings are false positives. If, on the other hand, scientists' hunches are always right and the null hypothesis is always false, then the proportion of false positives can be negligibly small. So I wouldn't write it as "the 5%" because this perpetuates a common statistical error (in the readers' mind, even though you're clear about it yourself). It also undermines the point that you otherwise eloquently make, if people wrongly get the idea that no more than 5% of results might be affected by this problem.

    There is a much cited 2005 paper by John Ioannidis that gives an equation relating the base rate of true hypotheses, the significance criterion, and the false positive rate (among other things). Cheers,

    • Suzi Gage

      Martin, you are of course correct. And I was trying so hard to avoid this pitfall! Really appreciate the comment, one of my favourite things about writing this blog is when I see or am shown ways I can improve my writing! Thanks for commenting.
