Dawn of the Replications
A worrying wave of misconduct revelations is currently sweeping science (see this excellent article by Alok Jha for an overview, and Neuroskeptic’s 9 circles of scientific hell). Psychology seems to have borne the brunt of these thus far, but it’s unlikely to be a problem only there. A number of great suggestions as to how science can get its house in order and improve the quality of research have been made recently, in particular by Chris Chambers and Petroc Sumner in this Guardian article. I’m going to focus on one of their suggestions: how it’s currently problematic, and how it can be improved.
Replication. The core of any scientific discovery, or it should be.
A small statistics lesson: a p value is the probability of obtaining data at least as extreme as those observed if the null hypothesis you are testing is true (it’s often loosely described as the likelihood that an observed association is a chance finding). P values are useful because we can only test a sample of an underlying population, rather than testing everyone, so we need to take account of that sampling when we assess our findings. Often (particularly in psychology) a cutoff of 0.05 is used (although there are issues here, but that’s another story, and indeed how the blog got its name). This means that where no association truly exists in the underlying population, results would still suggest one 5% of the time. If all scientists used this cutoff point, and every research paper was a unique single experiment testing a true null hypothesis, 5 in every 100 would report an association purely by chance: ‘false positives’.
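You can see that 5% false-positive rate directly by simulation. Here’s a minimal sketch (my own illustrative example, using a simple z-test on normal data rather than anything from a real study): draw two samples from the same population, so any ‘significant’ difference is chance by construction, and count how often p falls below 0.05.

```python
import math
import random

random.seed(1)

def null_p_value(n=50):
    """Two samples from the SAME normal population, so the null is true.

    Uses a z-test on the difference in means (fine for normal data with
    known variance 1) -- an illustrative sketch, not a full analysis.
    """
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    diff = sum(a) / n - sum(b) / n
    se = math.sqrt(2 / n)                    # standard error of the difference
    z = diff / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p value

trials = 10_000
false_positives = sum(null_p_value() < 0.05 for _ in range(trials))
print(f"False positive rate: {false_positives / trials:.3f}")  # close to 0.05
```

Every one of those ‘significant’ results is a fluke, and roughly 1 in 20 experiments produces one, exactly as the cutoff predicts.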
This is a problem, but easily solvable. We do the experiment once and find evidence of an association. Is this a real effect, or one of those 5-in-100 chance findings? Well, we do it again. New data, new analysis. And we do it again. If the finding remains consistent across these replications, it’s much less likely to be chance.
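The arithmetic behind this is simple: if each replication is an independent, identical experiment and the null hypothesis is true, the chance of every run coming up ‘significant’ shrinks geometrically. A quick sketch (assuming independent runs at the 0.05 cutoff):

```python
# If the null is true, each independent, identical experiment has only a
# 5% chance of a 'significant' fluke -- so k flukes in a row get rare fast.
alpha = 0.05
for k in (1, 2, 3):
    print(f"p < {alpha} in {k} independent run(s) by chance: {alpha ** k:.6f}")
```

One chance finding in 20 experiments is routine; the same fluke three times in a row happens about once in 8,000.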
Only this doesn’t happen enough, and when it does happen, it isn’t always accurate. Why not? I think it’s a problem with the culture of science: publication pressures and scarcity of funding conspiring to create an environment where replication is not rewarded. This is precisely the wrong way round. Replication is essential; in an ideal world, it would be replications of findings that were greeted with a media fanfare and published in the highest journals, rather than the original findings themselves. Now isn’t the time for pointing fingers, and I don’t know how it happened, but at present journals want novel findings, scientists want good publications and Universities want REF scores to secure funding. All this creates an environment where time-consuming replications are often ‘jazzed up’ in order to make them more likely to get published.
This is potentially very damaging. A replication that doesn’t find the original effect is very hard to get published. Sometimes people don’t even bother submitting such findings to journals, creating a publication bias; sometimes they submit to low-profile journals and the results get ignored or missed. Worst of all, sometimes the original design will be tweaked, or many different variations on the original design all get tested. This can take the form of multiple or subgroup analyses (the more statistical tests you run, the more likely you are to find one that is publishable). Gene-environment interactions are a good example of multiple testing. One genetic locus can be analysed in a few ways (recessive, additive, dominant), the environmental exposure can often be categorised in a number of ways (e.g. presence/absence versus a categorical scale of severity), and the outcome could be binary (with various thresholds), categorical or continuous. If you tested each combination, it’s easy to see how the number of statistical tests adds up, and selective reporting of the one analysis that finds a result adds to publication bias and gives misleading weight to a finding that may have been chance.
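To see how quickly those tests add up, put some hypothetical numbers on the gene-environment example: say three genetic models, two codings of the exposure, and three codings of the outcome. Treating the 18 resulting tests as roughly independent (a simplification; correlated tests inflate error a little less), the chance of at least one chance ‘significant’ result is well over half:

```python
# Hypothetical counts for the gene-environment example:
# 3 genetic models x 2 exposure codings x 3 outcome codings.
m = 3 * 2 * 3
alpha = 0.05

# Family-wise error rate, assuming (roughly) independent tests.
family_wise = 1 - (1 - alpha) ** m
print(f"{m} tests -> chance of at least one false positive: {family_wise:.2f}")
```

So a researcher who runs all 18 analyses and reports only the one that ‘worked’ is more likely than not to be reporting noise, even when nothing real is there.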
The combined forces of publication bias favouring ‘significant’ findings and the need for novel research to publish can lead to a lack of reporting of multiple statistical tests undertaken to find the ‘significant’ result, and the publication of the extremes of the distribution of findings, rather than a more representative spread. Publication bias is additionally problematic in that it can lead to a lot of wasted time. An initial finding might get a lot of attention, so a number of research groups may attempt to replicate. If they fail to find the same result, but don’t publish, more and more researchers may attempt to replicate in the future, wasting their time and important resources that could be used elsewhere.
If you replicate a finding, it should be the same design, with a new dataset. What can and sometimes does happen is that a slight tweak to the first replication leads to a further tweak for the next, and so on, until the study is barely recognisable from the original, and certainly no longer evidence supporting the original finding. For example, the culture in genetics research supports replication. However, due to the pressures to publish novel findings, ‘replication’ can be loosely defined, so finding the same effect, but only in men with beards, can be published as a replication of an original study. This can be more damaging than no replication at all, as it gives more prominence to false positives.
We need a cultural shift in the importance of replications. Studies that attempt and fail to replicate need to be published with equal weight to those that do. Without the necessity to have a novel finding in order to publish, replications could get the journal space they need.
But perhaps even more importantly, a firm definition of ‘replication’ is needed, to prevent these dangerous partial replications from giving findings more weight than they deserve. Without any replication, a single finding is treated with caution; but ‘replication studies’ that cherry-pick from a number of sub-sample analyses and present only ‘significant’ findings as a replication are damaging. A true replication should use an identical study design, on a newly collected sample, and come from an independent lab (the problems caused by failure on this last point are illustrated well here).
Statistics are an incredibly useful tool for a scientist, but we need to use and interpret them correctly. We mustn’t be ‘toddlers with harpoons’ (yes, I watched The Thick of It just before writing this post); we need to wield our statistical arsenal with care.