On scientific publishing practices in the face of public health crises

22 Jun 2015 by Trevor Bedford

After 400 years and the rise of the internet, the manuscript remains sacrosanct to scientific discourse. The system of scientific journals and peer review prioritizes being right over being timely, which is usually not such a bad thing, but in the case of infectious disease outbreaks, it puts science at odds with public health efforts. In the recent case of the 2014-2015 West African Ebola epidemic, there was an incredibly well-done publication early on by Pardis Sabeti and colleagues in September 2014 that analyzed Ebola genome sequences from Sierra Leone through June 2014. Sequences from this paper were made available on GenBank in June. However, following this, little additional genomic data appeared, in large part due to difficulties in shipping samples out of West Africa. Because of the lack of data, during the height of the epidemic in fall 2014, we had no idea what was going on in terms of the ongoing evolution of the virus.

Instead, we are just now learning the details of the evolution and spread of the virus during the course of the epidemic. In the last month, we saw papers published by Tong et al. detailing 175 new Ebola genomes mostly from Western Sierra Leone, Carroll et al. detailing 179 new genomes mostly from Guinea, and Park et al. detailing 232 new genomes mostly from Eastern Sierra Leone. As discussed previously, I was involved in the evolutionary analyses in the Park et al. paper, which looked at purifying vs positive selection on the Ebola genome.

It’s unfortunate that it’s taken so long for these results to see the full light of day. At a minimum, genome sequences should be released to the community as soon as they’re in a cleanly sharable state. In these sorts of analyses, a single genome sequence isn’t so useful or relevant on its own; it’s the differences between sequences that give insight into evolution and transmission. We need to pool data to really understand what’s going on. In this case, the Tong et al. paper was first received by Nature on Jan 30, but sequences didn’t appear in GenBank until close to publication on May 13. The Carroll et al. paper was first received on April 9 and sequences were released publicly on May 11, prior to publication on June 17. This is a nice step forward. Most of the sequences from Park et al. were released as they came off the sequencing machine in December, January and March, prior to publication on June 18. Additionally, 85 genomes from Guinea were released in pre-publication form by the Institut Pasteur on May 26.

I see early data release as an important step forward for the community. Tying data release to publication seems like an obviously poor idea when there are public health implications (for more discussion see Yozwiak et al.). However, on Friday, I was quoted by the NYT as saying: “you could imagine a situation where you don’t really have to publish your Nature paper; instead, you make a blog post”. Here, I’m picturing something more than data sharing. Posting sequences is great. But it still means that insights are locked away prior to publication. We need to move away from the idea of a definitive published manuscript as being the only worthwhile target for scientific outputs. Andrew Rambaut has set up the public forum virological.org, which Paul Kellam and colleagues are now using to post both data and analyses of ongoing Ebola and MERS-CoV evolution in near real time. Data and analyses can be shared publicly online before publication. The paper can come later and represent the definitive analysis. I recognize that papers still need to be produced to fill out CVs and to cement the scientific literature, but a lot can be done before the paper appears. Preprints only get us part of the way there (although Haldane’s Sieve is a fantastic idea). They usually appear alongside initial manuscript submission in fairly polished form. We need lighter-weight discourse as well.

Previously, I blogged about the need for a “GitHub of Science”. I now believe that plain old GitHub is the “GitHub of Science”. Scripts and pipelines can be posted to GitHub; results and analyses can be discussed on public forums (like virological.org); figures can be posted to figshare; papers can be reviewed after publication via blog. There’s no technological limitation to making this work. We have all the tools we need at our disposal. Let’s commit to doing a better job of this. In some cases, actual lives are at stake.