We've just posted a manuscript to bioRxiv on transmission dynamics of Middle East respiratory syndrome (MERS) coronavirus, or MERS-CoV. MERS-CoV has been identified as the cause of sporadic outbreaks of severe respiratory illness in the Middle East, largely in the Arabian peninsula, since 2012. Its epidemiology has sometimes been described as mysterious, since only the most severe cases are usually admitted to hospitals, sometimes without reports of contact with camels, the accepted reservoir for the virus. This, together with large hospital-associated outbreaks of MERS, has suggested that there should be a sizeable contribution of community transmission to the ongoing outbreaks.

Although parallels between severe acute respiratory syndrome coronavirus (SARS-CoV) and MERS-CoV were inevitably going to be drawn, there have been clear indications that MERS-CoV is a different kind of beast. Unlike SARS-CoV, which spread rapidly to other countries, primary MERS cases have been restricted to the Arabian peninsula and outbreaks outside of it have been brought under control relatively quickly. This pattern is strongly suggestive of repeated zoonotic spillover, where the virus jumps into humans again and again in the area where the reservoir and humans overlap, but transmits poorly between humans and goes extinct. Despite this evidence, the pattern has not been clearly confirmed. We thought that genomic sequences could provide an ideal window into these epidemiological patterns.

In order to establish cross-species transmission with sequence data one would ideally need a large sample of viral sequences from the reservoir as well as the 'sink' host. These data exist for MERS-CoV, but have been sampled very unevenly. MERS-CoV genomes that we have collated are heavily skewed towards the human side (174 genomes) compared to the camel reservoir (100 genomes), in addition to human sequences coming predominantly from hospital outbreaks. What has been clear so far is that MERS-CoV sequences from camels are more distantly related to each other on average than MERS-CoV sequences from humans, but most ancestral state reconstruction methods that could be used to infer the host of MERS-CoV lineages are agnostic to such signals. This is where the structured coalescent comes in. By explicitly modelling the evolution of MERS-CoV in a population structured along host boundaries we can estimate migration rates between the two hosts.

We find exactly what we would expect – MERS-CoV is almost exclusively a virus of camels and humans are an incidental and ultimately dead-end host. None of the 56 viral lineages we saw entering humans ever made it out of humans to contribute to the long-term evolution of MERS-CoV. We went a bit further here and applied the logic we used in our paper on Zika virus in Florida. Having identified the cross-species transmission events we could ask what the distribution of clade sizes resulting from those spill-over events tells us about MERS-CoV transmissibility. We estimate that the basic reproductive number for MERS-CoV is almost certainly below 0.91, indicating that it is unlikely to establish self-sustaining transmission chains in humans. The corollary of this is that there must have been hundreds of MERS-CoV spill-over events from camels into humans, most probably restricted to primary cases.
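The link between clade sizes and transmissibility can be illustrated with a toy simulation. This is only a sketch and not the model from the manuscript: it assumes each case infects a Poisson-distributed number of others with mean R0 (real MERS-CoV outbreaks are likely more overdispersed), and the function names are made up for illustration. Under that assumption, a subcritical branching process started by one spill-over produces on average 1 / (1 - R0) cases, so small clades imply a low R0 and many separate introductions.

```python
import math
import random

def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's method (fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def cluster_size(rng, r0, cap=10_000):
    """Total human cases from one spill-over, index case included.

    Assumes Poisson offspring with mean r0; `cap` is a safety stop.
    """
    total = active = 1
    while active and total < cap:
        new_cases = sum(poisson(rng, r0) for _ in range(active))
        total += new_cases
        active = new_cases
    return total

def mean_cluster_size(r0, n_introductions=20_000, seed=1):
    """Average cluster size over many simulated introductions."""
    rng = random.Random(seed)
    sizes = [cluster_size(rng, r0) for _ in range(n_introductions)]
    return sum(sizes) / len(sizes)

if __name__ == "__main__":
    for r0 in (0.3, 0.5, 0.8):
        print(r0, round(mean_cluster_size(r0), 2), "expected", round(1 / (1 - r0), 2))
```

Read the other way round, the same relationship says that a given number of human cases divided by the mean cluster size gives a rough count of introductions, which is why a low R0 implies many spill-over events.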

What does this all mean for public healthcare response? For one, it's clear that camels are the sole focus of MERS-CoV evolution and until it is controlled there humans will be at risk. Second, as mentioned previously, MERS-CoV is different from SARS-CoV and the evidence so far indicates that MERS-CoV does not do so well in humans. And even though there is no selective pressure on the virus in camels to transmit effectively between humans, repeated spill-over events mean that if such a variant were to emerge in camels it is very likely to find itself in humans eventually. Lastly, there is (again) much to be said about sequence data. We are not at a stage where we can identify pathogens before they spread widely if they are good at human-to-human transmission, but for viruses like MERS-CoV that are new and capable of generating stuttering transmission chains sequence data are ideal. Genome sequences, when gathered consistently, across affected areas with appropriate metadata are an incredibly powerful tool that combines diagnostics, typing and detailed evolutionary history in a single standardised bundle that can be used and shared easily.

Over the past year, we have been privileged to get to work with fantastic colleagues at Oxford, Birmingham, University of São Paulo, FIOCRUZ Salvador, Scripps, USAMRIID and elsewhere on multiple studies tracing the genomic epidemiology of the Zika epidemic in the Americas. Although manuscripts were posted to bioRxiv in January and February, today sees their formal release in Nature (Faria et al and Grubaugh et al) and Nature Protocols (Quick et al). A fourth paper (Metsky et al) that we were not involved with was also published today.

The paper "Establishment and cryptic transmission of Zika virus in Brazil and the Americas" represents the outcome of the ZiBRA project, wherein an international team of scientists, led by groups from Brazil and the UK, traveled across the coast of NE Brazil with a mobile lab to conduct Zika diagnostic surveillance and sequencing. I tagged along for a portion of the trip and mainly helped to sort out bioinformatics and metadata. Later on, Alli flew to Salvador and São Paulo to assist with the final sequencing push. The 53 Zika genomes contributed by the ZiBRA project have done much to resolve the origins of the Zika epidemic in the Americas. It is now clear that the Zika epidemic derives from a single introduction into NE Brazil sometime between Aug and Dec 2013. However, the first diagnostically confirmed case of Zika wasn't until Mar 2015. Thus, there had been over a year of cryptic transmission and by the time Zika was first identified it had already spread throughout much of Brazil.

The paper "Genomic epidemiology reveals multiple introductions of Zika virus into the United States" investigates the only sizable Zika outbreak in the USA that has so far occurred. Last summer, over 250 cases without travel history were reported in Miami-Dade county. In this paper, teams from Scripps and USAMRIID sequenced 29 human cases and 7 pooled mosquito samples from local traps. Gytis played a significant role in the phylogenetic analysis and in investigating travel connections between Miami and Zika endemic areas. I helped with the epidemiological modeling. I was pleased to work out a model to estimate R0 from the degree of clustering observed in the phylogeny along with known case counts. This work showed (at least to me) a surprising degree of clustering and hence significant ongoing local transmission throughout summer 2016. As in Brazil, Zika arrived in Florida earlier than expected from case diagnostics alone. We also observe a strong Caribbean connection in imports of Zika into Miami.

All three groups were fantastic about sharing sequences and I did my best to keep nextstrain.org updated as genomes were released. At this point, Nextstrain shows a comprehensive tree putting all of these Zika genomes (along with others) into a unified context.

Artwork courtesy of Sharon Isern.

We just had a big paper accepted in Nature, which looks at the entirety of the West African Ebola virus epidemic of 2013-2015. The project has existed in a variety of incarnations for well over a year with hints here and there of something big in the making, which unsurprisingly is the last bit of work that followed me from my PhD in Edinburgh to Seattle.


Whereas most publications over the last couple of years have focused on specific regions of the three most affected countries (Sierra Leone, Liberia and Guinea) and over specific time periods, we have analysed all publicly available data (comprising over 5% of all known Ebola virus disease cases!) to arrive at an overarching narrative for the epidemic. By using a Bayesian generalised linear model jointly with phylogenetic inference we not only reconstructed the history of the epidemic from its beginning to its end, but also inferred where the virus had been and what factors were associated with its spread. Our key findings are that:

  • Ebola virus migration largely followed a classic gravity model with international borders acting as potent barriers. Large population centers tended to receive more infected travellers, especially if incoming cases were from locations that were physically closer. However, migration was reduced if locations were in different countries (i.e. separated by an international border) and further apart.

  • Regions immediately bordering the three most affected countries, in Guinea-Bissau, Senegal, Mali, and Cote d'Ivoire, were spared their own Ebola outbreaks largely because of their remoteness. By looking into correlates of local Ebola virus proliferation we identified regions of these four neighbouring countries that had the potential to develop large outbreaks, had the virus been introduced.

  • The population of Ebola virus in West Africa was comprised of small mobile transmission chains, rather than large sweeping outbreaks. Individual transmission chains had poor persistence within any given location, so migration played a key role in sustaining the epidemic.
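The gravity-model finding in the first bullet can be written as a log-linear predictor, which is the general shape a generalised linear model of migration rates takes. The sketch below is illustrative only: the function names and every coefficient value are hypothetical and are not estimates from the paper. It encodes just the qualitative pattern described above, with more migration between larger populations and less with distance or across an international border.

```python
import math

def log_migration_rate(pop_origin, pop_dest, distance_km, crosses_border,
                       intercept=-10.0, b_origin=0.5, b_dest=0.5,
                       b_distance=-1.0, b_border=-1.5):
    """Log-linear gravity model; all coefficients are made-up placeholders."""
    return (intercept
            + b_origin * math.log(pop_origin)      # bigger source, more exports
            + b_dest * math.log(pop_dest)          # bigger destination, more imports
            + b_distance * math.log(distance_km)   # further apart, less migration
            + b_border * (1.0 if crosses_border else 0.0))  # border penalty

def migration_rate(pop_origin, pop_dest, distance_km, crosses_border, **coefs):
    """Migration rate on the natural scale."""
    return math.exp(log_migration_rate(pop_origin, pop_dest, distance_km,
                                       crosses_border, **coefs))

if __name__ == "__main__":
    near = migration_rate(1e5, 1e6, 50, crosses_border=False)
    far = migration_rate(1e5, 1e6, 200, crosses_border=False)
    across = migration_rate(1e5, 1e6, 50, crosses_border=True)
    print(near > far, across < near)
```

Working on the log scale keeps each predictor's effect multiplicative on the rate, so the border term acts as a constant fold-reduction regardless of population size or distance.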

Beyond science our work exemplifies the slow turnover of ideas that has been taking place in the field. First and foremost, the West African epidemic was the first infectious disease epidemic of its kind, where the sequence data were generated, analysed and shared in real time. It is a huge step in the right direction and could not possibly have happened without the data collectors' efforts, support and trust. I sincerely hope that the collaborative spirit of 2014-2015 lives on for when the next outbreak hits.

Secondly, there is a lot to be said about leveraging cutting edge technologies against challenges like the West African Ebola virus epidemic. I doubt we could have learned nearly as much about this epidemic from small regions of the viral genome (as was convention for sequencing just years before), nor from a small number of strains. This wouldn't have happened were it not for vast advances in sequencing technology. Now you can pick and choose from any number of sequencing platforms tailored to your needs, be it identifying rare viral variants within individual patients or bringing a sequencer anywhere you go in your pocket.

Lastly, our work is one of many other publications that highlight how important sequencing is to modern infectious disease outbreak response. Sequencing is occasionally equated with stamp collecting, which unjustly ignores the unparalleled perspective that sequence data offer into the heart of any outbreak - the history of the pathogen itself. A handful of sequences go a long way in the right hands, like identifying the origins of the epidemic, documenting unusual patterns of transmission, tracking individual transmission chains in the last stages of the epidemic and linking flare ups of Ebola virus to latent infections, to name a few.

And finally I would like to point out that work of such proportions does not happen in a vacuum. Although there were many who have directly or indirectly contributed to this project, Philippe Lemey, Marc Suchard, Trevor Bedford, Andy Tatem, Luiz Max Carvalho and Andrew Rambaut were the real analytical, theoretical and creative masterminds behind the whole thing. Many thanks for letting me be a part of something this big and exciting.

Background

Avian influenza A(H7N9) has resulted in annual epidemics over the past five years in China. H7N9 has a high mortality rate of around 40% and the currently ongoing fifth epidemic is the largest yet, with 460 infections reported by February 27th, 2017 (WHO report). Transmission is still predominantly from poultry, where H7N9 viruses continue to circulate, and human-to-human transmission is thought to be rare. The CDC considers H7N9 to have "the greatest potential to cause a pandemic" of all influenza A viruses. For more information, see this CDC information page.

Nextstrain

Nextstrain now has the ability to display phylogenies and geographical data for both the NA and HA genes, drawn from over 1200 samples covering all five human epidemics of H7N9. This analysis was possible thanks to the data sharing of the influenza research community through GISAID. It is our hope that making these analyses available to the community will aid understanding of this epidemic as it unfolds. Please note that the analysis currently presented in nextstrain is preliminary and further research is required.

Phylogeny & Geographic Distribution

The HA phylogeny indicates that the expansion of a single lineage contributes 86% of sequences from the current epidemic. Temporal analysis indicates that this lineage originated during the 2015 (third) epidemic; however, only one isolate from this lineage was sampled during the fourth epidemic. Inference of the geographical distribution of H7N9 indicates frequent jumps throughout the eastern coast of China, with limited dispersion elsewhere. Host jumps have also been inferred; however, incomplete sampling restricts our ability to comment further.

Insertion in the Protease Cleavage Site

Highly pathogenic avian influenza viruses are often characterized by insertions in the host protease cleavage site, which enhance the cleavage of HA protein to HA1 and HA2 - a process necessary for infection. We find a lineage consisting of four isolates with a four amino-acid insertion (KRTA) in this region, in agreement with Iuliano et al. This insertion is not present in any other isolates, and this lineage contributes less than 10% of the sequences from the current epidemic. Interestingly, this lineage appears to have diverged from the lineage causing most fifth epidemic infections in mid-2014. Despite the lack of expansion during this epidemic, the potential for a highly pathogenic H7N9 variant is worrying and warrants closer inspection.

Reassortment

Reassortment led to the origin of human H7N9 and continues to play a role during the epidemics. The above figure shows that the current dominant clade in the HA phylogeny, which contributes the majority of fifth epidemic cases, is comprised of at least two NA clades due to reassortment. The isolates containing the protease cleavage site insertion (in HA) are monophyletic in both HA and NA segments.

Acknowledgements

A full list of labs and authors who have made data available for analysis in these samples is available in this spreadsheet. All figures from nextstrain. Many thanks to Gytis Dudas, Richard Neher and Trevor Bedford for assistance.

I've written before on the moral imperative for timely data sharing of pathogen genome sequences during an outbreak. With Alli Black and others from the lab and elsewhere, we've been attempting our first sequencing work from Zika samples collected from the US Virgin Islands. We have so far produced 11 Zika genomes. We have a GitHub repo for the sequencing work with detailed experimental protocols and bioinformatic pipelines. We've also released a "marker paper" on bioRxiv that spells out analyses that we intend to do with these data (in this case, an in-depth look at the USVI Zika outbreak). We've pushed all 11 genomes to nextstrain.org/zika.

Richard Neher and I have compiled another report on recent patterns of seasonal influenza virus evolution with an eye toward projecting forward to the SH 2017 and the NH 2017-2018 flu seasons. All analyses are based on the nextflu platform.

This time, there's little action in H1N1pdm, Vic and Yam, which are showing limited variation within their populations. However, there has arisen substantial variation within H3N2 viruses, wherein multiple competing clades are currently vying for success. The previously noticed 171K clade did indeed continue to dominate in the population, but there are now credible competitors arising as well. At this point, it's difficult to perceive an obvious winner among these competing lineages, though 171K/121K and T131K/R142K are strong contenders.

I've generally really liked bioRxiv as a venue for these sorts of "technical reports". I'm treating these reports as almost on par with a publication. Although there is no peer review, as the timescale doesn't allow for it, they are still something that I base scientific reputation on.

I'm honored to announce that Richard Neher and I have won the Open Science Prize for our work on nextstrain.org. This has been a really fun journey. The initial idea stemmed from a workshop at the Kavli Institute in Santa Barbara in summer 2014, where there were lots of discussions between me, Richard, Michael Lässig, Marta Łuksza and Colin Russell about flu forecasting. This inspired me to start on a prototype pipeline that would download flu data, build trees and do a simple D3 visualization. I put this up on GitHub and wasn't doing much with it until Richard picked it up and used it for a project on "local branching index". We joined forces at this point and ended up with the first version of nextflu in February 2015. We've been working steadily to improve flu functionality since then. The next major innovation came in summer 2015, when Nick Loman contacted us about getting Ebola phylogenies shared from his on-the-ground work. We stood up a pipeline heavily borrowed from flu in June 2015 and continued updating the site as new data came in from Nick, Josh Quick, Matt Cotten, Ian Goodfellow and others. Since then, we've been trying to stay on top of Zika virus, with an initial version going up in Feb 2016 with all of the 17 available Zika genomes. Nick Loman (again), Oli Pybus and Kristian Andersen have been great at sharing sequences for this. With Alli Black in the lab, we also got involved in the actual sequencing work in Brazil and in the USVI.

So it seems fitting that, almost exactly two years after the initial launch of nextflu in Feb 2015, we're launching a completely revamped nextstrain.org site. We've been engaged in this refactor for almost 9 months now and it's finally out the door. We have a bunch of new features (like a zoomable map showing transmissions, multiple tree layouts, root-to-tip plots, multiselect filters, and sharable visualization state via the URL). All this was made possible by a lot of clever and dedicated work by Colin Megill and also James Hadfield. Check out the new site. We hope you find it interesting / useful.

We just posted a paper to bioRxiv looking at the dynamics of cross-species transmission of SIVs (HIV's close relatives that infect other species of primates). This was my Epidemiology MS thesis project here in the Bedford lab, and was my first computational project.

SIVs infect over 45 different species of primates, and HIV emerged as a human pathogen through at least 12 independent transmissions of SIVs from chimpanzees, gorillas, and sooty mangabeys to humans. Individual occurrences of SIVs switching hosts have been sporadically documented, but we still had no idea how regularly SIVs switch hosts -- i.e., we had no idea whether or not the transmissions that sparked the HIV pandemic were unusual occurrences.

Many of these viruses have been sequenced in recent years. While we weren't able to study them all, we were able to get enough sequence data (shout out to the fantastic Los Alamos National Labs database) to study the history of SIV cross-species transmission (CST) among 24 different primates. We used this data to assess how frequently viruses from different lineages recombine (part of one genome and part of another genome getting "pasted together"), and to look at how often they've switched hosts over evolutionary time. Our phylogenetic analysis found that SIV evolution has been shaped by at least 13 instances of interlineage recombination, and identified 14 novel, ancient CST events. We found that on average, each lineage of SIV switches hosts about once every 6.25 substitutions per site (these are funny units because SIVs are millions of years old, but they essentially mean the amount of evolutionary time required to see 6.25 substitutions in each site of the genome). We also observed more CST events between closely related primates, and found that viruses and hosts have extensively coevolved (and likely cospeciated). Taken together, our results show that SIV biology has been extensively shaped by CST, but it's still a rare phenomenon over evolutionary time.
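To make those units a little more concrete, here is a small worked conversion. It is a sketch only: it assumes host switches along a lineage occur as a Poisson process at a constant rate, which is a simplification introduced for illustration, and the names are hypothetical. One switch per 6.25 substitutions per site corresponds to a rate of 0.16 switches per substitution per site.

```python
import math

# Figure quoted in the text: one host switch per 6.25 substitutions per site.
SUBS_PER_SITE_PER_SWITCH = 6.25
SWITCH_RATE = 1.0 / SUBS_PER_SITE_PER_SWITCH  # 0.16 switches per subs/site

def prob_at_least_one_switch(branch_length):
    """Chance of one or more host switches along a branch.

    `branch_length` is in substitutions per site; assumes a constant-rate
    Poisson process, which is an illustrative simplification.
    """
    return 1.0 - math.exp(-SWITCH_RATE * branch_length)

if __name__ == "__main__":
    for length in (0.1, 1.0, 6.25):
        print(length, round(prob_at_least_one_switch(length), 3))
```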