We just had a big paper accepted in Nature, which looks at the entirety of the West African Ebola virus epidemic of 2013-2015. The project has existed in a variety of incarnations for well over a year, with hints here and there of something big in the making. Unsurprisingly, it is the last bit of work that followed me from my PhD in Edinburgh to Seattle.

Whereas most publications over the last couple of years have focused on specific regions of the three most affected countries (Sierra Leone, Liberia and Guinea) and over specific time periods, we have analysed all publicly available data (comprising over 5% of all known Ebola virus disease cases!) to arrive at an overarching narrative for the epidemic. By using a Bayesian generalised linear model jointly with phylogenetic inference we not only reconstructed the history of the epidemic from its beginning to its end, but also inferred where the virus had been and what factors were associated with its spread. Our key findings are that:

  • Ebola virus migration largely followed a classic gravity model with international borders acting as potent barriers. Large population centers tended to receive more infected travellers, especially if incoming cases were from locations that were physically closer. However, migration was reduced if locations were in different countries (i.e. separated by an international border) and further apart.

  • Regions immediately bordering the three most affected countries, in Guinea-Bissau, Senegal, Mali, and Cote d'Ivoire, were spared their own Ebola outbreaks largely because of their remoteness. By looking into correlates of local Ebola virus proliferation we identified regions of these four neighbouring countries that had the potential to develop large outbreaks, had the virus been introduced.

  • The population of Ebola virus in West Africa consisted of small, mobile transmission chains rather than large sweeping outbreaks. Individual transmission chains had poor persistence within any given location, so migration played a key role in sustaining the epidemic.
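
To make the gravity-model finding a bit more concrete, here is a minimal sketch of how such a model scores movement between two locations. It is not the model fitted in the paper: the log-linear form, the function name `gravity_rate` and all coefficient values are assumptions chosen purely for illustration.

```python
import math

def gravity_rate(pop_origin, pop_dest, distance_km, crosses_border,
                 beta_origin=1.0, beta_dest=1.0, beta_dist=1.0, beta_border=1.0):
    """Relative migration rate under a toy log-linear gravity model.

    All beta_* coefficients are hypothetical placeholders, not estimates
    from the paper.
    """
    log_rate = (
        beta_origin * math.log(pop_origin)    # larger origins send more travellers
        + beta_dest * math.log(pop_dest)      # larger destinations receive more
        - beta_dist * math.log(distance_km)   # distance reduces migration
        - beta_border * (1.0 if crosses_border else 0.0)  # border acts as a barrier
    )
    return math.exp(log_rate)

# Same pair of (made-up) locations, with and without a border in between.
within_country = gravity_rate(100_000, 1_000_000, 50, crosses_border=False)
across_border = gravity_rate(100_000, 1_000_000, 50, crosses_border=True)
print(across_border / within_country)  # fraction of migration that survives the border
```

With these placeholder coefficients the border term multiplies the rate by exp(-1); in a real analysis the coefficients would be inferred jointly with the phylogeny rather than fixed by hand.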

Beyond the science, our work exemplifies the slow turnover of ideas that has been taking place in the field. First and foremost, the West African epidemic was the first infectious disease epidemic of its kind, where the sequence data were generated, analysed and shared in real time. It is a huge step in the right direction and could not possibly have happened without the data collectors' efforts, support and trust. I sincerely hope that the collaborative spirit of 2014-2015 lives on for when the next outbreak hits.

Secondly, there is a lot to be said about leveraging cutting-edge technologies against challenges like the West African Ebola virus epidemic. I doubt we could have learned nearly as much about this epidemic from small regions of the viral genome (as was convention for sequencing just years before), nor from a small number of strains. This wouldn't have happened were it not for vast advances in sequencing technology. Now you can pick and choose from any number of sequencing platforms tailored to your needs, be it identifying rare viral variants within individual patients or bringing a sequencer anywhere you go in your pocket.

Lastly, our work is one of many publications that highlight how important sequencing is to modern infectious disease outbreak response. Sequencing is occasionally equated with stamp collecting, which unjustly ignores the unparalleled perspective that sequence data offer into the heart of any outbreak - the history of the pathogen itself. A handful of sequences goes a long way in the right hands: identifying the origins of the epidemic, documenting unusual patterns of transmission, tracking individual transmission chains in the last stages of the epidemic and linking flare-ups of Ebola virus to latent infections, to name a few.

And finally I would like to point out that work of such proportions does not happen in a vacuum. Although there were many who have directly or indirectly contributed to this project, Philippe Lemey, Marc Suchard, Trevor Bedford, Andy Tatem, Luiz Max Carvalho and Andrew Rambaut were the real analytical, theoretical and creative masterminds behind the whole thing. Many thanks for letting me be a part of something this big and exciting.


Avian influenza A(H7N9) has resulted in annual epidemics over the past five years in China. H7N9 has a high mortality rate of around 40% and the currently ongoing fifth epidemic is the largest yet, with 460 infections reported by February 27th, 2017 (WHO report). Transmission is still predominantly from poultry, where H7N9 viruses continue to circulate, and human-to-human transmission is thought to be rare. The CDC considers H7N9 to have "the greatest potential to cause a pandemic" of all influenza A viruses. For more information, see this CDC information page.


Nextstrain now has the ability to display phylogenies and geographical data for both the NA and HA genes, drawn from over 1200 samples covering all five human epidemics of H7N9. This analysis was possible thanks to the data sharing of the influenza research community through GISAID. It is our hope that making these analyses available to the community will aid understanding of this epidemic as it unfolds. Please note that the analysis currently presented in nextstrain is preliminary and further research is required.

Phylogeny & Geographic Distribution

The HA phylogeny indicates that the expansion of a single lineage contributes 86% of sequences from the current epidemic. Temporal analysis indicates that this lineage originated during the 2015 (third) epidemic; however, only one isolate from this lineage was sampled during the fourth epidemic. Inference of the geographical distribution of H7N9 indicates frequent jumps throughout the eastern coast of China, with limited dispersion elsewhere. Host jumps have also been inferred, but incomplete sampling restricts our ability to comment further.

Insertion in the Protease Cleavage Site

Highly pathogenic avian influenza viruses are often characterized by insertions in the host protease cleavage site, which enhance the cleavage of HA protein to HA1 and HA2 - a process necessary for infection. We find a lineage consisting of four isolates with a four-amino-acid insertion (KRTA) in this region, in agreement with Iuliano et al. This insertion is not present in any other isolates, and this lineage contributes less than 10% of the sequences from the current epidemic. Interestingly, this lineage appears to have diverged in mid-2014 from the lineage causing most fifth-epidemic infections. Despite the lack of expansion during this epidemic, the potential for a highly pathogenic H7N9 variant is worrying and warrants closer inspection.


Reassortment led to the origin of human H7N9 and continues to play a role during the epidemics. The above figure shows that the current dominant clade in the HA phylogeny, which contributes the majority of fifth-epidemic cases, comprises at least two NA clades due to reassortment. The isolates containing the protease cleavage site insertion (in HA) are monophyletic in both HA and NA segments.


A full list of labs and authors who have made data available for analysis in these samples is available in this spreadsheet. All figures from nextstrain. Many thanks to Gytis Dudas, Richard Neher and Trevor Bedford for assistance.

I've written before on the moral imperative for timely data sharing of pathogen genome sequences during an outbreak. With Alli Black and others from the lab and elsewhere, we've been attempting our first sequencing work from Zika samples collected from the US Virgin Islands. We have so far produced 11 Zika genomes. We have a GitHub repo for the sequencing work with detailed experimental protocols and bioinformatic pipelines. We've also released a "marker paper" on bioRxiv that spells out analyses that we intend to do with these data (in this case, an in-depth look at the USVI Zika outbreak). We've pushed all 11 genomes to nextstrain.org/zika.

Richard Neher and I have compiled another report on recent patterns of seasonal influenza virus evolution with an eye toward projecting forward to the SH 2017 and the NH 2017-2018 flu seasons. All analyses are based on the nextflu platform.

This time, there's little action in H1N1pdm, Vic and Yam, which are showing limited variation within their populations. However, there has arisen substantial variation within H3N2 viruses, wherein multiple competing clades are currently vying for success. The previously noticed 171K clade did indeed continue to dominate in the population, but there are now credible competitors arising as well. At this point, it's difficult to perceive an obvious winner among these competing lineages, though 171K/121K and T131K/R142K are strong contenders.

I've generally really liked bioRxiv as a venue for these sorts of "technical reports". I'm treating these reports almost on par with a publication. Although there is no peer review, as the timescale doesn't allow for it, they are still something that I base scientific reputation on.

I'm honored to announce that Richard Neher and I have won the Open Science Prize for our work on nextstrain.org. This has been a really fun journey. The initial idea stemmed from a workshop at the Kavli Institute in Santa Barbara in summer 2014, where there were lots of discussions between me, Richard, Michael Lässig, Marta Łuksza and Colin Russell about flu forecasting. This inspired me to start on a prototype pipeline that would download flu data, build trees and do a simple D3 visualization. I put this up on GitHub and wasn't doing much with it until Richard picked it up and used it for a project on "local branching index". We joined forces at this point and ended up with the first version of nextflu in February 2015. We've been working steadily to improve flu functionality since then. The next major innovation came in summer 2015, when Nick Loman contacted us about getting Ebola phylogenies shared from his on-the-ground work. We stood up a pipeline heavily borrowed from flu in June 2015 and continued updating the site as new data came in from Nick, Josh Quick, Matt Cotten, Ian Goodfellow and others. Since then, we've been trying to stay on top of Zika virus, with an initial version going up in Feb 2016 with all of the 17 available Zika genomes. Nick Loman (again), Oli Pybus and Kristian Andersen have been great at sharing sequences for this. With Alli Black in the lab, we also got involved in the actual sequencing work in Brazil and in the USVI.

So, it seems fitting that, almost exactly two years after the initial launch of nextflu in Feb 2015, we're launching a completely revamped nextstrain.org site. We've been engaged in this refactor for almost 9 months now and it's finally out the door. We have a bunch of new features (like a zoomable map showing transmissions, multiple tree layouts, root-to-tip plots, multiselect filters, and sharable visualization state via the URL). All this was made possible by a lot of clever and dedicated work by Colin Megill and also James Hadfield. Check out the new site. We hope you find it interesting / useful.

We just posted a paper to bioRxiv looking at the dynamics of cross-species transmission of SIVs (HIV's close relatives that infect other species of primates). This was my Epidemiology MS thesis project here in the Bedford lab, and was my first computational project.

SIVs infect over 45 different species of primates, and HIV emerged as a human pathogen through at least 12 independent transmissions of SIVs from chimpanzees, gorillas, and sooty mangabeys to humans. Individual occurrences of SIVs switching hosts have been sporadically documented, but we still had no idea how regularly SIVs switch hosts -- i.e., we had no idea whether or not the transmissions that sparked the HIV pandemic were unusual occurrences.

Many of these viruses have been sequenced in recent years. While we weren't able to study them all, we were able to get enough sequence data (shout out to the fantastic Los Alamos National Labs database) to study the history of SIV cross-species transmission (CST) among 24 different primates. We used these data to assess how frequently viruses from different lineages recombine (part of one genome and part of another genome getting "pasted together"), and to look at how often they've switched hosts over evolutionary time. Our phylogenetic analysis found that SIV evolution has been shaped by at least 13 instances of interlineage recombination, and identified 14 novel, ancient CST events. We found that, on average, each lineage of SIV switches hosts about once every 6.25 substitutions per site (these are funny units because SIVs are millions of years old, but they essentially mean the amount of evolutionary time required to see 6.25 substitutions in each site of the genome). We also observed more CST events between closely related primates, and found that viruses and hosts have extensively coevolved (and likely cospeciated). Taken together, our results show that SIV biology has been extensively shaped by CST, but it's still a rare phenomenon over evolutionary time.
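
Since the "once every 6.25 substitutions per site" units are a little unusual, here is a small worked example of what they imply. The 6.25 figure comes from the text above; treating host switches as a Poisson process along a branch is an assumption made only for this illustration, and the function names are hypothetical.

```python
import math

# From the text: on average one host switch per 6.25 substitutions per site.
MEAN_DISTANCE_PER_SWITCH = 6.25

def switch_rate(mean_distance=MEAN_DISTANCE_PER_SWITCH):
    """Host switches per lineage per substitution per site."""
    return 1.0 / mean_distance

def prob_at_least_one_switch(branch_length, mean_distance=MEAN_DISTANCE_PER_SWITCH):
    """Chance of one or more switches along a branch.

    Assumes switches occur as a Poisson process, which is an illustrative
    simplification. branch_length is in substitutions per site.
    """
    return 1.0 - math.exp(-branch_length * switch_rate(mean_distance))

print(switch_rate())                  # 0.16 switches per substitution per site
print(prob_at_least_one_switch(1.0))  # chance of a switch over 1 substitution per site
```

In other words, the average rate is 1 / 6.25 = 0.16 switches per lineage per substitution per site, which is why CST can shape SIV evolution and still be rare on any short stretch of the tree.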

A couple of months ago I tweeted that we had our first Bedford lab wet lab (full disclosure: it’s a bench, but start small, right?). Well, I’m excited to say that we have just released our first bit of Bedford lab-generated sequence data and pushed results to Nextstrain Zika!

These data are 5 (draft) Zika genomes from clinical samples collected in the U.S. Virgin Islands. After getting some experience sequencing on the MinION down in Brazil, I spent the first two weeks of December getting amplicons and sequencing on island in St. Croix. The Caribbean in December, it’s rough, I know. This work is in collaboration with the VI Department of Health, who have generously given me access to their samples and let me take over their lab when I've been down. As a doctoral student in epidemiology, it’s an incredible opportunity to run a study from start to finish, not to mention investigate an outbreak in close to real-time. I’m really excited about it.

Importantly, this has been a group effort. The fact that we have these data is a huge testament to the benefits of open science. I’m not a wet-lab scientist by training, and Zika is not the easiest virus to sequence. This project could have been really painful, and the fact that it hasn’t been owes a lot to the openness of other groups to share their knowledge, experience, and protocols. I’m so thankful to Josh Quick and Nate Grubaugh, who were incredibly responsive when I had questions or needed help with the protocol, and to Nick Loman for freely sharing his entire bioinformatic pipeline. Additionally, my lab has been amazing both as a sounding board for ideas and for helping with the data processing and analysis. To have so many people come together to help a project succeed is wonderful, especially so when you’re a student trying to figure things out for the first time. Keep an eye out for more data coming out soon! We'll keep the zika-seq project updated with new sequences as we generate them.

We've just published a paper in Virus Evolution investigating the evolutionary dynamics of infectious hematopoietic necrosis virus (IHNV) in Pacific salmon. This is work that I did during my Master's degree in the Kurath lab and that I continued to develop during the first year of my PhD here in the Bedford lab.

IHNV is endemic along the Pacific coast of Canada and the United States, from California up to Alaska, and also in the Columbia River Basin. The Columbia River Basin is a large and complex watershed draining most of Washington, Oregon and Idaho, and it supports one of the largest salmon runs in the continental US. There's a fair amount of interest in IHNV because it can cause severe epidemics, with up to 90% mortality rates, that can greatly affect conservation hatcheries and commercial aquaculture. Because there are no treatments for IHNV, a lot of effort goes into understanding viral transmission dynamics in the hopes of preventing big outbreaks.

For this paper we sequenced over 1200 viral isolates collected over a 40-year time period. We combined sequence data with epidemiologic data to explore possible relationships between evolutionary dynamics and epidemiological characteristics of the virus. Our work revealed two previously unrecognized subgroups of U genogroup IHNV that were associated with distinct epidemiologic patterns. One subgroup was detected more frequently in Chinook salmon and steelhead trout in the Columbia River Basin, while the other was detected more frequently in sockeye salmon in coastal watersheds. These associations were supported by FST and by phylogeographic analysis. Notably, the geographic structure we observed supports hypotheses that fish-to-fish transmission of IHNV occurs mainly in fresh water, when migratory fish populations are divided across watersheds.
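
For readers unfamiliar with FST, here is a minimal sketch of one common sequence-based form, 1 - Hw/Hb, where Hw is the mean number of differences between sequences from the same group and Hb is the mean between groups. Which estimator the paper actually used is not stated here, so treat this choice as an assumption. The sequences and group names below are made-up toy data, not IHNV isolates.

```python
from itertools import combinations, product

def hamming(a, b):
    """Number of differing sites between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def mean_within(group):
    """Mean pairwise differences among sequences in one group."""
    dists = [hamming(a, b) for a, b in combinations(group, 2)]
    return sum(dists) / len(dists)

def mean_between(group1, group2):
    """Mean pairwise differences between sequences from different groups."""
    dists = [hamming(a, b) for a, b in product(group1, group2)]
    return sum(dists) / len(dists)

def fst(group1, group2):
    """Sequence-based FST as 1 - Hw/Hb (estimator choice is an assumption)."""
    h_w = (mean_within(group1) + mean_within(group2)) / 2.0
    h_b = mean_between(group1, group2)
    if h_b == 0:
        raise ValueError("all sequences are identical; FST is undefined")
    return 1.0 - h_w / h_b

# Hypothetical toy alignments standing in for two subgroups.
basin = ["ACGT", "ACGT", "ACGA"]
coastal = ["TCGT", "TCGT", "TCGA"]
print(fst(basin, coastal))  # values near 1 mean strongly separated groups
```

Values close to 0 mean that sequences from the two groups are about as similar to each other as sequences within a group, while values close to 1 indicate strong genetic separation, such as the kind that would be expected if viruses rarely move between watersheds.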