Avian influenza A(H7N9) has resulted in annual epidemics over the past five years in China. H7N9 has a high mortality rate of around 40% and the currently ongoing fifth epidemic is the largest yet, with 460 infections reported by February 27th, 2017 (WHO report). Transmission is still predominantly from poultry, where H7N9 viruses continue to circulate, and human-to-human transmission is thought to be rare. The CDC considers H7N9 to have "the greatest potential to cause a pandemic" of all influenza A viruses. For more information, see this CDC information page.


Nextstrain now has the ability to display phylogenies and geographical data for both the NA and HA genes, drawn from over 1200 samples covering all five human epidemics of H7N9. This analysis was possible thanks to the data sharing of the influenza research community through GISAID. It is our hope that making these analyses available to the community will aid understanding of this epidemic as it unfolds. Please note that the analysis currently presented in nextstrain is preliminary and further research is required.

Phylogeny & Geographic Distribution

The HA phylogeny indicates that the expansion of a single lineage contributes 86% of sequences from the current epidemic. Temporal analysis indicates that this lineage originated during the 2015 (third) epidemic; however, only one isolate from this lineage was sampled during the fourth epidemic. Inference of the geographical distribution of H7N9 indicates frequent jumps throughout the eastern coast of China, with limited dispersion elsewhere. Host jumps have also been inferred; however, incomplete sampling restricts our ability to comment further.

Insertion in the Protease Cleavage Site

Highly pathogenic avian influenza viruses are often characterized by insertions in the host protease cleavage site, which enhance the cleavage of HA protein to HA1 and HA2 - a process necessary for infection. We find a lineage consisting of four isolates with a four amino-acid insertion (KRTA) in this region, in agreement with Iuliano et al. This insertion is not present in any other isolates, and this lineage contributes less than 10% of the sequences from the current epidemic. Interestingly, this lineage appears to have diverged from the lineage causing most fifth epidemic infections in mid-2014. Despite the lack of expansion during this epidemic, the potential for a highly pathogenic H7N9 variant is worrying and warrants closer inspection.


Reassortment led to the origin of human H7N9 and continues to play a role during the epidemics. The above figure shows that the current dominant clade in the HA phylogeny, which contributes the majority of fifth epidemic cases, comprises at least two NA clades due to reassortment. The isolates containing the protease cleavage site insertion (in HA) are monophyletic in both HA and NA segments.


A full list of labs and authors who have made data available for analysis in these samples is available in this spreadsheet. All figures from nextstrain. Many thanks to Gytis Dudas, Richard Neher and Trevor Bedford for assistance.

I've written before on the moral imperative for timely data sharing of pathogen genome sequences during an outbreak. With Alli Black and others from the lab and elsewhere, we've been attempting our first sequencing work from Zika samples collected from the US Virgin Islands. We have so far produced 11 Zika genomes. We have a GitHub repo for the sequencing work with detailed experimental protocols and bioinformatic pipelines. We've also released a "marker paper" on bioRxiv that spells out analyses that we intend to do with these data (in this case, an in-depth look at the USVI Zika outbreak). We've pushed all 11 genomes to

Richard Neher and I have compiled another report on recent patterns of seasonal influenza virus evolution with an eye toward projecting forward to the SH 2017 and the NH 2017-2018 flu seasons. All analyses are based on the nextflu platform.

This time, there's little action in H1N1pdm, Vic and Yam, which are showing limited variation within their populations. However, there has arisen substantial variation within H3N2 viruses, wherein multiple competing clades are currently vying for success. The previously noticed 171K clade did indeed continue to dominate in the population, but there are now credible competitors arising as well. At this point, it's difficult to perceive an obvious winner among these competing lineages, though 171K/121K and T131K/R142K are strong contenders.

I've generally really liked bioRxiv as a venue for these sorts of "technical reports". I'm treating these reports almost on par with a publication. Although there is no peer review, as the timescale doesn't allow for it, they are still something that I base scientific reputation on.

I'm honored to announce that Richard Neher and I have won the Open Science Prize for our work on This has been a really fun journey. The initial idea stemmed from a workshop at the Kavli Institute in Santa Barbara in summer 2014, where there were lots of discussions between me, Richard, Michael Lässig, Marta Łuksza and Colin Russell about flu forecasting. This inspired me to start on a prototype pipeline that would download flu data, build trees and do a simple D3 visualization. I put this up on GitHub and wasn't doing much with it until Richard picked it up and used it for a project on "local branching index". We joined forces at this point and ended up with the first version of nextflu in February 2015. We've been working steadily to improve flu functionality since then. The next major innovation came in summer 2015, when Nick Loman contacted us about getting Ebola phylogenies shared from his on-the-ground work. We stood up a pipeline heavily borrowed from flu in June 2015 and continued updating the site as new data came in from Nick, Josh Quick, Matt Cotten, Ian Goodfellow and others. Since then, we've been trying to stay on top of Zika virus, with an initial version going up in Feb 2016 with all of the 17 available Zika genomes. Nick Loman (again), Oli Pybus and Kristian Andersen have been great at sharing sequences for this. With Alli Black in the lab, we also got involved in the actual sequencing work in Brazil and in the USVI.

So, it seems fitting that, almost exactly two years after the initial launch of nextflu in Feb 2015, we're launching a completely revamped site. We've been engaged in this refactor for almost 9 months now and it's finally out the door. We have a bunch of new features (like a zoomable map showing transmissions, multiple tree layouts, root-to-tip plots, multiselect filters, and sharable visualization state via the URL). All this was made possible by a lot of clever and dedicated work by Colin Megill and also James Hadfield. Check out the new site. We hope you find it interesting / useful.

We just posted a paper to bioRxiv looking at the dynamics of cross-species transmission of SIVs (HIV's close relatives that infect other species of primates). This was my Epidemiology MS thesis project here in the Bedford lab, and was my first computational project.

SIVs infect over 45 different species of primates, and HIV emerged as a human pathogen through at least 12 independent transmissions of SIVs from chimpanzees, gorillas, and sooty mangabeys to humans. Individual occurrences of SIVs switching hosts have been sporadically documented, but we still had no idea how regularly SIVs switch hosts -- i.e., we had no idea whether or not the transmissions that sparked the HIV pandemic were unusual occurrences.

Many of these viruses have been sequenced in recent years. While we weren't able to study them all, we were able to get enough sequence data (shout out to the fantastic Los Alamos National Labs database) to study the history of SIV cross-species transmission (CST) among 24 different primates. We used this data to assess how frequently viruses from different lineages recombine (part of one genome and part of another genome getting "pasted together"), and to look at how often they've switched hosts over evolutionary time. Our phylogenetic analysis found that SIV evolution has been shaped by at least 13 instances of interlineage recombination, and identified 14 novel, ancient CST events. We found that on average, each lineage of SIV switches hosts about once every 6.25 substitutions per site (these are funny units because SIVs are millions of years old, but they essentially mean the amount of evolutionary time required to see 6.25 substitutions in each site of the genome). We also observed more CST events between closely related primates, and found that viruses and hosts have extensively coevolved (and likely cospeciated). Taken together, our results show that SIV biology has been extensively shaped by CST, but it's still a rare phenomenon over evolutionary time.
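To make those funny units a bit more concrete, here is a rough back-of-the-envelope sketch in Python. It only restates the 6.25 figure from above as a rate; the branch lengths in the loop are made up for illustration, and treating host switches as a constant-rate Poisson process is a simplifying assumption of this sketch, not a description of the analysis in the paper.

```python
import math

# From the text: on average one host switch per 6.25 substitutions per site.
SUBS_PER_SITE_PER_SWITCH = 6.25

# Equivalent rate: host switches per (substitution per site) of branch length.
SWITCH_RATE = 1.0 / SUBS_PER_SITE_PER_SWITCH  # 0.16


def expected_switches(branch_length):
    """Expected host switches along a lineage of the given length (subs/site)."""
    return SWITCH_RATE * branch_length


def prob_at_least_one_switch(branch_length):
    """Chance of one or more switches, ASSUMING switches follow a Poisson process."""
    return 1.0 - math.exp(-expected_switches(branch_length))


if __name__ == "__main__":
    # Hypothetical branch lengths in substitutions per site (not from the paper).
    for length in (0.5, 1.0, 6.25):
        print(length, expected_switches(length), prob_at_least_one_switch(length))
```

Read this way, a lineage has to accumulate a lot of evolutionary change before a host switch becomes likely, which is the sense in which CST is rare over evolutionary time.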

A couple of months ago I tweeted that we had our first Bedford lab wet lab (full disclosure: it’s a bench, but start small, right?). Well, I’m excited to say that we have just released our first bit of Bedford lab-generated sequence data and pushed results to Nextstrain Zika!

These data are 5 (draft) Zika genomes from clinical samples collected in the U.S. Virgin Islands. After getting some experience sequencing on the MinION down in Brazil, I spent the first two weeks of December getting amplicons and sequencing on island in St. Croix. The Caribbean in December, it’s rough, I know. This work is in collaboration with the VI Department of Health, who have generously given me access to their samples and let me take over their lab when I've been down. As a doctoral student in epidemiology, it’s an incredible opportunity to run a study from start to finish, not to mention investigate an outbreak in close to real-time. I’m really excited about it.

Importantly, this has been a group effort. The fact that we have this data is a huge testament to the benefits of open science. I’m not a wet-lab scientist by training, and Zika is not the easiest virus to sequence. This project could have been really painful, and the fact that it hasn’t been owes a lot to the openness of other groups to share their knowledge, experience, and protocols. I’m so thankful to Josh Quick and Nate Grubaugh who were incredibly responsive when I had questions or needed help with the protocol, and Nick Loman for freely sharing his entire bioinformatic pipeline. Additionally, my lab has been amazing both as a sounding board for ideas and for helping with the data processing and analysis. To have so many people come together to help a project succeed is wonderful, especially so when you’re a student trying to figure things out for the first time. Keep an eye out for more data coming out soon! We'll keep the zika-seq project updated with new sequences as we generate them.

We've just published a paper in Virus Evolution investigating the evolutionary dynamics of infectious hematopoietic necrosis virus (IHNV) in Pacific salmon. This is work that I did during my Master's degree in the Kurath lab and that I continued to develop during the first year of my PhD here in the Bedford lab.

IHNV is endemic along the Pacific coast of Canada and the United States, from California up to Alaska, and also in the Columbia River Basin. The Columbia River Basin is a large and complex watershed draining most of Washington, Oregon and Idaho, and it supports one of the largest salmon runs in the continental US. There's a fair amount of interest in IHNV because it can cause severe epidemics, with up to 90% mortality rates, that can greatly affect conservation hatcheries and commercial aquaculture. Because there are no treatments for IHNV, a lot of effort goes into understanding viral transmission dynamics in the hopes of preventing big outbreaks.

For this paper we sequenced over 1200 viral isolates collected over a 40-year time period. We combined sequence data with epidemiologic data to explore possible relationships between evolutionary dynamics and epidemiological characteristics of the virus. Our work revealed two previously unrecognized subgroups of U genogroup IHNV which were associated with distinct epidemiologic patterns. One subgroup was detected more frequently in Chinook salmon and steelhead trout in the Columbia River Basin, while the other was detected more frequently in sockeye salmon in coastal watersheds. These associations were supported by FST and by phylogeographic analysis. Notably, the geographic structure we observed supports hypotheses that fish-to-fish transmission of IHNV occurs mainly in fresh water, when migratory fish populations are divided across watersheds.

Back in May, we (Richard Neher and I) learned that was selected as a finalist for the Open Science Prize, a new initiative jointly funded by the NIH, the Wellcome Trust and HHMI. Each of the six finalists was asked to build a prototype of their project and present this prototype at the BD2K Open Data Science Symposium at the beginning of December. It was interesting seeing the other entries to the competition. As it turned out, everyone made a website, and each group was offering a layer of added value on top of publicly available data. One entry provides a platform for sharing health and genetic information for people suffering from rare diseases; another implements a database for worldwide air quality data. A few years ago, I wrote about the possibility of a GitHub of Science. At the time, I wasn't sure exactly what this meant. I had a vague idea that someone could take a paper and fork it and add additional analyses on top of the original. Now, the future seems much clearer:

Just as software APIs allow open source software to be built layer-upon-layer, all six of the Open Science Prize finalists supply something like an API in which inputs of publicly available data are processed to yield derived outputs that encourage sharing, synthesis and understanding. I can totally imagine a scientific ecosystem in which open science projects (websites) rely on a stack of data and outputs from other groups, but produce their own data and outputs for downstream analysis. With nextstrain, we'd like to do something like this for pathogen phylogenetics and provide a basis for downstream epidemiological and evolutionary analyses. It seems like such a model could grow to live alongside the dominant (and incredibly worthwhile) scientific discourse occurring via peer-reviewed publication.

There is now public voting to determine which three entries will move forward to the final round. Although I think all six OSP entries were pretty great, we'd very much appreciate your vote. Please go to and vote by Jan 6.

Watercolor courtesy of Matt Cotten.