This year has made us appreciate how little we understand about seasonal H3N2 influenza despite extensive research efforts since its emergence in 1968. One unavoidable fact is that H3N2 evolves rapidly, accumulating mutations to its hemagglutinin surface protein (HA) that enable it to escape our acquired immunity from previous infections or vaccinations. Most efforts to understand HA evolution focus on mutations that increase viral fitness by enabling escape from the immune system. These escape mutations are often described in the context of a fitness trade-off with viral replication or transmission. However, there have been no systematic studies of how individual mutations to HA affect these core replicative functions of H3N2. In a recent collaboration with the Bloom lab, led by Juhye Lee, we investigated the functional effects of all possible single amino-acid mutations to the HA of a single, recent H3N2 strain.

Juhye performed deep mutational scanning experiments to quantify the effects of mutations to HA on viral growth in cell culture. These experiments measured the preferred amino acid composition at each position in HA, allowing us to calculate the fitness effect of mutations from one amino acid to another. To determine whether our measurements approximated the fitness of mutations in natural populations, we investigated the evolutionary fates of 1321 mutations in H3N2 strains sampled from 1968 to 2018. Specifically, we compared the maximum global frequency each mutation reached in nature to its corresponding experimental mutational effect. We found that successful mutations in nature generally had neutral or beneficial experimental mutational effects, while unsuccessful mutations had deleterious mutational effects. This correlation between experimentally measured and natural fitness effects of H3N2 mutations disappeared when we substituted our H3N2 measurements with previous measurements for a lab-adapted H1N1 strain. Indeed, we observed a significant shift in preferred amino acid compositions between H3N2 and H1N1. It is possible that this shift reflects differences between the two viral lineages in both the folding of HA and the selective pressures constraining HA evolution.
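To make the step from amino acid preferences to a mutational effect concrete, here is a minimal Python sketch. It assumes the effect of a mutation is the log2 ratio of the mutant preference to the wild-type preference at that site. The preference values and the function name `mutational_effect` are made up for illustration and are not taken from the experiments.

```python
import math

def mutational_effect(site_prefs, wildtype, mutant):
    """Log2 ratio of mutant to wild-type preference at one site.

    Assumption: negative values are read as deleterious, values near
    zero as neutral, and positive values as beneficial.
    """
    return math.log2(site_prefs[mutant] / site_prefs[wildtype])

# Hypothetical preferences at a single HA site (illustrative values only).
example_prefs = {"K": 0.60, "R": 0.30, "E": 0.075, "D": 0.025}

effect_k_to_r = mutational_effect(example_prefs, "K", "R")
effect_k_to_e = mutational_effect(example_prefs, "K", "E")
print(effect_k_to_r, effect_k_to_e)
```

In this toy example both mutations come out negative, with K to E more strongly disfavored than K to R, because E has the lower preference at this site.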

Our results suggest that experimental measurements of mutational effects in HA can help predict the evolution of seasonal influenza within a specific lineage. While these measurements do not represent the true fitness of mutations in nature, they are an important first step toward filling a gap in our understanding of H3N2 evolution. This study also prepares us for future investigations of how mutations allow viruses to escape detection by human antibodies. The combination of deep mutational scanning measurements for viral growth and immune escape should allow us to build more accurate, experimentally-informed evolutionary models for seasonal influenza.

We're looking for a developer to ramp up our efforts with Nextstrain. Job advertisement follows:

A developer position is available immediately in the Bedford lab at the Fred Hutch to improve the backend infrastructure of Nextstrain and work with public health and academic partners to streamline data sharing and real-time analysis.

In collaboration with the Neher lab at the University of Basel, we've built the Nextstrain platform to conduct real-time genomic epidemiology to aid understanding of pathogen spread and improve outbreak response. Pathogen genomic data can reveal otherwise hidden connections between infections and be used to infer patterns of epidemic growth, geographic spread and adaptive evolution. However, only through open sharing of genomic data can these inferences be fully realized. Our aim with Nextstrain is to provide a platform for both data sharing and analysis. This platform won the Open Science Prize in Feb 2017 and has already been instrumental in the analysis of Ebola spread in West Africa and Zika spread in the Americas, and is used by the World Health Organization to aid in the process of influenza vaccine strain selection.

The codebase is completely open source. Currently, we use a data parsing / cleaning module to canonicalize data from disparate sources, a RethinkDB database to host clean data, an informatic / pipeline module to process genomic data into annotated evolutionary trees and a browser-based frontend to display interactive visualizations. All backend / compute is written in Python and all frontend is written in JavaScript. At this point, the frontend has seen more development than the backend. We are now looking to improve backend infrastructure to allow easier sharing of data from outside groups and to automatically run builds when new data appears. The developer in this position would be in charge of backend infrastructure, but would also work directly with public health and academic partners to incorporate new datasets and make an effective platform for applied genomic epidemiology.

The ideal candidate would have expertise in Python, databasing, bioinformatics and compute infrastructure. Database knowledge is required to host genomic data and provide APIs for outside groups to push data to a shared database. Informatics and compute knowledge is necessary to automatically spin up builds as new data appears. This broadly aligns with experience in backend web development. Experience with frontend web development, particularly JavaScript, React and D3, would be a plus, but not at all a requirement. The ideal candidate should also have excellent communication skills, as interfacing with collaborators is a key aspect of the position.

Primary job responsibilities include: (1) managing the Nextstrain database, (2) working with collaborators to keep data flowing through the Nextstrain pipeline and (3) building infrastructure to streamline (1) and (2).

The Fred Hutch is located in South Lake Union in Seattle, WA and offers a dynamic work environment with cutting-edge science and computational resources. The position is available immediately with flexible starting dates. Informal inquiries are welcome. Applications will be accepted until the position is filled. We offer a competitive salary commensurate with skills and experience, along with benefits. The Fred Hutch and the Bedford lab are committed to improving diversity in the computational sciences. Applicants of diverse backgrounds are particularly encouraged to apply. Depending on the applicant, this position could be filled by a full-time salaried employee, a part-time employee or a contracted consultant.

For more information about the lab, please see our website. To apply for the position, please send (1) a current resume, (2) code samples or links to published/distributed code and (3) contact information for two references.

In 2016 and 2017, mumps outbreaks were reported in several countries, and the CDC reported 5,629 cases within the United States. Washington state has among the highest incidence rates in the country, reporting 891 confirmed cases between October 2016 and September 2017. We are collaborating with the Washington State Department of Health to sequence mumps virus samples collected from throughout the outbreak. We will use these data to determine the number and size of distinct transmission clusters, describe where distinct transmission clusters were likely introduced from, and describe how the virus spread within the state. This analysis is greatly aided by recent mumps virus sequencing efforts by the British Columbia CDC and the Broad Institute, as pooling data provides critical context.

We recently completed sequencing the first batch of mumps virus genomes provided by the Washington State Department of Health and have released the first 27 draft genomes, along with protocols for sample preparation and sequencing. The 27 sequences from Washington state were collected between December 2016 and April 2017. The vast majority of these sequences (25 out of 27) cluster together within a single large clade, which we will refer to as the primary outbreak clade. This finding indicates that the majority of transmission within Washington likely occurred due to a single introduction of mumps followed by sustained person-to-person transmission. This conclusion may change as we sequence further viruses, which may provide evidence for additional clades of circulating viruses. The primary outbreak clade is closely related to all sequenced viruses sampled from the Arkansas outbreak, and is nested within the diversity of viruses sampled during Massachusetts outbreaks in 2016. Thus, we hypothesize that mumps outbreaks within the US are likely related. We also find a single genome from Washington state that clusters outside of the primary outbreak clade, which could represent a separate introduction of mumps virus to Washington that did not yield sustained person-to-person transmission. We note that further sequencing and more sophisticated genomic analysis are required to confidently determine the total number of introductions that occurred, and how each introduction contributed to observed transmission. Finally, the primary outbreak clade also includes a single sequence from British Columbia, providing evidence for some degree of transmission between the US and Canada.

We are aiming to sequence ~100 clinical samples in total, and will continue sequencing and adding data in the coming months. Stay tuned!

Richard Neher and I have compiled another report on recent patterns of seasonal influenza virus evolution that attempts to project forward to the Northern Hemisphere 2017-2018 and the Southern Hemisphere 2018 flu seasons. All analyses are based on the nextflu platform.

Here, we see relatively little going on in B/Vic and B/Yam viruses. Both have little genetic or antigenic variation and we expect the current vaccine choice to stay a good match. We observe a near sweep of a new genetic variant of A/H1N1pdm viruses. However, this new variant lacks antigenic differences that would warrant a vaccine update.

There is substantial variation within A/H3N2 viruses. These have continued to diversify and there are now multiple distinct antigenic variants circulating. We identify 5 major clades that are currently vying for dominance. It appears unlikely that a single variant will come to dominate the population in the near future; instead, there is likely to be continued circulating diversity. This makes the choice of vaccine strain highly difficult; it is not at all obvious that there is a better choice than the current A/Hong Kong/4801/2014 vaccine strain.

We've just posted a manuscript to bioRxiv on transmission dynamics of Middle East respiratory syndrome (MERS) coronavirus or MERS-CoV. MERS-CoV has been identified as the cause of sporadic outbreaks of severe respiratory illness in the Middle East, largely in the Arabian peninsula, since 2012. Its epidemiology has sometimes been described as mysterious, since only the most severe cases are usually admitted to hospitals, sometimes without reports of contact with camels, the accepted reservoir for the virus. That, as well as large hospital-associated outbreaks of MERS, have suggested that there should be a sizeable community transmission contribution to the ongoing outbreaks.

Although parallels between severe acute respiratory syndrome coronavirus (SARS-CoV) and MERS-CoV were inevitably going to be drawn, there have been clear indications that MERS-CoV is a different kind of beast. Unlike SARS-CoV, which spread rapidly to other countries, primary MERS cases have been restricted to the Arabian peninsula and outbreaks outside of it have been brought under control relatively quickly. This pattern is strongly suggestive of repeated zoonotic spillover, where the virus jumps into humans repeatedly in the area where the reservoir and humans overlap, but transmits poorly between humans and goes extinct. Despite this evidence, the pattern has not been clearly confirmed. We thought that genomic sequences could provide an ideal window into these epidemiological patterns.

In order to establish cross-species transmission with sequence data one would ideally need a large sample of viral sequences from the reservoir as well as the 'sink' host. These data exist for MERS-CoV, but have been sampled very unevenly. MERS-CoV genomes that we have collated are heavily skewed towards the human side (174 genomes) compared to the camel reservoir (100 genomes), in addition to human sequences coming predominantly from hospital outbreaks. What has been clear so far is that MERS-CoV sequences from camels are more distantly related to each other on average than MERS-CoV sequences from humans, but most ancestral state reconstruction methods that could be used to infer the host of MERS-CoV lineages are agnostic to such signals. This is where the structured coalescent comes in. By explicitly modelling the evolution of MERS-CoV in a population structured along host boundaries we can estimate migration rates between the two hosts.

We find exactly what we would expect – MERS-CoV is almost exclusively a virus of camels and humans are an incidental and ultimately dead-end host. None of the 56 viral lineages we saw entering humans ever made it out of humans to contribute to the long-term evolution of MERS-CoV. We went a bit further here and applied the logic we used in our paper on Zika virus in Florida. Having identified the cross-species transmission events we could ask what the distribution of clade sizes resulting from those spill-over events tells us about MERS-CoV transmissibility. We estimate that the basic reproductive number for MERS-CoV is almost certainly below 0.91, indicating that it is unlikely to establish self-sustaining transmission chains in humans. The corollary of this is that there must have been hundreds of MERS-CoV spill-over events from camels into humans, most probably restricted to primary cases.
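To illustrate how a distribution of clade sizes carries information about transmissibility, here is a minimal Python sketch. It is not the method used in the manuscript. It assumes each spill-over seeds a subcritical branching process with a negative binomial offspring distribution (mean `r0`, dispersion `k`), uses the standard closed form for the final cluster size in that model, and ignores incomplete sampling. The cluster sizes, the value of `k` and all function names are hypothetical.

```python
import math

def cluster_size_pmf(j, r0, k):
    """Probability that one introduction yields a cluster of exactly j cases.

    Assumes a branching process with negative binomial offspring
    (mean r0 < 1, dispersion k > 0); j counts the index case.
    """
    if j < 1:
        raise ValueError("cluster size must be at least 1")
    ratio = r0 / k
    log_p = (math.lgamma(k * j + j - 1) - math.lgamma(k * j) - math.lgamma(j + 1)
             + (j - 1) * math.log(ratio)
             - (k * j + j - 1) * math.log1p(ratio))
    return math.exp(log_p)

def log_likelihood(cluster_sizes, r0, k):
    """Log-likelihood of fully observed cluster sizes for a given r0 and k."""
    return sum(math.log(cluster_size_pmf(j, r0, k)) for j in cluster_sizes)

def estimate_r0(cluster_sizes, k, grid=None):
    """Grid-search maximum likelihood estimate of r0 on (0, 1)."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda r0: log_likelihood(cluster_sizes, r0, k))

# Hypothetical cluster sizes: mostly singletons plus one larger outbreak.
example_sizes = [1, 1, 1, 2, 5]
print(estimate_r0(example_sizes, k=0.5))
```

Under this model the maximum likelihood estimate equals one minus the reciprocal of the mean cluster size, so many small clusters point to a reproductive number well below 1. A real analysis would also need to account for unsampled cases, which is why the known case counts matter.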

What does this all mean for public healthcare response? For one, it's clear that camels are the sole focus of MERS-CoV evolution and until it is controlled there humans will be at risk. Second, as mentioned previously, MERS-CoV is different from SARS-CoV and the evidence so far indicates that MERS-CoV does not do so well in humans. And even though there is no selective pressure on the virus in camels to transmit effectively between humans, repeated spill-over events mean that if such a variant were to emerge in camels it is very likely to find itself in humans eventually. Lastly, there is (again) much to be said about sequence data. We are not at a stage where we can identify pathogens before they spread widely if they are good at human-to-human transmission, but for viruses like MERS-CoV that are new and capable of generating stuttering transmission chains sequence data are ideal. Genome sequences, when gathered consistently, across affected areas with appropriate metadata are an incredibly powerful tool that combines diagnostics, typing and detailed evolutionary history in a single standardised bundle that can be used and shared easily.

Over the past year, we have been privileged to get to work with fantastic colleagues at Oxford, Birmingham, University of São Paulo, FIOCRUZ Salvador, Scripps, USAMRIID and elsewhere on multiple studies tracing the genomic epidemiology of the Zika epidemic in the Americas. Although manuscripts were posted to bioRxiv in January and February, today sees their formal release in Nature (Faria et al and Grubaugh et al) and Nature Protocols (Quick et al). A fourth paper (Metsky et al) that we were not involved with was also published today.

The paper "Establishment and cryptic transmission of Zika virus in Brazil and the Americas" represents the outcome of the ZiBRA project, wherein an international team of scientists, led by groups from Brazil and the UK, traveled across the coast of NE Brazil with a mobile lab to conduct Zika diagnostic surveillance and sequencing. I tagged along for a portion of the trip and mainly helped to sort out bioinformatics and metadata. Later on, Alli flew to Salvador and São Paulo to assist with the final sequencing push. The 53 Zika genomes contributed by the ZiBRA project have done much to resolve the origins of the Zika epidemic in the Americas. It is now clear that the Zika epidemic derives from a single introduction into NE Brazil sometime between Aug and Dec 2013. However, the first diagnostically confirmed case of Zika wasn't until Mar 2015. Thus, there had been over a year of cryptic transmission and by the time Zika was first identified it had already spread throughout much of Brazil.

The paper "Genomic epidemiology reveals multiple introductions of Zika virus into the United States" investigates the only sizable Zika outbreak in the USA that has so far occurred. Last summer, over 250 cases without travel history were reported in Miami-Dade county. In this paper, teams from Scripps and USAMRIID sequenced 29 human cases and 7 pooled mosquito samples from local traps. Gytis played a significant role in the phylogenetic analysis and in investigating travel connections between Miami and Zika endemic areas. I helped with the epidemiological modeling. I was pleased to work out a model to estimate R0 from the degree of clustering observed in the phylogeny along with known case counts. This work showed (at least to me) a surprising degree of clustering and hence significant ongoing local transmission throughout summer 2016. As in Brazil, Zika arrived in Florida earlier than expected from case diagnostics alone. We also observe a strong Caribbean connection in imports of Zika into Miami.

All three groups were fantastic about sharing sequences and I did my best to keep Nextstrain updated as genomes were released. At this point, Nextstrain shows a comprehensive tree putting all of these Zika genomes (along with others) into a unified context.

Artwork courtesy of Sharon Isern.

We just had a big paper accepted in Nature, which looks at the entirety of the West African Ebola virus epidemic of 2013-2015. The project has existed in a variety of incarnations for well over a year with hints here and there of something big in the making, which unsurprisingly is the last bit of work that followed me from my PhD in Edinburgh to Seattle.

Whereas most publications over the last couple of years have focused on specific regions of the three most affected countries (Sierra Leone, Liberia and Guinea) and over specific time periods, we have analysed all publicly available data (comprising over 5% of all known Ebola virus disease cases!) to arrive at an overarching narrative for the epidemic. By using a Bayesian generalised linear model jointly with phylogenetic inference we not only reconstructed the history of the epidemic from its beginning to its end, but also inferred where the virus had been and what factors were associated with its spread. Our key findings are that:

  • Ebola virus migration largely followed a classic gravity model with international borders acting as potent barriers. Large population centers tended to receive more infected travellers, especially if incoming cases were from locations that were physically closer. However, migration was reduced if locations were in different countries (i.e. separated by an international border) and further apart.

  • Regions immediately bordering the three most affected countries, in Guinea-Bissau, Senegal, Mali, and Cote d'Ivoire, were spared their own Ebola outbreaks largely because of their remoteness. By looking into correlates of local Ebola virus proliferation we identified regions of these four neighbouring countries that had the potential to develop large outbreaks, had the virus been introduced.

  • The population of Ebola virus in West Africa consisted of small mobile transmission chains, rather than large sweeping outbreaks. Individual transmission chains had poor persistence within any given location, so migration played a key role in sustaining the epidemic.
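The gravity-model finding in the first point above can be sketched as a log-linear predictor of migration between two locations. This is a minimal Python illustration and not the model fitted in the paper: the function name, the coefficient values and the example populations and distances are all hypothetical.

```python
import math

def log_migration_rate(origin_pop, dest_pop, distance_km, crosses_border,
                       b_origin=1.0, b_dest=1.0, b_dist=1.0, b_border=1.0):
    """Log relative migration rate under a gravity-style log-linear model.

    Larger populations raise the rate, greater distance lowers it, and
    crossing an international border applies a fixed penalty.
    All coefficients are hypothetical placeholders.
    """
    border_term = 1.0 if crosses_border else 0.0
    return (b_origin * math.log(origin_pop)
            + b_dest * math.log(dest_pop)
            - b_dist * math.log(distance_km)
            - b_border * border_term)

# Hypothetical pair of locations, compared with and without a border between them.
within_country = log_migration_rate(500_000, 1_000_000, 100.0, crosses_border=False)
across_border = log_migration_rate(500_000, 1_000_000, 100.0, crosses_border=True)
print(within_country, across_border)
```

In a generalised linear model of this kind, each coefficient would be estimated from the data rather than fixed in advance, which is what lets the analysis say which factors were associated with spread.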

Beyond science our work exemplifies the slow turnover of ideas that has been taking place in the field. First and foremost, the West African epidemic was the first infectious disease epidemic of its kind, where the sequence data were generated, analysed and shared in real time. It is a huge step in the right direction and could not possibly have happened without the data collectors' efforts, support and trust. I sincerely hope that the collaborative spirit of 2014-2015 lives on for when the next outbreak hits.

Secondly, there is a lot to be said about leveraging cutting edge technologies against challenges like the West African Ebola virus epidemic. I doubt we could have learned nearly as much about this epidemic from small regions of the viral genome (as was convention for sequencing just years before), nor from a small number of strains. This wouldn't have happened were it not for vast advances in sequencing technology. Now you can pick and choose from any number of sequencing platforms tailored to your needs, be it identifying rare viral variants within individual patients or bringing a sequencer anywhere you go in your pocket.

Lastly, our work is one of many other publications that highlight how important sequencing is to modern infectious disease outbreak response. Sequencing is occasionally equated with stamp collecting, which unjustly ignores the unparalleled perspective that sequence data offer into the heart of any outbreak - the history of the pathogen itself. A handful of sequences go a long way in the right hands, like identifying the origins of the epidemic, documenting unusual patterns of transmission, tracking individual transmission chains in the last stages of the epidemic and linking flare ups of Ebola virus to latent infections, to name a few.

And finally I would like to point out that work of such proportions does not happen in a vacuum. Although there were many who have directly or indirectly contributed to this project, Philippe Lemey, Marc Suchard, Trevor Bedford, Andy Tatem, Luiz Max Carvalho and Andrew Rambaut were the real analytical, theoretical and creative masterminds behind the whole thing. Many thanks for letting me be a part of something this big and exciting.


Avian influenza A(H7N9) has resulted in annual epidemics over the past five years in China. H7N9 has a high mortality rate of around 40% and the currently ongoing fifth epidemic is the largest yet, with 460 infections reported by February 27th, 2017 (WHO report). Transmission is still predominantly from poultry, where H7N9 viruses continue to circulate, and human-to-human transmission is thought to be rare. The CDC considers H7N9 to have "the greatest potential to cause a pandemic" of all influenza A viruses. For more information, see this CDC information page.


Nextstrain now has the ability to display phylogenies and geographical data for both the NA and HA genes, drawn from over 1200 samples covering all five human epidemics of H7N9. This analysis was possible thanks to the data sharing of the influenza research community through GISAID. It is our hope that making these analyses available to the community will aid understanding of this epidemic as it unfolds. Please note that the analysis currently presented in nextstrain is preliminary and further research is required.

Phylogeny & Geographic Distribution

The HA phylogeny indicates that the expansion of a single lineage contributes 86% of sequences from the current epidemic. Temporal analysis indicates that this lineage originated during the 2015 (third) epidemic; however, only one isolate from this lineage was sampled during the fourth epidemic. Inference of the geographical distribution of H7N9 indicates frequent jumps throughout the eastern coast of China, with limited dispersion elsewhere. Host jumps have also been inferred; however, incomplete sampling restricts our ability to comment further.

Insertion in the Protease Cleavage Site

Highly pathogenic avian influenza viruses are often characterized by insertions in the host protease cleavage site, which enhance the cleavage of HA protein to HA1 and HA2 - a process necessary for infection. We find a lineage consisting of four isolates with a four amino-acid insertion (KRTA) in this region, in agreement with Iuliano et al. This insertion is not present in any other isolates, and this lineage contributes less than 10% of the sequences from the current epidemic. Interestingly, this lineage appears to have diverged from the lineage causing most fifth epidemic infections in mid-2014. Despite the lack of expansion during this epidemic, the potential for a highly pathogenic H7N9 variant is worrying and warrants closer inspection.


Reassortment led to the origin of human H7N9 and continues to play a role during the epidemics. The above figure shows that the current dominant clade in the HA phylogeny, which contributes the majority of fifth epidemic cases, comprises at least two NA clades due to reassortment. The isolates containing the protease cleavage site insertion (in HA) are monophyletic in both HA and NA segments.


A full list of labs and authors who have made data available for analysis in these samples is available in this spreadsheet. All figures from nextstrain. Many thanks to Gytis Dudas, Richard Neher and Trevor Bedford for assistance.