The field of genomic epidemiology focuses on using the genetic sequences of pathogens to understand patterns of transmission and spread. Viruses mutate very quickly and accumulate changes during the process of transmission from one infected individual to another. The novel coronavirus which is responsible for the emerging COVID-19 pandemic mutates at an average of about two mutations per month. After someone is exposed they will generally incubate the virus for ~5 days before symptoms develop and transmission occurs. Other research has shown that the "serial interval" of SARS-CoV-2 is ~7 days. You can think of a transmission chain as looking something like:

where, on average, we have 7 days from one infection to the next. As the virus transmits, it will mutate at this rate of two mutations per month. This means that, on average, every other step in the transmission chain will have a mutation, and so would look something like:

These mutations are generally really simple things. An 'A' might change to a 'T', or a 'G' to a 'C'. This changes the genetic code of the virus, but it's hard for a single letter change to do much to make the virus behave differently. However, with advances in technology, it's become readily feasible to sequence the genome of the novel coronavirus. This works by taking a swab from someone's nose and extracting the RNA in the sample and then determining the 'letters' of this RNA genome using chemistry and very powerful cameras. Each person's coronavirus infection will yield a sequence of 30,000 'A', 'T', 'G' or 'C' letters. We can use these sequences to reconstruct which infection is connected to which infection. As an example, if we sequenced three of these infections and found:

We could take the "genomes" ATTT, ATCT and GTCT and infer that the infection with sequence ATTT led to the infection with sequence ATCT, and that this infection in turn led to the infection with sequence GTCT. This approach allows us to learn about epidemiology and transmission in a completely novel way and can supplement more traditional contact tracing and case-based reporting.
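To make the toy example concrete, here is a minimal Python sketch that counts the letter differences between the three example "genomes". It is only an illustration of the idea, not how real trees are built (that takes full phylogenetic inference), and the `hamming` helper and variable names are made up for this example.

```python
from itertools import combinations

def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))

# The three toy "genomes" from the example above.
genomes = ["ATTT", "ATCT", "GTCT"]

# Pairwise mutation counts between every pair of genomes.
distances = {(a, b): hamming(a, b) for a, b in combinations(genomes, 2)}
for (a, b), d in distances.items():
    print(f"{a} vs {b}: {d} difference(s)")

# The genome closest to all the others sits in the middle of the chain.
middle = min(genomes, key=lambda g: sum(hamming(g, other) for other in genomes))
print("Intermediate genome:", middle)
```

ATCT is one change away from each of the other two, while ATTT and GTCT are two changes apart, so ATCT sits in the middle of the chain. Note that the distances alone give the ordering but not the direction; which end came first has to come from other information, such as sampling dates.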

For a few years now, we've been working on the Nextstrain software platform, which aims to make genomic epidemiology as rapid and as useful as possible. We had previously applied this to outbreaks like Ebola, Zika and seasonal flu. Owing to advances in technology and open data sharing, the genomes of 140 SARS-CoV-2 coronaviruses have been shared from all over the world via GISAID. As these genomes are shared, we download them from GISAID and incorporate them into a global map as quickly as possible and have an always up-to-date view of the genomic epidemiology of novel coronavirus at

The big picture looks like this at the moment:

where we can see the earliest infections in Wuhan, China in purple on the left side of the tree. All these genomes from Wuhan have a common ancestor in late Nov or early Dec, suggesting that this virus has emerged recently in the human population.

The first case in the USA was called "USA/WA1/2020". This was from a traveller directly returning from Wuhan to Snohomish County on Jan 15, with a swab collected on Jan 19. This virus was rapidly sequenced by the US CDC Division of Viral Diseases and shared publicly on Jan 24 (huge props to the CDC for this). We can zoom into the tree to place WA1 among related viruses:

The virus has an identical genome to the virus Fujian/8/2020 sampled in Fujian on Jan 21, also labeled as a travel export from Wuhan, suggesting a close relationship between these two cases.

Last week the Seattle Flu Study started screening samples for COVID-19 as described here. Soon after starting screening we found a first positive in a sample from Snohomish County. The case was remarkable in that it was a "community case", only the second recognized in the US, someone who had sought treatment for flu-like symptoms, been tested for flu and then sent home owing to mild disease. After this was diagnostically confirmed by Shoreline Public Health labs on Fri Feb 28 we were able to immediately get the sample USA/WA2/2020 on a sequencer and have a genome available on Sat Feb 29. The results were remarkable. The WA2 case was identical to WA1 except that it had three additional mutations.

This tree structure is consistent with WA2 being a direct descendant of WA1. If this virus arrived in Snohomish County in mid-January with the WA1 traveler from Wuhan and circulated locally for 5 weeks, we'd expect exactly this pattern, where the WA2 genome is a copy of the WA1 genome except that it has some mutations that have arisen over the 5 weeks that separate them.
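As a rough back-of-the-envelope check on that expectation, here is a minimal sketch that assumes mutations accumulate as a Poisson process at the rate of about two per month quoted earlier. The Poisson assumption and the 30-day month are simplifications introduced for this illustration, not part of the analysis itself.

```python
import math

MUTATIONS_PER_MONTH = 2.0   # average rate quoted earlier in the post
DAYS_PER_MONTH = 30.0       # simplifying assumption
DAYS_ELAPSED = 5 * 7        # roughly 5 weeks between WA1 and WA2

# Expected number of mutations accumulated over the elapsed time.
expected = MUTATIONS_PER_MONTH * DAYS_ELAPSED / DAYS_PER_MONTH

def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson distribution."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

p_three = poisson_pmf(3, expected)
print(f"Expected mutations over 5 weeks: {expected:.2f}")
print(f"P(exactly 3 mutations): {p_three:.2f}")
```

Under these assumptions we'd expect a little over two mutations in 5 weeks, so the three observed in WA2 are unsurprising.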

Again, this tree structure is consistent with a transmission chain leading from WA1 to WA2, but we wanted to assess the probability of this pattern arising by chance instead of direct transmission. Scientists often try to approach this situation by thinking of a "null model", i.e. if it was a coincidence, how likely a coincidence was it? Here, WA1 and WA2 share the same genetic variant at site 18060 in the virus genome, but only 2/59 sequenced viruses from China possess this variant. Given this low frequency, we'd expect the probability of WA2 randomly having the same genetic variant to be 2/59 ≈ 3%. To me, this is not quite conclusive evidence, but still strong evidence that WA2 is a direct descendant of WA1.
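The null-model arithmetic can be written out as a small sketch. It assumes that, if WA2 were an unrelated introduction, its genome would effectively be a random draw from the sequenced viruses from China; the function name is hypothetical and chosen only for this example.

```python
def chance_match_probability(carriers, sampled):
    """Probability that a randomly drawn sampled virus carries the variant."""
    if sampled <= 0 or not 0 <= carriers <= sampled:
        raise ValueError("need 0 <= carriers <= sampled and sampled > 0")
    return carriers / sampled

# 2 of the 59 sequenced viruses from China carry the variant at site 18060.
p_coincidence = chance_match_probability(carriers=2, sampled=59)
print(f"Chance of a coincidental match: {p_coincidence:.1%}")
```

This treats the 59 sequenced viruses as representative of what was circulating, which is itself an approximation given how few genomes were available.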

Additional evidence for the relationship between these cases comes from location. The Seattle Flu Study had screened viruses from all over the greater Seattle area; however, we got the positive hit in Snohomish County, with the two cases less than 15 miles apart. This by itself would only be suggestive, but combined with the genetic data, it is firmer evidence for continued transmission.

I've been referring to this scenario as "cryptic transmission". This is a technical term meaning "undetected transmission". Our best guess of a scenario looks something like:

We believe this may have occurred by the WA1 case having exposed someone else to the virus in the period between Jan 15 and Jan 19 before they were isolated. If this second case was mild or asymptomatic, contact tracing efforts by public health would have had difficulty detecting it. After this point, community spread occurred and was undetected due to the CDC's narrow case definition, which required direct travel to China or direct contact with a known case to even be considered for testing. This lack of testing was a critical error and allowed an outbreak in Snohomish County and surroundings to grow to a sizable problem before it was even detected.

Knowing that transmission was initiated on Jan 15 allows us to estimate the total number of infections that exist in this cluster today. Our preliminary analysis puts this at 570, with a 90% uncertainty interval of between 80 and 1500 infections.
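To give a feel for how a seeding date translates into a cluster size, here is a deliberately simple sketch of exponential growth from a single infection. It is not the preliminary analysis behind the 570 figure, and the doubling times used are hypothetical values picked purely for illustration; the elapsed time of roughly 7 weeks since Jan 15 is the one mentioned later in this post.

```python
def infections_after(days, doubling_time_days, initial=1.0):
    """Infections after `days` of steady exponential growth from `initial`."""
    if doubling_time_days <= 0:
        raise ValueError("doubling time must be positive")
    return initial * 2 ** (days / doubling_time_days)

DAYS_SINCE_SEEDING = 7 * 7  # roughly 7 weeks since the ~Jan 15 seeding event

# Hypothetical doubling times (assumptions, not estimates from the analysis).
for doubling_time in (5, 6, 7):
    estimate = infections_after(DAYS_SINCE_SEEDING, doubling_time)
    print(f"doubling every {doubling_time} days -> ~{estimate:.0f} infections")
```

The point of the sketch is how sensitive the answer is to the assumed growth rate: a day or two of difference in doubling time changes the total several-fold, which is why the estimate above comes with such a wide uncertainty interval.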

Back on Feb 8, I tweeted this thought experiment:

We know that Wuhan went from an index case in ~Nov-Dec 2019 to several thousand cases by mid-Jan 2020, thus going from initial seeding event to widespread local transmission in the span of ~9-10 weeks. We now believe that the Seattle area seeding event was ~Jan 15 and we're now ~7 weeks later. I expect Seattle now to look like Wuhan around ~1 Jan, when they were reporting the first clusters of patients with unexplained viral pneumonia. We are currently estimating ~600 infections in Seattle; this matches my phylodynamic estimate of the number of infections in Wuhan on Jan 1. Three weeks later, Wuhan had thousands of infections and was put on large-scale lock-down. However, these large-scale non-pharmaceutical interventions to create social distancing had a huge impact on the resulting epidemic. China averted many millions of infections through these intervention measures and cases there have declined substantially.

This suggests that this is controllable. We're at a critical juncture right now, but we can still mitigate this substantially.

Some ways to implement non-pharmaceutical interventions include:

  • Practicing social distancing, such as limiting attendance at events with large groups of people
  • Working from home, if your job and employer allow it
  • Staying home if you are feeling ill
  • Taking your temperature daily; if you develop a fever, self-isolate and call your doctor
  • Implementing good hand washing practices - it is extremely important to wash hands regularly
  • Covering coughs and sneezes in your elbow or tissue
  • Avoiding touching your eyes, nose, and mouth with unwashed hands
  • Disinfecting frequently touched surfaces, such as doorknobs
  • Beginning some preparations in anticipation of social distancing or supply chain shortages, such as ensuring you have sufficient supplies of prescription medicines and ensuring you have about a 2 week supply of food and other necessary household goods.
  • With these preparations in mind, it is important not to panic buy. Panic buying unnecessarily increases strain on supply chains and can make it difficult to ensure that everyone is able to get the supplies they need.

For more information please see:

I started following what's now referred to as "novel coronavirus (nCoV)" on Jan 6 when I started to notice reports of a cluster of viral pneumonia of unknown origin in Wuhan, China. Just 4 days later on Jan 10, a first genome was released on only to be followed by five more the following day via From very early on, it was clear that the nCoV genomes lacked the expected genetic diversity that would occur with repeated zoonotic events from a diverse animal reservoir. The most parsimonious explanation for this observation was that there was a single zoonotic spillover event into the human population in Wuhan between mid-Nov and mid-Dec and sustained human-to-human transmission from this point. However, at first I struggled to reconcile this lack of genetic diversity with WHO reports of "limited human-to-human" transmission. The conclusion of sustained human-to-human spread became difficult to ignore on Jan 17 when nCoV genomes from the two Thai travel cases that reported no market exposure showed the same limited genetic diversity. This genomic data represented one of the first and strongest indications of sustained epidemic spread. As this became clear to me, I spent the week of Jan 20 alerting every public health official I know.

At this moment there are 54 publicly shared viral genomes, with genomes being shared by public health and academic groups all over the world 3-6 days after sample collection. I can't overstate how remarkable this is and what an inflection point it is for the field of genomic epidemiology. Seasonal influenza had been far ahead of the general curve, but there we were still generally seeing a ~1 month turnaround from sample collection to genome in the best of circumstances. Getting to a 3-6 day turnaround opens up huge new avenues in epidemiology.

Since the first nCoV genome was shared on Jan 10, we've been tracking viral transmission and evolution on aiming to have ~1hr turnarounds from public deposition of genome data to inclusion in the live transmission tracking. We are also producing public situation reports describing what can be concluded from current genomic data. These reports have now been generously translated into 5 other languages by volunteers from Twitter. With groups all over the world working tirelessly to generate genomic data as rapidly as possible, I'm feeling a moral obligation to not hold up the analysis side. The entire Nextstrain team (shoutouts to Richard Neher, Emma Hodcroft, James Hadfield, Kairsten Fay, Thomas Sibley, Misja Ilcisin and Jover Lee 🙌) have come together to conduct analyses and tailor the platform for nCoV response. There's also been a remarkable amount of sharing of pre-publication analyses on and bioRxiv and scientific communication on Twitter. Although the situation is looking a bit dire at the moment, it's been humbling to see scientists from all over the world break down traditional barriers to rapid scientific progress.

Genomic epidemiological studies have been used in academic contexts to reconstruct regional transmission of Ebola during the West African outbreak, estimate when Zika came to Brazil, and investigate how seasonal influenza circulates around the world. But these types of studies have moved out of the ivory tower, and public health agencies regularly sequence and analyze whole pathogen genomes to support surveillance and epidemiologic investigations of foodborne diseases, tuberculosis, and influenza, among other pathogens. Indeed, almost every infectious disease program at the Centers for Disease Control and Prevention now uses pathogen genomics, with increasing adoption by state and local health departments as well.

Pathogen genomics is a great addition to the public health toolbox. However, genomic data is complex and needs transformation from its raw form prior to analysis. Increasing use of pathogen genomics will require that public health agencies invest in advanced computational infrastructure, develop a broader technical workforce, and investigate new approaches to integrated data management and stewardship. As the number of agencies with genomic surveillance capabilities grows we'll need a unified network of validated, reproducible ways to analyze data. The question then is how do we build that ecosystem?

In collaboration with the CDC's Office of Advanced Molecular Detection (OAMD) we've written a whitepaper describing ten recommendations for supporting open pathogen genomic analysis in public health settings, which we've just posted to (bioRxiv doesn't take editorial content such as this).

To get a sense of the current landscape of pathogen genomic analysis in public health agencies, including investigating challenges encountered and overcome, we conducted a series of long form interviews with public health practitioners who use pathogen genomic data. We spoke with various branches and divisions at CDC, as well as state public health labs in the United States, provincial public health labs in Canada, and representatives from the European CDC. In a concurrent effort, the Africa CDC investigated similar questions and assessed capabilities for building genomic surveillance across the African continent. We learned a lot from these interviews about what parts of genomic surveillance are working well in public health agencies, as well as areas that need to be improved. This information forms the basis of our proposals.

This paper is just the first step in what we hope is a community-based discussion and development effort of standards and tools for everything from databases to pipelines to data visualization capabilities. These community-based efforts will be guided and supported by the Public Health Alliance for Genomic Epidemiology (PHA4GE). Announced in October 2019, PHA4GE is a global coalition that is actively working to establish consensus standards; document and share best practices; improve the availability of critical bioinformatic tools and resources; and advocate for greater openness, interoperability, accessibility and reproducibility in public health microbial bioinformatics. If you're interested in joining in on this effort, please get in touch!

Our paper out today summarises twenty years of West Nile virus spread and evolution in the Americas visualised by Nextstrain, the result of a fantastic collaboration between multiple groups over the past couple of years. I wanted to give a bit of a backstory as to how we got here, how we’re using Nextstrain to tell stories, and where I see this kind of science going.

I’m not going to use this space to rephrase the content of the paper — it’s not a technical paper and is (I hope) easy to read and understand. The paper summarises all the available genomic data of WNV in the Americas, reconstructs the spread of the disease (westwards across North America with recent jumps into Central & South America), with each figure being a Nextstrain screenshot with a corresponding URL so that you can access an interactive, continually updated view of that same figure.

Instead I’d like to focus on how we used Nextstrain, and in particular its new narrative functionality, to present data in an innovative and updatable way. But first, what’s Nextstrain and how did this collaboration start?

How this all came about

Nextstrain has been up and running for around three years and is “an open-source project to harness the scientific and public health potential of pathogen genome data”. Nextstrain uses reproducible bioinformatics tooling (“augur”) and an innovative interactive visualisation platform (“auspice”) to allow us to provide continually updated views into the phylogenomics of various pathogens, all available on

Nate Grubaugh, who had just moved from Kristian Andersen’s group in San Diego to a P.I. position at Yale, was doing amazing work collecting, collaborating, and sequencing different arboviruses. Nate wanted to be able to continually share results from the WNV work, including the WestNile4k project, and Nextstrain provided the perfect tool for this: it’s fast, so analyses can be rerun whenever new data are available, and the results are available for everyone to see and interact with online. Nate, his postdoc Anderson Brito, and I set things up (all the steps to reproduce the analysis are on GitHub) and was born.

The proof is in the pudding and as a result of sharing continually updated data through Nextstrain, Nate had new collaborators reach out to him. The data they contributed helped to fill in the geographic coverage and improve our understanding of this disease’s spread.

Towards a new, interactive storytelling method of presenting results

Inspired by interactive visualisations and storytelling — which caused me to take a left-turn during my PhD — I wanted to allow scientists to use Nextstrain to tell stories about the data they were making available. I'm a big believer in Nextstrain’s mission to provide interactive views into the data (I helped to build it after all), but understanding what the data is telling you often requires considerable expertise in phylogenomics.

Nextstrain narratives allow short paragraphs of text to be “attached” to certain views of the data. By scrolling through the paragraphs you are presented with a story, allowing conveyance of the author’s interpretation and understanding of the data. At any time you can jump back to a “fully interactive” Nextstrain view & interrogate the data yourself.

So, the content of the paper we’ve just published is available as an interactive narrative at I encourage you to go and read it (by scrolling through each paragraph), interact with the underlying data (click “Explore the data yourself” in the top-right corner), and compare this to the paper we’ve just published.

WNV Narrative demo

We’re only beginning to scratch the surface of different ways to present scientific data & findings; see Bret Victor’s talks for a glimpse into the future. In a separate collaboration, we’ve been using narratives to provide situation-reports for the ongoing Ebola outbreak in the DRC every time new samples are sequenced, helping to bridge the gap between genomicists and epidemiologists. If you’re interested in writing a narrative for your data (or any data available on Nextstrain) then see this section of the auspice documentation.

A big thanks to all the amazing people involved in this collaboration, especially Anderson & Nate, as well as Trevor Bedford & Colin Megill for help in designing the narratives interface.

I've been remiss for the past year about posting our biannual flu report publicly. However, we've now posted our Sep 2019 flu report to bioRxiv, where it details recent seasonal influenza evolution during 2019 and projections for spread over the next 12 months to Sep 2020. Our timing with this report is designed to correspond to the timing of the World Health Organization's Vaccine Composition Meeting being held this week in Geneva. Richard Neher has led much of this analysis, with John Huddleston providing fitness model projections and Barney Potter contributing to data curation.

With each of the reports, we generally end up focusing on a handful of emerging clades within each influenza lineage and tracking their rate of global spread and viral characteristics. In one current example, H3N2 viruses have diversified into a large number of competing lineages; however, over the course of 2019 we've seen the emergence and spread of A1b/197R viruses as well as A1b/137F viruses. Over the course of the past ~9 months these clades have grown from nearly 0% global frequency to a combined >50% global frequency. Previously, Richard and colleagues had identified local branching index (LBI) as a strong predictor of future strain success. The idea is basically that clades that are currently outcompeting their relatives are estimated to be higher fitness and so are predicted to continue to increase in frequency into the future. In previous reports, we've used LBI to project which clades will come to dominate.

More recently, John has sought to build a fitness model that makes quantitative predictions of clade frequencies based on LBI as well as viral characteristics. There is some description of this model in the September report. We're hoping to have a preprint and source code shared shortly. However, we've now elected to start including live model predictions for H3N2 at The bottom panel shows frequencies of different clades up to present as well as a forecast over the following 12 months:


Here, it's clear that the model follows LBI in predicting the further growth of 197R viruses. Additionally, in the "color by" dropdown menu you can now select "fitness" to show fitness estimates for each virus and also select "distance to future population" to show amino acid match of sampled viruses to the predicted future population.
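To illustrate what "projecting clade frequencies from fitness" means in general terms, here is a minimal sketch of a generic exponential-growth projection, in which each clade's frequency is scaled by the exponential of its fitness and the result is renormalized. This is not the model described in the report (its source code had not yet been shared at the time of writing); the clade frequencies and fitness values below are invented for illustration.

```python
import math

def project_frequencies(frequencies, fitness, years):
    """Project clade frequencies forward, assuming constant per-clade fitness."""
    weights = {
        clade: freq * math.exp(fitness[clade] * years)
        for clade, freq in frequencies.items()
    }
    total = sum(weights.values())
    # Renormalize so that projected frequencies sum to one.
    return {clade: weight / total for clade, weight in weights.items()}

# Hypothetical current frequencies and fitness estimates (not real data).
current = {"A1b/197R": 0.30, "A1b/137F": 0.25, "other": 0.45}
fitness = {"A1b/197R": 1.0, "A1b/137F": 0.5, "other": 0.0}

forecast = project_frequencies(current, fitness, years=1.0)
for clade, freq in forecast.items():
    print(f"{clade}: {current[clade]:.2f} -> {freq:.2f}")
```

In a sketch like this, a clade grows in frequency only if its fitness is above the population average, which is the intuition behind using fitness estimates to forecast which clades will expand.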

These forecasts will now be made automatically alongside our weekly site updates.

We have a new preprint up on bioRxiv describing within-host evolution of H5N1 avian influenza viruses sampled from humans and domestic ducks in Cambodia!

Why should we care about avian flu in Cambodia?

We've been collaborating with the Institut Pasteur du Cambodge (IPC) to try to understand how H5N1 avian influenza viruses evolve during cross-species transmission. H5N1 viruses are highly pathogenic avian influenza viruses that naturally circulate in aquatic birds, but can cross the species barrier and cause spillover infections in humans. Although H5N1 viruses aren't currently capable of transmitting among humans efficiently, laboratory studies suggest that only a few mutations might be required to render them human-adapted. Influenza viruses generate lots of genetic diversity within a single infected host, leading to concern that continued spillover infection might one day facilitate human adaptation. Unfortunately, assessing cross-species transmission risk is really difficult, and the data we have currently comes from animal experiments and modelling studies. Because spillover infection is rare, it has been difficult to study how H5N1 viruses might evolve during natural infection, in either humans or birds.

H5 avian influenza viruses are endemic in Cambodia, and are frequently detected in domestic birds in live bird markets throughout the country. The Institut Pasteur du Cambodge conducts regular poultry market surveillance and outbreak investigation for avian influenza viruses, making it an incredible resource for studying avian influenza virus circulation and evolution. IPC and collaborators in China previously generated deep sequence data from a unique dataset of 8 humans and 5 domestic ducks infected with H5N1 and sampled in Cambodia between 2010 and 2014. This dataset provided a great opportunity to examine whether human adaptation occurs during natural spillover infection. Although a couple other studies have looked at within-host diversity in infected humans, data from infected poultry has been more difficult to come by. Because this dataset also included data from infected poultry collected in the same geographic location and time, we could compare the evolutionary patterns we observed in humans to those in birds.

What can within-host diversity tell us about the potential for H5N1 to adapt to humans?

When we compared within-host evolution in these two hosts, we found that virus populations in both humans and ducks are composed mostly of low-frequency variation (present in <10% of the population) that is shaped heavily by purifying selection, genetic drift, and demography. This is important because we didn't see strong signatures of rampant positive selection in humans. However, we did detect a few putative human-adapting mutations in multiple, independent humans. Two human samples contained an E627K mutation in the polymerase subunit PB2, a well-known marker of mammalian adaptation that has been repeatedly shown to improve human replication in animal and cell culture models. We also found mutations in the receptor binding protein, HA, that have been phenotypically linked to improved human receptor binding. Two humans harbored an A150V mutation within-host, which contributes to receptor binding and was also identified in H5N1-infected humans in Vietnam, while 2 others harbored an HA Q238L, a mutation identified in ferret transmission studies as a determinant of human receptor binding and transmission. These results show that H5N1 viruses have the capacity to generate known markers of human adaptation during natural spillover infection. This is important because it suggests that molecular markers identified in laboratory studies also evolve in nature, at least in this genetic backbone, and may be useful for surveillance.

within-host SNVs

We next wanted to determine whether there were other mutations within-host that might be human-adaptive. To test this, we generated phylogenetic trees for all currently available H5N1 sequences and queried whether mutations we found in our dataset were enriched along branches leading to human infections. This analysis showed that both PB2 E627K and HA A150V were heavily enriched on phylogenetic branches leading to human infections, suggesting that they are likely human-adapting. However, we also found that about half of the mutations detected in our dataset are never detected on the H5N1 phylogeny. This suggests that a fraction of the variation generated within-host is likely deleterious, and purged from the H5N1 population over time.
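One simple way to think about "enrichment" is as a one-sided hypergeometric test: if a mutation were scattered across branches at random, how often would it land on branches leading to human infections at least as many times as observed? The sketch below illustrates that idea only; it is not necessarily the test used in the manuscript, and the branch counts are hypothetical.

```python
from math import comb

def enrichment_p_value(hits, with_mutation, human_branches, total_branches):
    """P(X >= hits) when `human_branches` are drawn at random from
    `total_branches`, of which `with_mutation` carry the mutation."""
    upper = min(with_mutation, human_branches)
    favourable = sum(
        comb(with_mutation, i)
        * comb(total_branches - with_mutation, human_branches - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(total_branches, human_branches)

# Hypothetical counts: 1000 branches in the tree, 50 lead to human infections,
# 20 carry the mutation, and 8 of those 20 are on human-leading branches.
p_value = enrichment_p_value(
    hits=8, with_mutation=20, human_branches=50, total_branches=1000
)
print(f"One-sided enrichment p-value: {p_value:.2e}")
```

A small p-value here would say the mutation sits on human-leading branches more often than chance placement would predict, although shared ancestry between branches means a real analysis needs more care than this independence assumption allows.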

within-host SNVs

What we learned and open questions

By studying within-host diversity, we were able to learn a few important things from this dataset. The first is that H5N1 viruses have very clear potential to generate human-adapting mutations within-host. The fact that we identify previously validated markers of mammalian adaptation and identify mutations that are enriched on spillover branches in nature supports this. Importantly though, all of the putative human-adapting mutations we found remained at low frequencies in our samples, despite 5-14 days of infection. Our data therefore also underscore that even mutations that have been hypothesized to be strongly beneficial (PB2 E627K and HA Q238L) may remain at low frequencies in vivo. This suggests that factors like purifying selection, randomness, and short infection times counteract the adaptive potential of H5N1 viruses to evolve during any individual spillover infection. Although this result is somewhat nuanced, it makes sense given what we know about avian influenza. While animal experiments suggest that human transmissibility should be easy to evolve, H5N1 has never actually done so in nature. Although H5N1 has clear potential to evolve within-host, a combination of purifying selection, randomness, and epistasis likely restricts its ability to evolve extensively during a single infection.

This study was small and only examined two H5N1 genetic backbones, so there are lots of open questions that remain. How would the patterns we observe in this data compare to spillover infections with other genetic backbones? Would our findings in poultry be the same if we had access to hundreds of samples, over many years of sample collection? Are there other mutations that elicit host-adapting phenotypes that are yet undiscovered? Are certain viral backbones more conducive to human adaptation than others? What environmental factors contribute to spillover? These are all challenging, open questions that I hope we can answer one day.

To look at the data and analyses...

If you're interested in checking out how we did any of this or looking at the data yourself, all of the code for the figures and analysis of data described in the manuscript is freely available on GitHub. All of the raw sequence data is available from the SRA under accession number PRJNA547644, and the bioinformatic pipeline used to process the raw FASTQ files is available here. You can also find other useful data files, like the within-host variant calls and phylogenetic trees, in the GitHub repo.

This has been an incredible opportunity to work with a large group of collaborators from all across the world to answer some interesting questions about avian influenza evolution. I'd like to give a special thank you to Paul Horwood, Philippe Dussart, Philippe Buchy, Erik Karlsson, Srey Viseth Horm and Sareth Rith from Institut Pasteur for getting this project off the ground, and for all of the amazing work they are doing for avian influenza surveillance in Cambodia. Thank you to Lifeng Li, Yongmei Liu, Huachen Zhu, and Yi Guan for generating the original sequence data and for sharing it with us. And of course, a huge thank you to Tom Friedrich and Trevor Bedford for working with me on this project during my transition between labs.

If you are reading this blog, you are probably already onboard with releasing data openly as it's generated. You may even have led the charge in getting other researchers and/or journals to be more open with protocols, data, and analyses. You probably don't need a reminder of why open data sharing is awesome, but sometimes it's nice to take stock and remember that what we're pushing for has real tangible value.

Louise has been investigating the genomic epidemiology of the mumps outbreak in Washington, and I've been helping out a bit too. If you want some big picture details about the project, you can read more about it here and here. As part of this project, Louise has been managing Nextstrain Mumps. Recently, Patrick Stapleton from Public Health Ontario shared his sequences with us (thank you!!). Louise rebuilt Nextstrain with them, and we were totally struck by how much context the Ontario viruses provided for one of our "one-off" Washingtonian sequences.

Take a look at the image with the before and after. Before, we see that we have a couple of viruses that don't nest within the primary Washington outbreak clade. They are most closely related to Canadian viruses (Manitoba, BC), but those branches are pretty long, an indicator that there's unsampled transmission occurring. This crops up not infrequently in sparse datasets, but I still always find myself wanting to know where this transmission chain was circulating before it pops up on our radar. Importantly, this isn't just about my curiosity; knowing where importations come from, and their frequency, is important for tailoring surveillance efforts and designing or evaluating infection control measures.

Here we get lucky on two counts: 1) other people are sequencing mumps, and 2) they like sharing data! With the Ontario viruses included in the tree, we see that Washington.USA/2017321 is a very clear introduction of mumps from Ontario into Washington. Given the high genetic similarity between this Washington strain and the viruses from Ontario, it seems pretty likely that this was a direct introduction or perhaps a very recent introduction followed by a short transmission chain.

This may seem trivial, but you can play through some different scenarios to show that it's not. Before, we don't really have any idea what is going on with Washington.USA/2017321. We might ask, does this strain represent a tiny chunk of a lot of transmission that is going unobserved? If so, do we need to ramp up surveillance in a particular population? With just a bit of context we realize that no, we don't need to pour a whole bunch of resources into figuring out what's going on here. We have a travel-associated introduction, and while it's good to follow up on close contacts, we probably don't need to take significant resources away from another cluster to look into this one.
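To make the "before and after" reasoning concrete, here is a minimal toy sketch of how adding context sequences can shrink the distance between a strain and its closest sampled relative. Everything in it is an assumption for illustration: the 12-letter "genomes" and their names are made up, and plain mismatch counting stands in for the phylogenetic inference that Nextstrain actually performs.

```python
def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest(query, references):
    """Return (name, distance) for the reference closest to the query."""
    distances = ((name, hamming(query, seq)) for name, seq in references.items())
    return min(distances, key=lambda pair: pair[1])


# Hypothetical toy sequences, not real mumps data.
query = "ATTGCCGATACA"  # stands in for the "one-off" Washington strain

before = {
    "manitoba_toy": "ATTGACGTTACG",
    "bc_toy": "GTTGCCGTTCCG",
}

# The same dataset after a collaborator shares one more sequence.
after = dict(before, ontario_toy="ATTGCCGATACG")

closest_before = nearest(query, before)
closest_after = nearest(query, after)

print("before:", closest_before)
print("after:", closest_after)
```

In this toy setup the closest relative moves from three differences away to one once the shared sequence is included, which is the same qualitative pattern as a long branch collapsing into a short one in the tree.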

This is such a clear example of how much more value we can get out of genomic surveillance when we pool our data. Other people's sequences provide context for our own. With this project, we've been incredibly fortunate to have lots of people sharing sequences with us. Many thanks to Jenn Gardy and Jeff Joy in British Columbia, Shirlee Wohl in Massachusetts, and Patrick Stapleton in Ontario. And of course thank you to all the authors who have put sequences up openly on GenBank. The mumps phylogeny would look ridiculous without you.

Positions for a full-stack developer and bioinformatician are available immediately in the Bedford lab at the Fred Hutch to work on an open-source genomic epidemiology research platform enabling a large-scale study of respiratory illness in Seattle.

In collaboration with groups at the Fred Hutch, the University of Washington, Seattle Children's, the Institute for Disease Modeling, and area hospitals, we're embarking on a high-resolution, multi-year study of influenza and other respiratory illnesses in Seattle. Through the study, thousands of influenza and other respiratory pathogens will be sampled and sequenced in near-real time, and transmission dynamics will be uncovered from these viral genome sequences. The primary task of both positions will be to develop the information and analysis systems underlying the study's research aims, with potential expansion to new studies around the world in the future. This new platform will build upon the software behind Nextstrain, an award-winning tool for tracking infectious disease epidemics developed in collaboration with the Neher lab at the University of Basel. Nextstrain won the Open Science Prize in Feb 2017 and has already been instrumental in the analysis of Ebola spread in West Africa and Zika spread in the Americas, and it is used by the World Health Organization to aid in the process of influenza vaccine strain selection.

The envisioned software platform will ingest subject and sample metadata, lab assay results, and raw and processed genome data, and will provide access to this curated data to streamline downstream analysis. This platform will be open-source, adaptable, and designed with future outbreak surveillance in mind, regardless of whether the targeted pathogen is viral, bacterial, or eukaryotic in nature. On top of this data warehousing platform, we will deploy analytic pipelines that align sequences, build phylogenies, and reconstruct city-scale transmission chains.

The ideal candidate for the full-stack developer position would have expertise in web development, relational database systems, cloud infrastructure, and software engineering best practices. Experience working with genomic data or in systems integration is a plus but not a requirement.

The ideal candidate for the bioinformatician position would have expertise in genomics, molecular biology, pipeline automation, and software development practices. Experience working with cloud infrastructure and web technology is a plus but not a requirement.

Candidates for both positions should be fluent in at least one high-level programming language, such as Python, JavaScript, Ruby, or Perl. Candidates should also have excellent written and verbal communication skills. Interfacing with project collaborators in person and online is a key aspect of both positions. Both positions will work within a small team of existing members of the Bedford lab. If you think you might be a great fit for either position but are concerned about meeting all qualifications, we'd like to hear from you.

The Fred Hutch is located in South Lake Union in Seattle, WA and offers a dynamic work environment with cutting-edge science and computational resources. The position is available immediately with flexible starting dates. Informal inquiries are welcome. Applications will be accepted until the position is filled. We offer a competitive salary commensurate with skills and experience, along with benefits. The Fred Hutch and the Bedford lab are committed to improving diversity in the computational sciences. Applicants of diverse backgrounds are particularly encouraged to apply. Depending on the applicant, this position could be filled by a full-time salaried employee, a part-time employee, or a contracted consultant.

For more information about the lab, please see our website at To apply for the position please send (a) your current resume, (b) code samples or links to published/distributed code you've written, and (c) contact information for two references to