bedford lab / blog

Postdoc in applying sequence language models to understand and forecast evolution

10 Apr 2025 by Trevor Bedford

We have an opening for a postdoc in the Bedford Lab at the Fred Hutch Cancer Center to work on developing and applying DNA and protein language models to understand and forecast viral evolution.

The Bedford Lab has worked extensively in the field of viral evolutionary forecasting. In this context, we’ve developed models to estimate fitness of seasonal influenza variants from genetic sequence data and to then use fitness estimates to forecast variant frequencies (Huddleston et al., eLife, 2020). We’ve taken a similar approach to forecasting SARS-CoV-2 variants, applying multinomial logistic regression (MLR) to estimate variant fitness and to project frequencies forward in time (Abousamra et al., PLoS Comput Biol, 2024). This approach underlies our live SARS-CoV-2 evolutionary forecasts at nextstrain.org/sars-cov-2/forecasts. Our influenza forecasts are used directly by the World Health Organization in the twice-yearly vaccine strain selection meetings for seasonal influenza.
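As a rough illustration of the MLR idea mentioned above, the sketch below projects variant frequencies forward in time when each variant has a fixed relative growth rate. It is a minimal sketch, not the lab's forecasting code: the function name `project_frequencies`, the variant labels, and all numeric values are hypothetical.

```python
import math


def project_frequencies(initial_freqs, growth_rates, days):
    """Project variant frequencies forward under a multinomial logistic model.

    Each variant's frequency is proportional to f(0) * exp(r * t), and the
    values are renormalized to sum to 1 (a softmax over log f(0) + r * t).
    """
    logits = {
        variant: math.log(initial_freqs[variant]) + growth_rates[variant] * days
        for variant in initial_freqs
    }
    # Subtract the largest logit before exponentiating for numerical stability.
    max_logit = max(logits.values())
    weights = {variant: math.exp(x - max_logit) for variant, x in logits.items()}
    total = sum(weights.values())
    return {variant: w / total for variant, w in weights.items()}


# Hypothetical starting frequencies and per-day relative growth rates.
initial = {"variant_A": 0.9, "variant_B": 0.1}
rates = {"variant_A": 0.0, "variant_B": 0.05}
projected = project_frequencies(initial, rates, days=30)
```

In this toy setup, the variant with the higher growth rate gains frequency at the expense of the other. A real analysis would estimate the growth rates from sequence counts over time.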

Recent advances in deep learning, especially transformer-based language models for protein sequences (see ESM3) and DNA sequences (see Evo2), present exciting new avenues to enhance evolutionary predictions. These models, trained to predict residues or nucleotides based on sequence context, have potential to significantly improve predictions of variant fitness and evolution.

In this role, you’ll initially focus on incorporating state-of-the-art language models to assess and predict the fitness of influenza and SARS-CoV-2 variants, comparing these predictions to our established statistical models. A key aim is to leverage these advanced models to provide deeper insights than traditional “mutational load” metrics, which simply count the number of amino acid changes. Additionally, you will explore how embedding spaces derived from these language models could offer new perspectives on evolutionary processes (see Hie et al. for an example of looking at semantic change via embedding). Beyond applying existing language model frameworks, you’ll have opportunities to design novel model architectures to describe the process of sequence evolution.

The ideal candidate will have experience working with deep learning models via PyTorch or other frameworks. However, candidates with more traditional experience in sequence data and phylogenetic approaches who are excited to dive into deep learning models are also strongly encouraged to apply. Candidates should have experience in at least one programming language and a proven track record of peer-reviewed publications. A quantitative background is essential, though PhDs from diverse fields including biology, mathematics, statistics, physics and computer science are welcome. The Fred Hutch is an equal opportunity employer committed to workforce diversity. Applicants of diverse backgrounds are particularly encouraged to apply.

The position is available immediately with flexible starting dates. Informal inquiries are welcome. Applications will be accepted until the position is filled. Fred Hutch offers competitive compensation and benefits packages.

To apply, please submit:

a cover letter that includes the names and contacts for three references and a short statement of research interests
a current CV
code samples or links to code on GitHub

Send application materials or inquiries to tbobfuscate@bedford.io.

This is a general area of interest for the lab. If you’re interested in deep learning for biological sequence data and are not at the postdoc career stage, please still get in touch.

Transmission at the population level from identical sequences

5 Mar 2025 by Cécile Tran Kiem

Our work on analyzing patterns of occurrence of pairs of identical sequences between population groups was just published in Nature! There, we present a new method developed to characterize transmission at the population level by analyzing the groups (e.g. age groups, geographies) in which pairs of identical sequences are collected. The rationale for our approach is explained in Figure 1. As mutations accumulate in pathogen genomes over successive transmission generations, epidemiologically linked individuals are infected by genetically similar pathogens. Among these genetically close pathogens, identical sequences capture the most epidemiologically linked individuals. Intuitively, if transmission is frequent between regions A and B, we expect to observe many pairs of identical sequences between these two regions.

Figure 1. The clustering of identical pathogen sequences across population groups reflects underlying disease transmission patterns at the population level and can be used to characterize patterns of spread between groups. In this toy figure, each color represents a different cluster of identical sequences.

We designed a relative risk (RR) metric that quantifies how many pairs of identical sequences are observed between two population groups relative to the number expected given how often each group appears among all pairs of identical sequences. To apply our RR framework, we used a large SARS-CoV-2 sequence dataset from Washington State (WA) genomic sentinel surveillance (more than 114,000 sequences with matched metadata that include age and home location information). We found that occurrence patterns of identical sequences between counties are consistent with local spread (with identical sequences being particularly enriched in pairs observed within the same county or between geographically close counties), while also being imprinted by the geographical structure at the state level. When comparing the RR of observing identical sequences between counties with the RR of movement between counties (estimated from mobile phone and commuting mobility data), we found strong agreement between these two data sources. We also investigated outliers in the relationship between sequence and mobility data, which we were able to link with transmission between postal codes with male state prisons.
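To make the idea of an RR for identical sequences more concrete, here is a minimal sketch of one way such a ratio could be computed. The formula (observed pairs between two groups, divided by the count expected from each group's overall share of pairs) is an assumption made for illustration and may differ from the exact definition in the paper. The function names and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def identical_pair_counts(records):
    """Count ordered pairs of identical sequences between groups.

    `records` is an iterable of (cluster_id, group) tuples, where sequences
    sharing a cluster_id are identical to each other.
    """
    per_cluster = defaultdict(Counter)
    for cluster_id, group in records:
        per_cluster[cluster_id][group] += 1

    pairs = Counter()
    for counts in per_cluster.values():
        for a, n_a in counts.items():
            for b, n_b in counts.items():
                # Within a group, a sequence cannot pair with itself.
                pairs[(a, b)] += n_a * (n_a - 1) if a == b else n_a * n_b
    return pairs


def relative_risk(pairs, a, b):
    """Observed pairs between a and b relative to the expectation from margins."""
    total = sum(pairs.values())
    row_a = sum(n for (x, _), n in pairs.items() if x == a)
    col_b = sum(n for (_, y), n in pairs.items() if y == b)
    if total == 0 or row_a == 0 or col_b == 0:
        return float("nan")
    return pairs[(a, b)] * total / (row_a * col_b)


# Toy data: three clusters of identical sequences spread over groups A, B, C.
records = [
    ("c1", "A"), ("c1", "A"), ("c1", "B"),
    ("c2", "B"), ("c2", "B"),
    ("c3", "A"), ("c3", "C"),
]
pairs = identical_pair_counts(records)
rr_ab = relative_risk(pairs, "A", "B")
rr_ac = relative_risk(pairs, "A", "C")
```

Under this definition, a value above 1 means that identical sequences are shared between the two groups more often than their overall pair counts would suggest.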

We additionally looked at occurrence patterns of identical sequences between age groups, which we found to be highly consistent with expectations from social contact data. We found that transmission patterns between age groups differed across spatial scales. For example, identical sequences had an increased risk of being observed within the same age groups at shorter geographic distances, suggesting that these types of transmission events tend to occur at the local level.

The last decade has propelled us to a new era in terms of pathogen sequence dataset size, but existing phylogeographic approaches tend to be computationally very costly and hence unable to fully leverage these large datasets. We hope our approach is a helpful contribution to the study of these very large datasets by circumventing the need to infer a phylogenetic tree and directly relying on identical sequences. In earlier work, we had already leveraged these identical sequence clusters to infer the reproduction number and transmission heterogeneity from their size distribution. These two pieces of work highlight the potential for methods studying genetically proximal sequences in uncovering both key transmissibility parameters and transmission patterns between population groups.

Automated maps of seasonal flu and SARS-CoV-2 viruses show important evolutionary groups

20 Nov 2024 by John Huddleston

Why do we get sick from the flu or SARS-CoV-2 so many times in our lives?

As I write this, I’m getting over a week-long cold caused by some virus that probably wasn’t SARS-CoV-2 (the only virus I can test for at home). The odds are good that it was a type of virus like the seasonal flu that has infected me before and that has now managed to escape my existing immunity. This kind of reinfection happens all of the time. Viruses exist only because they succeed in accomplishing two main goals (Figure 1):

make more copies of themselves
transmit from one host to another

**Figure 1. Viruses have two goals: make more copies and infect new hosts.** Each larger orange circle represents a single copy of a virus.

When a virus infects us, it makes many more copies of itself with a pretty terrible copy machine that makes mistakes or “mutations” with each copy. The new mutated copies are still close enough to the original to be considered the same type of virus (like seasonal flu) but different enough that our immune systems may not recognize them.

When we sneeze or cough in an elevator and transmit one of those mutated copies to someone else, the copy could look different enough to that person’s immune system that the virus can infect them again, make more copies of itself with more mutations, and then transmit again to someone new. For a prettier visual explanation of this process, check out Jonathan Corum’s and Carl Zimmer’s beautiful article about how coronavirus mutates and spreads.

What can we learn about mutations we find in viruses?

As a virus researcher in Trevor Bedford’s lab at the Fred Hutchinson Cancer Center, I spend a lot of time thinking about these viral mutations. For example, when we find a lot of seasonal flu viruses with the same mutation that allows those viruses to reinfect a lot of people in the world, we can usually track that mutation back to a single common ancestor of all those recently successful virus copies. For SARS-CoV-2, these groups of successful virus copies tend to get names like “Delta” or “Omicron” or “JN.1”.

Most of the time, we can use the collection of mutations that each virus has to build a family tree of all the virus copies we’ve observed in the world. These virus trees work because we assume that each new virus copy descended from a single parent copy. When we see the same mutations in two copies of a virus, we can calculate the chance that they came from the same parent (Figure 2). These family trees of viruses show us which common ancestors of recent viruses were the most successful and which mutations were associated with that success. Virus researchers use this kind of information to decide whether enough mutations have occurred to require an update to vaccines like the seasonal flu or SARS-CoV-2 vaccines.

**Figure 2. An example virus family tree (left) inferred from the mutations found in each virus (colored circles on the right).** Pairs of viruses that share the same mutations are more likely to have a common ancestor, as shown by the corresponding colored circles on the branches of the family tree leading to those viruses. To learn more about this subject, see the Nextstrain guide to interpreting these types of trees.

Unfortunately, it is possible to get infected by multiple copies of the same type of virus at the same time. When this infection by multiple copies happens, the different infecting virus copies can make new copies of themselves in the same place in our bodies and accidentally include bits of each other in the new copies. These bigger changes in the new virus copies break the rules that allow us to make virus family trees and they happen often enough that researchers have spent a lot of time making new computational tools to make family trees for viruses that have multiple parents.

In the Bedford lab, we recently stumbled on a new approach to find groups of virus copies that share the same mutations no matter how many parents they have and without building a family tree at all. This approach was a long time in the making, though, and started in July 2019 when a rising junior in high school, Sravani Nanduri, joined the lab for a 2-month summer internship under the joint mentorship of Alli Black and myself. Sravani already knew how to write computer programs, but she wanted to learn more about programming and data visualization for biology.

Her internship project came from an idea Trevor had: what if, instead of building family trees of viruses based on their shared mutations, we could put viruses on a two-dimensional map where the distances between each pair of viruses reflected the mutations that differed between them?

We had a lot of questions for a 2-month internship project: How would we build these maps? Would the same groups of viruses we see in a tree place together in the maps? What would the distance between any two virus copies actually mean on one of these maps? How would we visualize these maps? What would be the most fun bits of this project for Sravani to work on? Sravani, Alli, Trevor, and I ended up sketching out the following example of what a final visualization would be for the project (Figure 3), with the idea that Sravani would apply a couple of well-known methods to one type of virus and plot the resulting maps for each method alongside the tree of the same virus copies.

Figure 3. The original whiteboard sketch of Sravani's summer internship project showing the family tree of a single type of virus (top left) and sketches of what maps from different methods might look like including PCA, t-SNE, and UMAP. We wanted this figure to be interactive, so viewers could select viruses in one panel to highlight their corresponding positions in other panels.

To make the project more interesting from a data science perspective, we agreed that the visualization should be interactive, so we could select viruses in the tree or one of the maps and the same viruses would get highlighted in the other panels of the figure.

Over 2 months, Sravani learned how to:

work with virus mutation data
build virus trees from mutation data
calculate distances between pairs of virus copies based on their mutations
make two-dimensional maps from mutation data using methods with exciting names like principal components analysis (PCA), multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP)
plot trees and maps in a single interactive figure that allowed us to highlight bits of the tree or a map and see the same viruses in the other parts of the figure
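The distance step in the list above can be sketched in a few lines. This is a minimal illustration that assumes the sequences are already aligned to the same length and that simply counts differing positions (a Hamming distance). The function names, the choice to skip ambiguous characters, and the toy sequences are hypothetical, not the project's actual code.

```python
from itertools import combinations


def hamming_distance(seq_a, seq_b, ignored="N-"):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has an ignored character (here, "N" or a
    gap) are skipped rather than counted as mutations.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(seq_a, seq_b)
        if x != y and x not in ignored and y not in ignored
    )


def distance_matrix(sequences):
    """Return sequence names and a symmetric matrix of pairwise distances."""
    names = list(sequences)
    n = len(names)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = hamming_distance(sequences[names[i]], sequences[names[j]])
        matrix[i][j] = matrix[j][i] = d
    return names, matrix


# Toy aligned sequences (hypothetical data).
sequences = {
    "virus_1": "ACGTACGT",
    "virus_2": "ACGTACGA",
    "virus_3": "TCGTACGA",
    "virus_4": "ACGNACGT",
}
names, matrix = distance_matrix(sequences)
```

A matrix like this is the kind of input that map-making methods such as MDS, t-SNE, and UMAP can work from when they are given precomputed distances.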

By August 2019, Sravani had made the prototype below (Figure 4) from mutations in a type of seasonal flu called “H3N2”, which causes the most hospitalizations and deaths each year.

Figure 4. Static view of Sravani's final internship prototype showing individual viruses in a family tree (top left) and corresponding positions of the same viruses in maps based on flu mutations including PCA (top middle), MDS (top right), t-SNE (bottom left), and UMAP (bottom right). Viruses in two specific groups from the tree (blue and orange) have been selected to show how their placement in the tree compares to their placement in the maps.

The prototype revealed some interesting patterns:

Most of the maps placed pairs of viruses with the same mutations closer together than pairs with different mutations.
Some of the maps (like MDS’s) actually acted like a real map, with the distance between viruses on the map closely matching the number of mutations that differed between those viruses.
Other maps (like t-SNE’s) didn’t act like real maps, but they tightly clustered similar viruses into groups in the same space where we could easily find those groups by eye.
The groups of viruses in these maps often matched the groups we had already defined in the tree.

Sravani and I were excited enough about these results to agree that we should keep this project going a little longer. In October 2019, we decided to meet once a month while Sravani refined the prototype above and drafted a short summary of the results in the form of a scientific paper that we could post online somewhere.

Sravani and I continued to meet monthly through the beginning of the SARS-CoV-2 pandemic. During that time, she learned how to write a scientific paper, wrote the first full draft of a paper, and referenced this work in her college applications. By June 2023, we’d both been busy with other projects. Sravani had been focused on class work as an undergraduate in the University of Washington’s Computer Science program. I had been working with the Nextstrain team on pandemic response efforts. Despite our other commitments, Sravani was eager to revise the original paper and publish it in a scientific journal.

We decided to focus on two viruses (seasonal influenza H3N2 and SARS-CoV-2) and the original four methods of making maps (PCA, MDS, t-SNE, and UMAP). We wanted to measure how well the groups of viruses that we found in these maps matched the groups from family trees that experts had already identified for flu and SARS-CoV-2. We found that groups from t-SNE quite closely matched the expert group definitions for both flu and SARS-CoV-2, as shown by the figure below where flu viruses are colored by their expert-assigned groups (Figure 5).

**Figure 5. Flu family tree (top) and maps from H3N2 HA viruses based on PCA (middle left), MDS (middle right), t-SNE (bottom left) and UMAP (bottom right).** Viruses are colored by their genetic group assigned by experts. Viruses that place together in these groups from the family tree also tend to place together in the maps from different methods.

These results suggested that we could use these maps of viral mutations to automatically define new, meaningful groups of viruses that could be reviewed by experts instead of requiring experts to manually define these groups. This result was surprising because the methods we use to make these maps have no understanding of virus evolution; they only have a sense of how many mutations are shared or not between pairs of viruses.

We also realized we could make maps from viruses that had multiple parents even when the standard methods to build family trees wouldn’t work. For example, each flu virus is made up of 8 separate pieces that need to get bundled together to make a complete virus. When we get infected by a single flu virus, that virus will make copies of all 8 pieces and its child viruses will get those copies from the same parent. When we get infected by more than one flu virus at the same time, those viruses can accidentally swap some of their 8 pieces such that parts of their child viruses come from different parents. (Scientists call this swapping process “reassortment”.) This accidental swapping of viral pieces means that we normally have to make separate family trees for each of the 8 pieces because the methods to make family trees assume that each virus piece comes from a single parent. To build a family tree that allows for multiple parents, researchers have developed more sophisticated methods that try to work out which of the 8 pieces for each virus belong to which parent.

The map methods we used in this project didn’t know anything about virus biology and didn’t make any assumptions about how many parents each virus had. As a result, we figured we could easily build maps from multiple viral pieces at once to find meaningful groups that would otherwise require more complicated methods to find. To test this idea, we used a newly developed method, TreeKnit, written by Pierre Barrat-Charlaix and Richard Neher that uses the theoretical concepts of virus evolution to make family trees of seasonal flu that allow each virus to have more than one parent. This method requires us to make a separate family tree for each viral piece and then it finds the groups of viruses that most likely have the same parents across all viral pieces. Figure 6 below shows an example output for two pieces of seasonal flu. The family tree on the left is for a piece called HA and on the right is a piece called NA. The lines connect the same viruses in the left tree to the right tree. The colors show the groups that TreeKnit calculated as most likely descending from the same parent for both pieces.

**Figure 6. Family trees of two seasonal flu virus pieces including "HA" on the left and "NA" on the right.** Lines connect the same viruses in the left and right trees. The colors indicate groups of viruses that TreeKnit identified as likely descending from the same parents for both HA and NA.

Next, we made maps for the seasonal flu pieces HA and NA, automatically found groups in each map, and calculated the distance between the groups we found and the groups from TreeKnit. We found that the groups from these simple map-based methods often closely matched the groups found by the more sophisticated TreeKnit program, with t-SNE groups being especially good (Figure 7). These results suggested that we could use these simple methods to find meaningful groups of viruses using information from all viral pieces.
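One way to put a number on the distance between two sets of groups is the variation of information between two labelings of the same viruses. The sketch below is an illustrative assumption: this post does not say which distance was used, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information between two groupings of the same items.

    Returns 0 when the groupings are identical and grows as they disagree.
    Computed as H(A|B) + H(B|A) using natural logarithms.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labelings must be non-empty and of equal length")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        vi -= p_ab * (math.log(p_ab / p_a) + math.log(p_ab / p_b))
    return vi


# Toy example: groups found in a map vs. reference groups for six viruses.
map_groups = ["g1", "g1", "g1", "g2", "g2", "g2"]
reference_groups = ["x", "x", "x", "y", "y", "y"]
same = variation_of_information(map_groups, reference_groups)

# Two groupings that carry no information about each other.
independent = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the measure depends only on which items are grouped together, the group names themselves do not need to match between the two labelings.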

**Figure 7. Family tree of seasonal flu's HA and maps for seasonal flu pieces HA and NA.** Colors show the groups found by TreeKnit to likely descend from the same parent across both HA and NA pieces. Despite knowing nothing about virus biology, the map methods place viruses from the same parents close together and into similar groups as the more sophisticated TreeKnit method that does know about virus biology.

Five years after starting this project, Sravani is now a senior in University of Washington’s Computer Science program. She has presented her work on this project at her first international research conference in Italy, and she has published this work in her first lead-author scientific manuscript in the journal Virus Evolution. We now routinely make maps of seasonal flu viruses in our weekly Nextstrain analyses (for example, see today’s results for H3N2) to look for new groups of viruses that might become more successful at infecting people. We have also begun to apply these maps to recent flu viruses collected from birds and cows where viruses with multiple parents tend to be better at jumping into new hosts. We still have a lot of questions about how to apply these maps to different viruses or bigger datasets, but we’ve learned a lot already from a project that started as a 2-month internship led by a motivated and dedicated young researcher.

To learn more about this project, read Sravani’s paper and explore the interactive views of our maps for flu and SARS-CoV-2 on Nextstrain and our interactive figures on GitHub.

Openings for bioinformatics analyst and software engineer to contribute to Nextstrain platform

21 Jul 2023 by Trevor Bedford

These positions have been filled.

Positions for a bioinformatics analyst and a software engineer are available immediately in the Bedford lab at the Fred Hutch. Details for both positions follow:

Bioinformatics Analyst II/III

We have an opening for a bioinformatician in the Bedford lab at the Fred Hutch to work on genomic epidemiology and evolutionary analysis of pathogens such as SARS-CoV-2, seasonal influenza, and other emerging and endemic pathogens. This position will contribute to ongoing work for the Bedford lab and Nextstrain.

Nextstrain is an award-winning project for tracking infectious disease epidemics developed in collaboration with the Neher lab at the University of Basel. Nextstrain won the Open Science Prize in Feb 2017, has been instrumental in analysis of the SARS-CoV-2 pandemic, Ebola outbreaks, and Zika spread in the Americas, and is used by the World Health Organization to aid in the process of influenza vaccine strain selection. The software we write to power all parts of Nextstrain (bioinformatics, visualizations, analysis pipelines, data management, and more) is entirely open-source and available to the public. We work with public health entities and scientists across the world, both formally and informally, to expand pathogen surveillance capabilities and to improve the automation and robustness of these analyses. Our goal is to empower the wider genomic epidemiology and public health communities to tweak our analyses, create new ones, and communicate scientific insights using the same tools we do.

Responsibilities

This role advances the research aims of the Bedford lab and the Nextstrain team through a combination of independent work, collaboration with scientists and software developers in the group, and interactions with the wider public health and science communities. In this role, the bioinformatician will:

Develop and maintain analytic pipelines such as those that clean and ingest genome metadata, build phylogenetic trees, and run forecasting models for SARS-CoV-2 and other pathogens
Improve the robustness, automation, and monitoring of our existing pathogen pipelines
Develop reproducible pipelines to expand surveillance of endemic and emerging human pathogens, in collaboration with both internal and external groups
Participate in community outreach through office hours, discussion forums, and mailing lists
Write and maintain thorough documentation on software and pipelines
Design software with a diverse range of collaborators and users in mind
Contribute to the Nextstrain team’s decision-making and planning processes
Present at Bedford lab meetings

Qualifications

Minimum qualifications

Master’s degree in bioinformatics, computational biology, biology, or related field with at least three years’ direct experience in computational analysis of large sequence-based molecular data sets.
Fluency in at least one high-level programming language, such as Python, R, Ruby, JavaScript, or Perl
Familiarity with version control and other software development best practices
Experience with workflow managers such as Snakemake, Nextflow, or WDL
Knowledge of molecular biology
Motivated to learn new skills and technologies and collaborate within an existing team’s practices
Excellent written and verbal communication skills

Preferred qualifications

Expertise in genomics
Knowledge of automated testing and workflows such as GitHub Actions
Experience configuring and deploying analyses on a cloud infrastructure

The position is available immediately with flexible starting dates. Informal inquiries are welcome. Applications will be accepted until the position is filled. We offer a competitive salary commensurate with skills and experience, along with benefits. We are committed to improving diversity in the computational sciences. Applicants of diverse backgrounds are particularly encouraged to apply. This is a full-time (40 hours/week) position, but depending on the applicant, it could be filled by a salaried employee or a contracted hourly consultant. An ideal candidate would be local to the Seattle area or willing to relocate, but remote work is also an option.

To aid in applicant review, a coding sample is requested. We’re happy to review whatever you’re most proud of (in any programming language). If you don’t have code that can be publicly shared, that’s okay. Please apply anyway and just let us know that this isn’t available.

If you think you might be a great fit for this position but are concerned about meeting all qualifications, we’d like to hear from you. Please email Trevor Bedford at tbobfuscate@bedford.io and John Huddleston at jhuddlesobfuscate@fredhutch.org.

To apply for this position, please go to the official Fred Hutch listing.

Software Engineer II

The Bedford Lab at the Fred Hutch is seeking a software engineer to work on Nextstrain, an award-winning project for tracking infectious disease epidemics such as the SARS-CoV-2 pandemic, Ebola outbreaks, Zika spread in the Americas, seasonal flu, and other emerging and endemic pathogens. This position will augment our existing team to design, develop, maintain, operate, and support our software and services that empower research scientists and public health practitioners in the lab and around the world.

Nextstrain, developed in collaboration with the Neher Lab at the University of Basel, provides tools for evolutionary analysis of pathogens and genomic epidemiology. We write open source software in a public development style to power all parts of Nextstrain—bioinformatics, visualizations, analysis pipelines, data management, and more—and our analyses use open data whenever possible. We work with public health entities and scientists across the world, both formally and informally, to expand pathogen surveillance capabilities and to improve the automation and robustness of these analyses. Our goal is to empower the wider genomic epidemiology and public health communities to tweak our analyses, create new ones, and communicate scientific insights using the same tools we do.

About the role

This position will be responsible for general software engineering and development work across the entire Nextstrain stack. This includes command-line applications for bioinformatics and data/workflow management (e.g. Augur, Nextstrain CLI), visualization applications for phylogenetics (e.g. Auspice), full-stack web applications for sharing analyses (e.g. nextstrain.org), workflows for data curation and analysis (e.g. ncov-ingest), runtimes for Nextstrain analyses (e.g. docker-base, conda-base), and internal tooling/infrastructure to support all of that.

What we provide

Empowerment to craft software that helps protect the world from epidemics and pandemics
An ecosystem of cross-disciplinary learning that draws insights from scientists, public health practitioners, and fellow software developers
A team that believes in continuous learning and cultivates an environment where all members of the group help each other
Opportunity for growth as a software developer in areas of personal interest (e.g. front-end JavaScript, back-end infrastructure, data pipelines, being a project lead, etc.)
A team culture that champions a healthy work-life balance
A competitive compensation package, with comprehensive health and welfare benefits

What you’ll do

Design, develop, test, document, and maintain software within a coherent ecosystem
Release new versions of packaged programs for installation by users and deploy new versions of hosted services to users
Configure and manage cloud infrastructure resources (e.g. AWS, Heroku, Terraform)
Create, extend, and troubleshoot automated workflows (e.g. GitHub Actions, Snakemake, Nextflow, WDL)
Participate in constructive code review processes with other team members
Support internal and external users of software projects via various communication channels

Integrating with an existing team both in-person and online is a key aspect of this position. This position will work daily within a small team of Bedford Lab members and collaborators. The Nextstrain team communicates openly about project and organizational decisions and encourages participation by all team members in decision-making.

Minimum qualifications

3+ years of experience in software engineering
Fluency in Python and JavaScript/TypeScript, or fluency in similar languages
Proficiency with Linux/Unix and command-line interfaces
Proficiency with version control and software development best practices
Excellent written and verbal communication skills
Motivation to learn and collaborate within an existing team’s practices

The position is available immediately with flexible starting dates. Informal inquiries are welcome. Applications will be accepted until the position is filled. We offer a competitive salary commensurate with skills and experience, along with benefits. We are committed to improving diversity in the computational sciences. Applicants of diverse backgrounds are particularly encouraged to apply. This is a full-time (40 hours/week) position, but depending on the applicant, it could be filled by a salaried employee or a contracted hourly consultant. An ideal candidate would be local to the Seattle area or willing to relocate, but remote work is also an option.

To aid in applicant review, we request you submit a cover letter, your resume, and a coding sample. For the coding sample, we’re happy to review whatever you’re most proud of (in any programming language). If you don’t have code that can be publicly shared, that’s okay. Please apply anyway and just let us know that this isn’t available.

If you’re interested in this position but are concerned about meeting all the qualifications, we’d like to hear from you. Please email Trevor Bedford at tbobfuscate@bedford.io and Thomas Sibley at tsibleyobfuscate@fredhutch.org.

To apply for this position, please go to the official HHMI listing.

Openings for bioinformatician and full-stack developer to contribute to Nextstrain platform

12 Jul 2021 by Trevor Bedford

Positions for a bioinformatician and a full-stack developer are available immediately in the Bedford lab at the Fred Hutch. Details for both positions follow:

Bioinformatician

We have an opening for a bioinformatician in the Bedford lab at the Fred Hutch to work on genomic epidemiology and evolutionary analysis of pathogens including SARS-CoV-2, influenza and Ebola virus. This position will contribute to ongoing work on two major projects: Nextstrain and Seattle Flu Study.

Nextstrain is an award-winning tool for tracking infectious disease epidemics developed in collaboration with the Neher lab at the University of Basel. Nextstrain won the Open Science Prize in Feb 2017, has been instrumental in analysis of the SARS-CoV-2 pandemic, Ebola outbreaks, and Zika spread in the Americas, and is used by the World Health Organization to aid in the process of influenza vaccine strain selection.

The Seattle Flu Study is a collaboration of groups at the Brotman Baty Institute, the Fred Hutch, the University of Washington, and Seattle Children’s. Already in its third year, this study has produced high-resolution analyses of the spread of SARS-CoV-2 and influenza in Seattle by building a software platform that processes subject and sample metadata, lab assay results, and raw and processed genome data in near-real time.

Responsibilities

The role involves both development and maintenance of bioinformatic analyses and pipelines which underpin both projects’ research aims. This will involve a mixture of independent work, collaboration with scientists in the group and interactions with the wider community. The vast majority of code is open-source. Specific examples from Nextstrain include analytic pipelines that clean and ingest genome metadata, construct consensus genomes, and build phylogenetic trees, as well as tools to enable a diverse range of collaborators to run SARS-CoV-2 analyses through Nextstrain. Work on Seattle Flu Study focuses on pipelines to assemble raw sequence data into consensus SARS-CoV-2 and influenza genomes and deposition of these consensus genomes to public databases.

Interfacing with project collaborators in-person and online is a key aspect of this position. The bioinformatician will work within a small team of existing members of the Bedford lab and the larger research group of the Seattle Flu Study. The Nextstrain team communicates openly about project and organizational decisions and encourages participation by all team members in the decision-making process.

Qualifications

Minimum qualifications

Fluency in at least one high-level programming language, such as Python, R, Ruby, JavaScript or Perl
Knowledge of molecular biology
Motivated to learn new skills and technologies
Excellent written and verbal communication skills

Preferred qualifications

Expertise in genomics
Experience with pipeline or workflow automation
Familiarity with software development best practices
Experience configuring and deploying analyses on a cloud infrastructure
Experience and willingness to participate in team decision-making processes

The Fred Hutch is located in South Lake Union in Seattle, WA and offers a dynamic work environment with cutting-edge science and computational resources. The position is available immediately with flexible starting dates. Informal inquiries are welcome. Applications will be accepted until the position is filled. We offer a competitive salary commensurate with skills and experience, along with benefits. The Fred Hutch and the Bedford lab are committed to improving diversity in the computational sciences. Applicants of diverse backgrounds are particularly encouraged to apply. Depending on the applicant, this position could be a full-time salaried employee, a part-time employee, or a contracted consultant. An ideal candidate would be local to the Seattle area or willing to relocate, but remote work is also an option.

To apply for this position please go to the Fred Hutch Careers Job ID 19821.

To aid in applicant review, a coding sample is requested. We’re happy to review whatever you’re most proud of (in any programming language). If you don’t have code that can be publicly shared, that’s okay. Please apply anyway and just let us know that this isn’t available.

If you think you might be a great fit for this position but are concerned about meeting all qualifications, we’d like to hear from you. Please email Trevor Bedford at tbedfordobfuscate@fredhutch.org or John Huddleston at jhuddlesobfuscate@fredhutch.org.

Full-stack Developer

A position for a full-stack developer is available immediately in the Bedford lab at the Fred Hutch to work on an open-source software platform for genomic epidemiology and evolutionary analysis of pathogens including SARS-CoV-2, influenza and Ebola virus. This position will contribute to ongoing work on Nextstrain, one of the lab’s major projects.

Nextstrain is an award-winning tool for tracking infectious disease epidemics developed in collaboration with the Neher lab at the University of Basel. Nextstrain won the Open Science Prize in Feb 2017, has been instrumental in analysis of the SARS-CoV-2 pandemic, Ebola outbreaks, and Zika spread in the Americas, and is used by the World Health Organization to aid in the process of influenza vaccine strain selection.

Responsibilities

This role is responsible for development work up and down the entire Nextstrain software stack and involves both back-end and front-end development. All development occurs in an open-source fashion via github.com/nextstrain. Specific priorities currently include infrastructure and pipelines to ingest and curate genomic data from public databases, optimizing use of cloud computing services to process this data, services to host and share analyses uploaded by Nextstrain users, and development of command line tools for working with Nextstrain. Informatic work focuses on development of the Augur bioinformatics toolkit and pathogen-specific workflows. Front-end work focuses on user functionality at nextstrain.org, including management of cloud computing and storage, as well as visualization improvements to the Auspice visualization JavaScript application. Contributing to documentation on the Nextstrain software stack is a vital responsibility of this position.

Interfacing with project collaborators in-person and online is a key aspect of this position. The developer will work within a small team of existing members of the Bedford lab as well as other contributors to Nextstrain. The Nextstrain team communicates openly about project and organizational decisions and encourages participation by all team members in the decision-making process.

Qualifications

Minimum qualifications

Fluency in at least one high-level programming language, such as Python, R, Ruby, JavaScript or Perl
Excellent written and verbal communication skills
Experience in the following areas:
- Web development
- Database systems
- Cloud infrastructure
- Software engineering and documentation best practices

Preferred qualifications

Experience working with genomic data
Systems integration
Experience designing effective data visualizations
Experience and willingness to participate in team decision-making processes

The Fred Hutch is located in South Lake Union in Seattle, WA and offers a dynamic work environment with cutting-edge science and computational resources. The position is available immediately with flexible starting dates. Informal inquiries are welcome. Applications will be accepted until the position is filled. We offer a competitive salary commensurate with skills and experience, along with benefits. The Fred Hutch and the Bedford lab are committed to improving diversity in the computational sciences. Applicants of diverse backgrounds are particularly encouraged to apply. Depending on the applicant, this position could be a full-time salaried employee, a part-time employee, or a contracted consultant. An ideal candidate would be local to the Seattle area or willing to relocate, but remote work is also an option.

To apply for this position please go to the Fred Hutch Careers Job ID 19820.

To aid in applicant review, a coding sample is requested. We’re happy to review whatever you’re most proud of (in any programming language). If you don’t have code that can be publicly shared, that’s okay. Please apply anyway and just let us know that this isn’t available.

If you think you might be a great fit for this position but are concerned about meeting all qualifications, we’d like to hear from you. Please email Trevor Bedford at tbedfordobfuscate@fredhutch.org or John Huddleston at jhuddlesobfuscate@fredhutch.org.

Predicting seasonal influenza evolution

13 Oct 2020 by John Huddleston and Pierre Barrat-Charlaix

In this post, we summarize and synthesize the results of our recent efforts to predict influenza evolution as described in Huddleston et al. 2020 and Barrat-Charlaix et al. 2020.

Why do we try to predict seasonal influenza evolution?

Seasonal influenza (or “flu”) sickens or kills millions of people per year. Flu vaccines are one of the most effective preventative measures against infection. However, flu vaccines require almost a year to develop and can only contain a single representative virus per flu lineage (A/H3N2, A/H1N1pdm, B/Victoria, and B/Yamagata). These limitations require researchers to predict which single current flu virus will be the most representative of the flu population one year in the future. The better these predictions are, the more likely the vaccine will prevent illness and death from infection.

How do we think flu evolves?

Flu rapidly accumulates mutations during replication, due to its error-prone RNA polymerase. For most flu genes, most new amino acid mutations will weaken the functionality of their corresponding proteins and reduce the virus’s fitness. For flu’s primary surface proteins, hemagglutinin (HA) and neuraminidase (NA), some amino acid mutations modify binding sites of host antibodies from previous infections. These mutations increase a virus’s fitness by allowing the virus to escape existing antibodies in a process called antigenic drift (Figure 1). Mutations in HA and NA create fitness trade-offs, where beneficial mutations facilitate antigenic drift against a background of deleterious mutations.

Figure 1. HA accumulates beneficial mutations in its head domain (sites with color) that enable escape from antibody binding and deleterious mutations in its stalk domain (sites in gray) that reduce its ability to infect new host cells. The linear genome view on the left shows how sites from HA’s head domain map to the three-dimensional structure of an HA trimer. The site highlighted in yellow reveals where different amino acid mutations allowed a flu virus to escape binding from existing antibodies in a human’s polyclonal sera (Lee et al. 2019). Explore this figure interactively with dms-view.

Viruses carrying beneficial mutations should grow exponentially relative to viruses lacking those mutations (Figure 2A). Beneficial mutations on different genetic backgrounds will compete with each other in a process known as clonal interference (Figure 2B). If beneficial mutations have large effects on fitness, the fitness of the genetic background where the beneficial mutations occur is less important for the success of the virus than the fitness effect of the beneficial mutations themselves (Figure 3). If beneficial mutations have similar, smaller effects on fitness, a virus’s overall fitness depends on the effect of the beneficial mutations and the relative fitness of its genetic background. In this case, the ultimate success and fixation of these beneficial mutations depends, in part, on the number of deleterious mutations that already exist in the same genome (Figure 4).

Figure 2. Individuals in asexually reproducing populations tend to grow exponentially relative to their fitness (left). Normalization of frequencies to sum to 100% represents competition between viruses for hosts through clonal interference and reveals how exponentially growing viruses can decrease in frequency when their relative fitness is low (right).
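The relationship between exponential growth and normalized frequencies in Figure 2 can be illustrated with a minimal Python sketch. The function name `project_frequencies`, the starting frequencies, the fitness values, and the time step are all hypothetical and chosen only for illustration; they do not come from either study.

```python
import math


def project_frequencies(frequencies, fitnesses, delta_t=1.0):
    """Grow each variant exponentially by its fitness, then renormalize to sum to 1."""
    grown = [x * math.exp(f * delta_t) for x, f in zip(frequencies, fitnesses)]
    total = sum(grown)
    return [g / total for g in grown]


# Hypothetical starting frequencies and fitnesses for three competing variants.
current = [0.5, 0.3, 0.2]
fitnesses = [0.5, 1.0, 2.0]

future = project_frequencies(current, fitnesses)
print([round(x, 3) for x in future])
```

All three variants have positive fitness and grow in absolute terms, but the first variant still loses frequency after normalization because its fitness is low relative to its competitors.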

Figure 3. The shape of fitness landscapes depends, in part, on mutation effect sizes. Mutations with similar, smaller effects (blue and orange circles) produce a smooth Gaussian fitness distribution while mutations with large effect sizes (green, yellow, and purple circles) produce a more discrete fitness distribution. From Figure 1A and B of Neher 2013.

Figure 4. The fixation probability of a beneficial mutation is a function of the mutation’s genetic background. When mutations have similar, smaller effects, the fitness of a beneficial mutation’s genetic background (red) contributes to the mutation’s fixation probability (green). Mutations that ultimately fix originate from the distribution given by the product of the background fitness and the fixation probability (blue). From Figure 2C of Neher 2013.

What is predictable about flu evolution?

The expectations from population genetic theory described above and previous experimental work suggest that aspects of flu’s evolution might be predictable. Mutations in HA and NA that alter host antibody binding sites and enable viruses to reinfect hosts should be under strong positive selection. We expect these strongly beneficial mutations to sweep through the global flu population at a rate that depends on the importance of their genetic background. We also do not expect that every site in HA or NA will acquire beneficial mutations. For example, fewer than a quarter of HA’s 566 amino acid sites are under positive selection (Bush et al. 1999), have undergone rapid sweeps (Shih et al. 2007), or contributed to antigenic drift (Wolf et al. 2006). Importantly, not all of these sites contribute equally to antigenic drift (Koel et al. 2013). Additionally, the complex and strong pressures of existing human immunity appear to constrain the space of antigenic phenotypes that viruses can explore at any given time (Smith et al. 2004, Bedford et al. 2012).

Recently, researchers have built on this evidence to create formal predictive models of flu evolution. Neher et al. 2014 used expectations from traveling wave models to define the “local branching index” (LBI), an estimate of viral fitness. LBI assumes that most extant viruses descend from a highly fit ancestor in the recent past and uses patterns of rapid branching in phylogenies to identify putative fit ancestors (Figure 5). Neher et al. 2014 showed that LBI could successfully identify individual ancestral nodes that were highly representative of the flu population one year in the future.

Figure 5. Local branching index (LBI) estimates the fitness of viruses in a phylogeny. A) LBI assumes that mutations at the high fitness edge of a current population will seed future populations. From Figure 5D of Neher 2013. B) In practice, LBI tends to identify clusters of recently expanding populations, as shown in this seasonal influenza A/H3N2 phylogeny from Nextstrain. Explore LBI values in the current Nextstrain phylogeny for A/H3N2.
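One way to make the idea of "rapid branching" concrete is to sum the tree length surrounding each node, discounting branches exponentially with their distance from that node. The Python sketch below does this with two passes of message passing over a tree. It is an illustration under stated assumptions, not the implementation used by Neher et al. 2014 or Nextstrain: the dictionary-based tree representation, the function name `local_branching_index`, the toy tree, and the value of the distance scale `tau` are all hypothetical, and the sketch omits steps such as restricting the calculation to recent samples or normalizing the resulting values.

```python
import math


def local_branching_index(children, branch_length, root, tau):
    """Sum tree length around each node, discounted exponentially by distance / tau.

    children: dict mapping a node to the list of its child nodes (assumed layout).
    branch_length: dict mapping a node to the length of the branch above it.
    """
    # Preorder traversal; its reverse visits children before their parents.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Upward pass: discounted tree length in the subtree below each node,
    # as seen from the parent end of the node's branch.
    up = {}
    for node in reversed(order):
        decay = math.exp(-branch_length.get(node, 0.0) / tau)
        below = sum(up[c] for c in children.get(node, []))
        up[node] = tau * (1 - decay) + decay * below

    # Downward pass: discounted tree length in the rest of the tree,
    # as seen from the child end of the node's branch.
    down = {root: 0.0}
    for node in order:
        kids = children.get(node, [])
        total_up = sum(up[c] for c in kids)
        for c in kids:
            decay = math.exp(-branch_length[c] / tau)
            down[c] = tau * (1 - decay) + decay * (down[node] + total_up - up[c])

    return {n: down[n] + sum(up[c] for c in children.get(n, [])) for n in order}


# Hypothetical toy tree: clade A has three recent descendants, clade B has one.
children = {"root": ["A", "B"], "A": ["A1", "A2", "A3"], "B": ["B1"]}
branch_length = {"A": 0.5, "B": 0.5, "A1": 0.2, "A2": 0.2, "A3": 0.2, "B1": 0.2}

lbi = local_branching_index(children, branch_length, "root", tau=0.5)
print(lbi["A"] > lbi["B"])
```

In this toy tree, the ancestor of the more densely branching clade receives the higher value, which matches the intuition that LBI highlights recently expanding clusters.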

Łuksza and Lässig 2014 developed a mechanistic model to forecast flu evolution based on population genetic theory and previous experimental work. This model assumed that flu viruses grow exponentially as a function of their fitness, compete with each other for hosts through clonal interference, and balance positive effects of mutations at sites previously associated with antigenic drift and deleterious effects of all other mutations. Instead of predicting the most representative virus of the future population, Łuksza and Lässig 2014 explicitly predicted the future frequencies of entire clades.

Despite the success of these predictive models, other aspects of flu evolution complicate predictions. When multiple beneficial mutations with large effects emerge in a population, the clonal interference between viruses reduces the probability of fixation for all mutations involved. Flu populations also experience multiple bottlenecks in space and time including transmission between hosts, global circulation, and seasonality. These bottlenecks reduce flu’s effective population size and reduce the probability that beneficial mutations will sweep globally. Finally, antigenic escape assays with polyclonal human sera suggest that successful viruses must accumulate multiple beneficial mutations of large effect to successfully evade the diversity of global host immunity (Lee et al. 2019).

Does flu evolve like we think it does?

In Barrat-Charlaix et al. 2020, we investigated the predictability of flu mutation frequencies. We explicitly avoided modeling flu evolution and focused on an empirical account of long-term outcomes for mutation frequency trajectories. We selected all available HA and NA sequences for flu lineages A/H3N2 and A/H1N1pdm, performed multiple sequence alignments per lineage and gene, binned sequences by month, and calculated the frequencies of mutations per site and month. From these data, we constructed frequency trajectories of individual mutations that were rising in frequency from zero. We expected these rising mutations to represent beneficial, large-effect mutations that would sweep through the global population as predicted by the population genetic theory described above. By considering individual mutations, we effectively averaged the outcomes of these mutations across all genetic backgrounds. We evaluated the outcomes of trajectories for mutations that had risen from 0% to approximately 30% global frequency and classified trajectories for mutations that fixed, died out, or persisted as polymorphisms.

Figure 6. Mutation trajectories for seasonal influenza A/H3N2 where mutations rose from a frequency of zero to approximately 30% frequency. Dashed horizontal lines represent thresholds for fixation (red) and loss (blue). Trajectory colors also indicate eventual fixation (red), loss (blue), or persistence as a polymorphism (black). The thick black dashed line indicates the average frequency of all trajectories shown. For the interactive figure, hover over individual trajectories to highlight their full extent and details about the current frequency of a given mutation at each timepoint. Use the radio buttons to filter trajectories by segment and outcome. (After Figure 1B in Barrat-Charlaix et al. 2020.)
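The step of binning aligned sequences by month and calculating the frequency of a mutation per site can be sketched in a few lines of Python. This is a simplified illustration rather than the pipeline from Barrat-Charlaix et al. 2020: the function name `monthly_mutation_frequency`, the `(month, sequence)` input layout, the 0-based site index, and the toy sequences are assumptions, and the sketch applies no smoothing or weighting of samples.

```python
from collections import defaultdict


def monthly_mutation_frequency(samples, site, amino_acid):
    """Return {month: fraction of that month's sequences with amino_acid at site}.

    samples: iterable of (month, aligned_sequence) pairs, e.g. ("2019-01", "NKT").
    site: 0-based position in the alignment (an assumption of this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # month -> [with mutation, total]
    for month, sequence in samples:
        counts[month][1] += 1
        if sequence[site] == amino_acid:
            counts[month][0] += 1
    return {month: hits / total for month, (hits, total) in sorted(counts.items())}


# Hypothetical aligned sequences sampled over three months.
samples = [
    ("2019-01", "NKT"), ("2019-01", "NKT"),
    ("2019-02", "NKT"), ("2019-02", "NRT"),
    ("2019-03", "NRT"), ("2019-03", "NRT"), ("2019-03", "NKT"),
]

trajectory = monthly_mutation_frequency(samples, site=1, amino_acid="R")
print(trajectory)
```

The resulting mapping from month to frequency is one frequency trajectory of the kind plotted in Figure 6.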

The average trajectory of individual rising A/H3N2 mutations failed to rise toward fixation (Figure 6). Instead, the future frequency of these mutations was no higher on average than their initial frequency. We repeated this analysis for mutations with initial frequencies of 50% and 75% and for mutations in A/H1N1pdm and found nearly the same results. From these results, we concluded that it is not possible to predict the short-term dynamics of individual mutations based solely on their recent success.

Next, we calculated the fixation probability of each mutation trajectory based on its initial frequency. Surprisingly, we found that the fixation probabilities of A/H3N2 mutations were equal to their initial frequencies. This pattern corresponds to what we expect for mutations evolving neutrally, where population genetic theory predicts that fixation probability is equal to current mutation frequency. Generally, the pattern remained the same even when we binned mutations by high LBI, presence at epitope sites, multiple appearances of a mutation in a tree, geographic spread, or other potential metrics associated with high fitness. We concluded that the recent success of rising mutations provides no information about their eventual fixation.
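The comparison between fixation probability and initial frequency can be illustrated with a small Python sketch that classifies trajectories and counts how many fixed. Several details are assumptions made for illustration only: the thresholds of 95% and 5%, the choice to compute the probability over trajectories that either fixed or were lost, the function names, and the toy trajectories. Each trajectory is assumed to start at the point where the mutation reached its initial frequency.

```python
def classify_trajectory(frequencies, fix_threshold=0.95, loss_threshold=0.05):
    """Label a trajectory by the first threshold it crosses, if any."""
    for frequency in frequencies:
        if frequency >= fix_threshold:
            return "fixed"
        if frequency <= loss_threshold:
            return "lost"
    return "polymorphic"


def fixation_probability(trajectories):
    """Fraction of resolved (fixed or lost) trajectories that fixed."""
    outcomes = [classify_trajectory(t) for t in trajectories]
    resolved = [o for o in outcomes if o != "polymorphic"]
    if not resolved:
        return None
    return resolved.count("fixed") / len(resolved)


# Hypothetical trajectories that all start near 30% frequency.
trajectories = [
    [0.30, 0.50, 0.80, 0.97],  # fixes
    [0.30, 0.20, 0.04],        # lost
    [0.30, 0.35, 0.10, 0.00],  # lost
    [0.30, 0.40, 0.50],        # still polymorphic
]

print(fixation_probability(trajectories))
```

Under neutral evolution, the value returned for a large set of trajectories starting at 30% frequency would be expected to be close to 0.3.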

We tested whether we could explain these results by genetic linkage or clonal interference by simulating flu-like populations under these evolutionary constraints. Mutation trajectories from simulated populations were more predictable than those from natural populations. The closest our simulations came to matching the uncertainty of natural populations was when we dramatically increased the rate at which the fitness landscape of simulated populations changed. These results suggested that we cannot explain the unpredictable nature of flu mutation trajectories by linkage or clonal interference alone.

Since flu mutation trajectories lacked “momentum” and LBI did not provide information about eventual fixation of mutations, we wondered whether we could identify the most representative sequence of future populations with a different metric. The consensus sequence is provably the best predictor for a neutrally evolving population. We found that the consensus sequence is often closer to the future population than the virus sequence with the highest LBI. Indeed, we found that the top LBI virus was frequently similar to the consensus sequence and often identical.

Taken together, our results from this empirical analysis reveal that beneficial mutations of large effect do not predictably sweep through flu populations and fix. Instead, the average outcome for any individual mutation resembles neutral evolution, despite the strong positive selection expected to act on these mutations. Although simulations rule out clonal interference between large effect mutations as an explanation for these results, we cannot discount the role of multiple mutations of similar, smaller effects in the overall fitness of flu viruses and the fixation of “rafts” of co-evolving mutations.

Can we forecast flu evolution?

In Huddleston et al. 2020, we built a modeling framework based on the approach described in Łuksza and Lässig 2014 to forecast flu A/H3N2 populations one year in advance. We used this framework to predict the sequence composition of the future population, the frequency dynamics of clades, and the virus in the current population that most represented the future population. As in Barrat-Charlaix et al. 2020 and Łuksza and Lässig 2014, we assumed that viruses grow exponentially as a function of their fitness and that viruses with similarly high fitness compete with each other under clonal interference. In contrast to Barrat-Charlaix et al. 2020, we considered the fitness of complete amino acid haplotypes instead of individual mutations.

We estimated fitness with metrics based on HA sequences and experimental measurements of antigenic drift and functional constraint. The sequence-based metrics included the epitope cross-immunity and mutational load estimates defined by Łuksza and Lässig 2014, LBI from Neher et al. 2014, and “delta frequency”, a measure of recent change in clade frequency analogous to Barrat-Charlaix’s rising mutations. The experimental metrics included a cross-immunity measure based on hemagglutination inhibition (HI) assays (Neher et al. 2016) and an estimate of functional constraint based on mutational preferences from deep mutational scanning experiments (Lee et al. 2018).

We trained models based on each of these metrics independently and in relevant combinations of complementary metrics. For each model, we fit coefficients per fitness metric that minimized the distance between the estimated and observed amino acid haplotype composition of the future (Figure 7). These coefficients represent the effect of each metric on flu fitness. As a control, we also calculated the distance to the future population for a “naive” model that assumed the future population is the same as the current population. To test our framework, we simulated 40 years of evolution for flu-like populations with SANTA-SIM and fit models to these data. After verifying our framework with simulated populations, we trained models for natural A/H3N2 populations using 25 years of historical data. We tested the accuracy of each model by applying the coefficients from the training data to forecasts of new out-of-sample data from the last 5 years of A/H3N2 evolution.

Figure 7. Schematic representation of the fitness model for simulated H3N2-like populations wherein the fitness of strains at timepoint t determines the estimated frequency of strains with similar sequences one year in the future at timepoint u. Strains are colored by their amino acid sequence composition such that genetically similar strains have similar colors. A) Strains at timepoint t, x(t), are shown in their phylogenetic context and sized by their frequency at that timepoint. The estimated future population at timepoint u, x̂(u), is projected to the right with strains scaled in size by their projected frequency based on the known fitness of each simulated strain. B) The frequency trajectories of strains at timepoint t to u represent the predicted growth of the dark blue strains to the detriment of the pink strains. C) Strains at timepoint u, x(u), are shown in the corresponding phylogeny for that timepoint and scaled by their frequency at that time. D) The observed frequency trajectories of strains at timepoint u broadly recapitulate the model’s forecasts while also revealing increased diversity of sequences at the future timepoint that the model could not anticipate, e.g. the emergence of the light blue cluster from within the successful dark blue cluster. Model coefficients minimize the earth mover’s distance between amino acid sequences in the observed, x(u), and estimated, x̂(u), future populations across all training windows. (After Figure 1 in Huddleston et al. 2020.)
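The idea of fitting a coefficient per fitness metric can be sketched in Python for the simplest case of a single metric. This sketch departs from the published framework in ways that should be read as assumptions: it replaces the earth mover's distance between amino acid sequences with the summed absolute difference between projected and observed frequencies, it tracks the same strains at both timepoints, and it searches a fixed grid of candidate coefficients instead of using a numerical optimizer. The function names and toy data are hypothetical.

```python
import math


def project(frequencies, metric, beta):
    """Project frequencies forward with fitness = beta * metric, then normalize."""
    grown = [x * math.exp(beta * m) for x, m in zip(frequencies, metric)]
    total = sum(grown)
    return [g / total for g in grown]


def fit_coefficient(windows, candidate_betas):
    """Pick the coefficient minimizing total error across training windows.

    windows: list of (current_frequencies, metric_values, observed_future_frequencies).
    The error is a stand-in for the earth mover's distance, not the real thing.
    """
    def total_error(beta):
        return sum(
            sum(abs(p - o) for p, o in zip(project(current, metric, beta), future))
            for current, metric, future in windows
        )
    return min(candidate_betas, key=total_error)


# Hypothetical training windows generated with a known coefficient of 1.0.
true_beta = 1.0
inputs = [
    ([0.6, 0.3, 0.1], [0.0, 0.5, 2.0]),
    ([0.2, 0.5, 0.3], [1.0, -0.5, 0.3]),
]
windows = [(x, m, project(x, m, true_beta)) for x, m in inputs]

candidate_betas = [i * 0.5 for i in range(-4, 5)]  # -2.0 to 2.0 in steps of 0.5
best_beta = fit_coefficient(windows, candidate_betas)
print(best_beta)
```

A coefficient of zero in this sketch corresponds to the naive model, since the projected population is then identical to the current population.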

We found that the most robust forecasts depended on a combined model of experimentally-informed antigenic drift and sequence-based mutational load. Importantly, this model explicitly accounts for the benefits of antigenic drift and the costs of deleterious mutations. This model also slightly outperformed the naive model in its estimation of future clade frequencies. However, we found that the naive model often selected individual strains that were as close to the future population as the best biologically-informed model. The naive model’s estimated closest strain to the future is effectively the weighted average of the current population and conceptually similar to the consensus sequence of the population. From these results, we concluded that the predictive gains of fitness models depend on the prediction target.

Surprisingly, the sequence-based metrics of epitope cross-immunity and delta frequency and the mutational preferences from DMS experiments had little predictive power. These metrics failed to make accurate forecasts because of their dependence on a specific historical context. For example, the original epitope cross-immunity metric (Łuksza and Lässig 2014) depends on a predefined list of epitope sites that were originally identified in a retrospective study of flu sequences up through 2005 (Shih et al. 2007). This metric correspondingly failed to predict the future after 2005, suggesting that its previous success depended on inadvertently borrowing information from the future. Similarly, the mutational preferences from DMS experiments measure effects of all single amino acid mutations to the genetic background of the virus A/Perth/16/2009. The metric based on these preferences failed to predict the future after 2009, reflecting the strong dependence of these preferences on their original genetic background. Both delta frequency and LBI suffered from overfitting to the training data, in a more general form of historical dependence.

How do results from our two studies compare?

The two studies we have presented here use different approaches to analyze the same natural flu populations. We completed these two studies mostly independently and have only now begun to reconcile their findings. We were especially interested to understand how simulated populations from the two studies differed and whether the optimal predictor from Barrat-Charlaix et al. 2020 could also be an accurate fitness metric in the modeling framework from Huddleston et al. 2020.

Simulated populations play an important role in our two studies. We generated these simulated data as a source of truth where we understand the population dynamics because we defined them. In Barrat-Charlaix et al. 2020, the simulated binary populations from ffpopsim (Zanini and Neher 2012) evolved under strong epistasis and immune escape pressure. These populations showed us that mutation trajectories could be predictable under these population genetic constraints. In Huddleston et al. 2020, the simulated nucleotide populations from SANTA-SIM (Jariani et al. 2019) also evolved under strong epistasis, purifying selection, and an “exposure dependent” fitness function that mimics immune escape pressure. We used these populations to confirm that our forecasting framework could accurately predict the composition of future populations. Interestingly, when we inspected the predictability of the mutation trajectories for these simulated populations, we found that they resembled the weak predictability of natural H1N1pdm trajectories (Figure 8). Despite the weak predictability of mutation trajectories from these simulated populations, we were able to forecast the composition of their future populations. These results highlight the importance of using complete haplotypes to make predictions, as individual mutation trajectories remain difficult to predict.

Figure 8. Comparison of rising trajectories for natural H1N1pdm trajectories from Barrat-Charlaix et al. 2020 and simulated flu-like populations from Huddleston et al. 2020. A) Rising trajectories for H1N1pdm mutations as reported in Figure S9 of Barrat-Charlaix et al. 2020. B) Rising trajectories for flu-like populations simulated with SANTA-SIM in Huddleston et al. 2020. Mutation trajectories from simulated populations resemble those of natural H1N1pdm mutations.

We also wanted to know whether the optimal metric from Barrat-Charlaix et al. 2020 for selecting a representative of the future, the consensus sequence of the current population, could make accurate forecasts in the modeling framework from Huddleston et al. 2020. We noted above that the closest strain to the future selected by the naive model from Huddleston et al. 2020 is analogous to the consensus sequence of the current population. One important difference is that the naive model has to select a previously sampled strain while the consensus sequence represents a hypothetical strain that may not exist in nature. To understand whether the consensus sequence could also improve forecasts of the future population’s haplotype composition, we developed a new fitness metric called the “distance from consensus”. For each timepoint in our forecasting analysis, we constructed the amino acid consensus sequence from all extant strains and calculated the pairwise distance between the consensus and each extant strain. If the consensus sequence is the best representation of the future population, we expected the corresponding model’s coefficients to be consistently negative. This negative coefficient would have the effect of penalizing strains whose amino acid sequences diverged greatly from the consensus sequence.
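The "distance from consensus" metric described above can be sketched in Python as follows. The sketch assumes that sequences are already aligned, that every strain contributes equally to the consensus, and that distance is the number of mismatched amino acid positions (Hamming distance). The function names and toy sequences are hypothetical.

```python
from collections import Counter


def consensus_sequence(sequences):
    """Most common amino acid at each position of equal-length aligned sequences."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


def distance_from_consensus(sequences):
    """Number of positions at which each sequence differs from the consensus."""
    consensus = consensus_sequence(sequences)
    return [sum(a != b for a, b in zip(seq, consensus)) for seq in sequences]


# Hypothetical aligned amino acid sequences for four extant strains.
sequences = ["NKT", "NKT", "NRT", "SKT"]

consensus = consensus_sequence(sequences)
distances = distance_from_consensus(sequences)
print(consensus, distances)
```

With a negative coefficient, strains with larger distances in this list would receive lower fitness and lower projected frequencies.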

Figure 9. Model coefficients and distance to the future for LBI, HI antigenic novelty, and distance from consensus metrics. A) Coefficients are shown per validation timepoint (solid circles, N=23) with the mean +/- standard deviation in the top-left corner. For model testing, coefficients were fixed to their mean values from training/validation and applied to out-of-sample test data (open circles, N=8). B) Distances between projected and observed populations are shown per validation timepoint (solid black circles) or test timepoint (open black circles). The mean +/- standard deviation of distances per validation timepoint are shown in the top-left of each panel. Corresponding values per test timepoint are in the top-right. The naive model’s distance to the future (light gray) was 6.40 +/- 1.36 AAs for validation timepoints and 6.82 +/- 1.74 AAs for test timepoints. The corresponding lower bounds on the estimated distance to the future (dark gray) were 2.60 +/- 0.89 AAs and 2.28 +/- 0.61 AAs.

We fit a model to this new metric using the same 25 years of historical A/H3N2 data described in Huddleston et al. 2020 and tested the robustness of the model on the last 5 years of A/H3N2 data. We compared the performance of this model to models for LBI and experimental measures of antigenic drift (HI antigenic novelty). For the first half of the training period, the distance from consensus metric received a coefficient of zero, meaning it did not improve forecasts over the naive model (Figure 9). In the second half of the training period, the metric received a strong negative coefficient, as we expected. When we applied the mean coefficient from the training period to out-of-sample data in the test period, we found that the distance from consensus metric outperformed LBI and performed only slightly worse than the antigenic drift metric. These results support findings from both of our studies. The consensus sequence is a more robust representative of the future than LBI, as shown in Barrat-Charlaix et al. 2020. However, experimental measurements of antigenic drift still provide more information about the future population than sequence-only metrics, as shown in Huddleston et al. 2020. We anticipate that this new distance from consensus metric could eventually replace the existing mutational load metric in a combined model with HI antigenic novelty. This new combined model could potentially provide better estimates of functional constraint (by limiting changes from the consensus) and antigenic drift (by using experimental measures of antigenic drift phenotypes).

How have these results changed how we think about flu evolution?

In general, we found that the evolution of H3N2 flu populations remains difficult to predict. The frequency dynamics and fixation probabilities of individual mutations resemble neutrally evolving alleles. We can weakly predict the frequency dynamics of flu clades when we combine experimental and genetic data in models that account for antigenic drift and mutational load. In the best case, we can use these same biologically-informed models to predict the sequence composition of future flu populations. However, these complex fitness models do not always outperform simpler models when predicting which individual virus is the most representative of the future population. In Barrat-Charlaix et al. 2020, the consensus sequence of the current population was as close or closer to the future population than the sequence with the highest local branching index. In Huddleston et al. 2020, a naive model estimated the single closest strain to the future nearly as well as the best biologically-informed models.

Successful flu predictions depend on the choice of prediction targets and fitness metrics. Future prediction efforts should attempt to estimate the composition of future populations instead of future clade frequencies. Fitness models should account for the genetic background of beneficial mutations and favor fitness metrics that are the least susceptible to model overfitting and historical contingency. The benefits of considering the genetic background of individual mutations in HA suggest that considering the context of all genes should yield gains, too. We need measures of antigenic drift from human antisera to complement current measures based on ferret antisera. We may also improve forecast accuracy by accounting for flu’s global migration patterns. Finally, we should make the forecasting problem itself easier by embracing efforts to reduce the lag between vaccine composition decisions and distribution to the public.

Cryptic transmission of novel coronavirus revealed by genomic epidemiology

2 Mar 2020 by Trevor Bedford

The field of genomic epidemiology focuses on using the genetic sequences of pathogens to understand patterns of transmission and spread. Viruses mutate very quickly and accumulate changes during the process of transmission from one infected individual to another. The novel coronavirus which is responsible for the emerging COVID-19 pandemic mutates at an average of about two mutations per month. After someone is exposed they will generally incubate the virus for ~5 days before symptoms develop and transmission occurs. Other research has shown that the “serial interval” of SARS-CoV-2 is ~7 days. You can think of a transmission chain as looking something like:

where, on average, we have 7 days from one infection to the next. As the virus transmits, it will mutate at this rate of two mutations per month. This means that, on average, every other step in the transmission chain will have a mutation, so the chain would look something like:

These mutations are generally really simple things. An ‘A’ might change to a ‘T’, or a ‘G’ to a ‘C’. This changes the genetic code of the virus, but it’s hard for a single letter change to do much to make the virus behave differently. However, with advances in technology, it’s become readily feasible to sequence the genome of the novel coronavirus. This works by taking a swab from someone’s nose and extracting the RNA in the sample and then determining the ‘letters’ of this RNA genome using chemistry and very powerful cameras. Each person’s coronavirus infection will yield a sequence of 30,000 ‘A’, ‘T’, ‘G’ or ‘C’ letters. We can use these sequences to reconstruct which infection is connected to which infection. As an example, if we sequenced three of these infections and found:

We could take the “genomes” ATTT, ATCT and GTCT and infer that the infection with sequence ATTT led to the infection with sequence ATCT and this infection led to the infection with sequence GTCT. This approach allows us to learn about epidemiology and transmission in a completely novel way and can supplement more traditional contact tracing and case-based reporting.
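The reasoning with these three toy genomes can be written out as a small sketch. It counts pairwise differences and picks the genome that sits between the other two. The function names are hypothetical, and this is an illustration of the idea rather than how Nextstrain builds trees. Note that distances alone give the order of the chain but not its direction; deciding which end is the ancestor needs extra information such as sampling dates.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Count the positions at which two equal-length genomes differ."""
    if len(a) != len(b):
        raise ValueError("genomes must have the same length")
    return sum(x != y for x, y in zip(a, b))


genomes = ["ATTT", "ATCT", "GTCT"]

# Number of single-letter differences between every pair of genomes.
pairwise = {
    (a, b): hamming_distance(a, b) for a, b in combinations(genomes, 2)
}


def total_distance(genome):
    """Sum of distances from one genome to all the others."""
    return sum(hamming_distance(genome, other) for other in genomes if other != genome)


# The genome closest to both others is the middle link of the chain.
middle = min(genomes, key=total_distance)
```

Here ATTT and GTCT differ at two positions, while each differs from ATCT at only one, so ATCT is the natural intermediate.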

For a few years now, we’ve been working on the Nextstrain software platform, which aims to make genomic epidemiology as rapid and as useful as possible. We had previously applied this to outbreaks like Ebola, Zika and seasonal flu. Owing to advances in technology and open data sharing, the genomes of 140 SARS-CoV-2 coronaviruses have been shared from all over the world via gisaid.org. As these genomes are shared, we download them from GISAID and incorporate them into a global map as quickly as possible and have an always up-to-date view of the genomic epidemiology of novel coronavirus at nextstrain.org/ncov.

The big picture looks like this at the moment:

where we can see the earliest infections in Wuhan, China in purple on the left side of the tree. All these genomes from Wuhan have a common ancestor in late Nov or early Dec, suggesting that this virus has emerged recently in the human population.

The first case in the USA was called “USA/WA1/2020”. This was from a traveler directly returning from Wuhan to Snohomish County on Jan 15, with a swab collected on Jan 19. This virus was rapidly sequenced by the US CDC Division of Viral Diseases and shared publicly on Jan 24 (huge props to the CDC for this). We can zoom into the tree to place WA1 among related viruses:

The virus has an identical genome to the virus Fujian/8/2020 sampled in Fujian on Jan 21, also labeled as a travel export from Wuhan, suggesting a close relationship between these two cases.

Last week the Seattle Flu Study started screening samples for COVID-19 as described here. Soon after starting screening we found a first positive in a sample from Snohomish County. The case was remarkable in that it was a “community case”, only the second recognized in the US, someone who had sought treatment for flu-like symptoms, been tested for flu and then sent home owing to mild disease. After this was diagnostically confirmed by Shoreline Public Health labs on Fri Feb 28 we were able to immediately get the sample USA/WA2/2020 on a sequencer and have a genome available on Sat Feb 29. The results were remarkable. The WA2 case was identical to WA1 except that it had three additional mutations.

This tree structure is consistent with WA2 being a direct descendant of WA1. If this virus arrived in Snohomish County in mid-January with the WA1 traveler from Wuhan and circulated locally for 5 weeks, we’d expect exactly this pattern, where the WA2 genome is a copy of the WA1 genome except it has some mutations that have arisen over the 5 weeks that separate them.
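As a rough check on that expectation, the rate quoted earlier (about two mutations per month) can be converted into an expected count per transmission step and per 5 weeks. This is a back-of-envelope sketch: the 30-day month and the Poisson model for mutation counts are assumptions made here for illustration, and the function names are hypothetical.

```python
import math

MUTATIONS_PER_MONTH = 2.0    # average rate quoted in the post
DAYS_PER_MONTH = 30.0        # assumption used to convert the rate to days
SERIAL_INTERVAL_DAYS = 7.0   # serial interval quoted in the post


def expected_mutations(days):
    """Expected number of mutations accumulated over a span of days."""
    return MUTATIONS_PER_MONTH * days / DAYS_PER_MONTH


def poisson_pmf(k, mean):
    """Probability of exactly k events under a Poisson model (an assumption)."""
    return math.exp(-mean) * mean**k / math.factorial(k)


# Roughly 0.47 mutations per transmission step, i.e. about every other step.
per_step = expected_mutations(SERIAL_INTERVAL_DAYS)

# Roughly 2.3 mutations expected over 5 weeks of local circulation.
five_weeks = expected_mutations(5 * 7)

# Probability of seeing exactly three mutations in that time under the model.
p_three = poisson_pmf(3, five_weeks)
```

Under these assumptions, the three additional mutations seen in WA2 are close to the expected count of about 2.3, so the observation fits 5 weeks of circulation.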

Again, this tree structure is consistent with a transmission chain leading from WA1 to WA2, but we wanted to assess the probability of this pattern arising by chance instead of direct transmission. Scientists often try to approach this situation by thinking of a “null model”, i.e. if it was a coincidence, how likely a coincidence was it? Here, WA1 and WA2 share the same genetic variant at site 18060 in the virus genome, but only 2/59 sequenced viruses from China possess this variant. Given this low frequency, we’d expect the probability of WA2 randomly having the same genetic variant to be 2/59 ≈ 3%. To me, this is not quite conclusive evidence, but still strong evidence that WA2 is a direct descendant of WA1.
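The null-model arithmetic is short enough to write out. A minimal sketch using only the counts quoted above; the function name is hypothetical, and treating the 59 sequenced viruses from China as a fair sample of possible sources is the simplifying assumption behind the number.

```python
from fractions import Fraction


def variant_frequency(carriers, total):
    """Fraction of sequenced viruses that carry a given variant."""
    if total <= 0 or not 0 <= carriers <= total:
        raise ValueError("need 0 <= carriers <= total and total > 0")
    return Fraction(carriers, total)


# 2 of the 59 sequenced viruses from China carry the variant at site 18060.
frequency = variant_frequency(2, 59)

# Chance that an unrelated introduction would carry the variant by coincidence.
chance_match = float(frequency)  # a little over 3%
```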

Additional evidence for the relationship between these cases comes from location. The Seattle Flu Study had screened viruses from all over the greater Seattle area; however, we got the positive hit in Snohomish County with cases less than 15 miles apart. This by itself would only be suggestive, but combined with the genetic data, is firmer evidence for continued transmission.

I’ve been referring to this scenario as “cryptic transmission”. This is a technical term meaning “undetected transmission”. Our best guess of a scenario looks something like:

We believe this may have occurred by the WA1 case having exposed someone else to the virus in the period between Jan 15 and Jan 19 before they were isolated. If this second case was mild or asymptomatic, contact tracing efforts by public health would have had difficulty detecting it. After this point, community spread occurred and was undetected due to the CDC’s narrow case definition that required direct travel to China or direct contact with a known case to even be considered for testing. This lack of testing was a critical error and allowed an outbreak in Snohomish County and surroundings to grow to a sizable problem before it was even detected.

Knowing that transmission was initiated on Jan 15 allows us to estimate the total number of infections that exist in this cluster today. Our preliminary analysis puts this at 570 with a 90% uncertainty interval of between 80 and 1500 infections.
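To see why the seeding date matters so much, here is a back-of-envelope sketch of simple exponential growth from a single infection on Jan 15. This is not the preliminary analysis behind the 570 figure: the doubling times below are hypothetical values chosen for illustration, and the model ignores randomness in early transmission. The end date is the date of this post.

```python
from datetime import date


def infections_from_doubling(days_elapsed, doubling_time_days, initial=1.0):
    """Infections after days_elapsed under steady exponential growth."""
    if doubling_time_days <= 0:
        raise ValueError("doubling time must be positive")
    return initial * 2.0 ** (days_elapsed / doubling_time_days)


seeding = date(2020, 1, 15)   # arrival of the WA1 traveler, from the post
today = date(2020, 3, 2)      # date of this post
elapsed = (today - seeding).days

# Hypothetical doubling times (in days); not estimates from the post.
scenarios = {
    doubling_time: infections_from_doubling(elapsed, doubling_time)
    for doubling_time in (4.5, 5.0, 6.0, 7.0)
}

for doubling_time, infections in scenarios.items():
    print(f"doubling time {doubling_time} days: ~{infections:.0f} infections")
```

Small changes in the assumed doubling time move the answer from roughly a hundred to more than a thousand infections, which is one way to read the wide uncertainty interval quoted above.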

Back on Feb 8, I tweeted this thought experiment:

We know that Wuhan went from an index case in ~Nov-Dec 2019 to several thousand cases by mid-Jan 2020, thus going from initial seeding event to widespread local transmission in the span of ~9-10 weeks. We now believe that the Seattle area seeding event was ~Jan 15 and we’re now ~7 weeks later. I expect Seattle now to look like Wuhan around ~1 Jan, when they were reporting the first clusters of patients with unexplained viral pneumonia. We are currently estimating ~600 infections in Seattle; this matches my phylodynamic estimate of the number of infections in Wuhan on Jan 1. Three weeks later, Wuhan had thousands of infections and was put on large-scale lock-down. However, these large-scale non-pharmaceutical interventions to create social distancing had a huge impact on the resulting epidemic. China averted many millions of infections through these intervention measures and cases there have declined substantially.

This suggests that this is controllable. We’re at a critical juncture right now, but we can still mitigate this substantially.

Some ways to implement non-pharmaceutical interventions include:

Practicing social distancing, such as limiting attendance at events with large groups of people
Working from home, if your job and employer allows it
Staying home if you are feeling ill
Taking your temperature daily; if you develop a fever, self-isolate and call your doctor
Implementing good hand washing practices - it is extremely important to wash hands regularly
Covering coughs and sneezes in your elbow or tissue
Avoiding touching your eyes, nose, and mouth with unwashed hands
Disinfecting frequently touched surfaces, such as doorknobs
Beginning some preparations in anticipation of social distancing or supply chain shortages, such as ensuring you have sufficient supplies of prescription medicines and ensuring you have about a 2 week supply of food and other necessary household goods.

With these preparations in mind, it is important not to panic buy. Panic buying unnecessarily increases strain on supply chains and can make it difficult to ensure that everyone is able to get the supplies that they need.

For more information please see:

Early warnings of novel coronavirus from genomic epidemiology and the global open scientific response

31 Jan 2020 by Trevor Bedford

I started following what’s now referred to as “novel coronavirus (nCoV)” on Jan 6 when I started to notice reports of a cluster of viral pneumonia of unknown origin in Wuhan, China. Just 4 days later on Jan 10, a first genome was released on Virological.org only to be followed by five more the following day via GISAID.org. From very early on, it was clear that the nCoV genomes lacked the expected genetic diversity that would occur with repeated zoonotic events from a diverse animal reservoir. The most parsimonious explanation for this observation was that there was a single zoonotic spillover event into the human population in Wuhan between mid-Nov and mid-Dec and sustained human-to-human transmission from this point. However, at first I struggled to reconcile this lack of genetic diversity with WHO reports of “limited human-to-human” transmission. The conclusion of sustained human-to-human spread became difficult to ignore on Jan 17 when nCoV genomes from the two Thai travel cases that reported no market exposure showed the same limited genetic diversity. This genomic data represented one of the first and strongest indications of sustained epidemic spread. As this became clear to me, I spent the week of Jan 20 alerting every public health official I know.

At this moment there are 54 publicly shared viral genomes, with genomes being shared by public health and academic groups all over the world 3-6 days after sample collection. I can’t overstate how remarkable this is and what an inflection point it is for the field of genomic epidemiology. Seasonal influenza had been far ahead of the general curve, but there we were still generally seeing a ~1 month turnaround from sample collection to genome in the best of circumstances. Getting to a 3-6 day turnaround opens up huge new avenues in epidemiology.

Since the first nCoV genome was shared on Jan 10, we’ve been tracking viral transmission and evolution on nextstrain.org/ncov aiming to have ~1hr turnarounds from public deposition of genome data to inclusion in the live transmission tracking. We are also producing public situation reports describing what can be concluded from current genomic data. These reports have now been generously translated into 5 other languages by volunteers from Twitter. With groups all over the world working tirelessly to generate genomic data as rapidly as possible, I’m feeling a moral obligation to not hold up the analysis side. The entire Nextstrain team (shoutouts to Richard Neher, Emma Hodcroft, James Hadfield, Kairsten Fay, Thomas Sibley, Misja Ilcisin and Jover Lee 🙌) have come together to conduct analyses and tailor the platform for nCoV response. There’s also been a remarkable amount of sharing of pre-publication analyses on Virological.org and bioRxiv and scientific communication on Twitter. Although the situation is looking a bit dire at the moment, it’s been humbling to see scientists from all over the world break down traditional barriers to rapid scientific progress.
