Real-time tracking of influenza evolution

Augur

Note: As of Sep 2017, these processing scripts are deprecated in favor of nextstrain/augur. Current Nextflu builds run off the flu build detailed here. The code in this directory is kept in place for archival reasons.

Augur is the processing pipeline to track flu evolution. It currently

  • imports public sequence data
  • subsamples, cleans and aligns sequences
  • builds a phylogenetic tree from this data
  • reports statistics about mutations and branching patterns of the tree
  • infers mutation frequency trajectories through time
  • infers antigenic phenotypes from titer data

Pipeline

The entire pipeline is run with process.py.

Sequence download, cleaning and alignment

Download

Virus sequence data is manually downloaded from the GISAID EpiFlu database. Data from GISAID may not be disclosed outside the GISAID community. Accordingly, no raw GISAID data has been released publicly as part of this project. The current pipeline is designed to work specifically for HA from influenza H3N2. Save GISAID sequences as data/gisaid_epiflu_sequence.fasta.

Filter

Keeps viruses with fully specified dates, cell passage and only one sequence per strain name. Subsamples to 50 (by default) sequences per month for the last 3 (by default) years before present. Appends geographic metadata. Subsampling prefers longer sequences over shorter sequences and prefers more geographic diversity over less geographic diversity.
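The subsampling preferences can be illustrated with a minimal sketch. It is not the pipeline's actual code: the function name, the dictionary keys ('strain', 'date', 'region', 'seq') and the round-robin strategy are assumptions made for illustration, and the passage filter and the time window are left out.

```python
import re
from collections import defaultdict

# Hypothetical date format: YYYY-MM-DD with no ambiguous fields such as "XX".
FULL_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def filter_and_subsample(viruses, per_month=50):
    """Return at most per_month viruses for each calendar month.

    viruses is a list of dicts with 'strain', 'date', 'region' and 'seq'
    keys (an assumed layout, not the pipeline's own data structure).
    """
    seen = set()
    by_month = defaultdict(list)
    for v in viruses:
        if not FULL_DATE.match(v["date"]):
            continue  # drop incompletely specified dates
        if v["strain"] in seen:
            continue  # keep only the first sequence per strain name
        seen.add(v["strain"])
        by_month[v["date"][:7]].append(v)

    selected = []
    for month in sorted(by_month):
        # Group by region and put the longest sequences first in each group.
        by_region = defaultdict(list)
        for v in by_month[month]:
            by_region[v["region"]].append(v)
        for group in by_region.values():
            group.sort(key=lambda v: len(v["seq"]), reverse=True)

        # Round-robin over regions so that no single region dominates.
        regions = sorted(by_region)
        picked = []
        while len(picked) < per_month and any(by_region[r] for r in regions):
            for r in regions:
                if by_region[r] and len(picked) < per_month:
                    picked.append(by_region[r].pop(0))
        selected.extend(picked)
    return selected
```

Taking one virus per region in turn is only one way to favor geographic diversity; the real pipeline may weigh regions differently.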

Align

Aligns sequences with mafft. In testing, mafft showed a much lower memory footprint than muscle.
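A minimal sketch of calling mafft from Python, assuming the mafft executable is on the PATH. The helper names and the choice of the --auto and --thread options are illustrative assumptions, not necessarily what the pipeline itself passes.

```python
import subprocess

def mafft_command(fasta_path, threads=1):
    """Build a mafft command line (option choice is an assumption)."""
    return ["mafft", "--auto", "--thread", str(threads), fasta_path]

def align(fasta_path, out_path, threads=1):
    """Run mafft and save the alignment, which mafft writes to stdout."""
    with open(out_path, "w") as out:
        subprocess.run(mafft_command(fasta_path, threads), stdout=out, check=True)
```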

Clean

Cleans up the alignment so that the reference frame is kept intact. Removes sequences that don't conform to a rough molecular clock, and removes known reassortant sequences and other outliers.
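One way to picture the molecular clock filter is a regression of genetic distance on sampling date, dropping sequences that fall far from the fitted line. The sketch below rests on assumptions: the Hamming distance to a reference sequence, the least-squares fit, the function names and the max_residual threshold are all illustrative and may differ from what the pipeline does.

```python
def hamming(a, b):
    """Number of mismatched positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def clock_filter(records, reference, max_residual=5.0):
    """Return names of sequences consistent with a rough molecular clock.

    records is a list of (name, decimal_year, aligned_sequence) tuples;
    this layout and the threshold are hypothetical.
    """
    xs = [year for _, year, _ in records]
    ys = [hamming(seq, reference) for _, _, seq in records]
    n = len(records)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0  # substitutions per year
    intercept = mean_y - slope * mean_x

    kept = []
    for (name, year, _), dist in zip(records, ys):
        expected = intercept + slope * year
        if abs(dist - expected) <= max_residual:
            kept.append(name)
    return kept
```

Because outliers also pull on a least-squares fit, a more robust approach would refit after removing the worst offenders; the sketch does a single pass for brevity.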

Tree processing

Infer

Uses FastTree to get a starting tree, and then refines this tree with RAxML.
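The two-stage inference can be sketched as a pair of command builders. The executable names (which vary between installations), the GTR-based model choices and the helper names are assumptions for illustration; only the general idea of passing the FastTree result to RAxML as a starting tree comes from the description above.

```python
import subprocess

def fasttree_command(alignment):
    """FastTree on nucleotide data with a GTR model; the tree goes to stdout."""
    return ["FastTree", "-nt", "-gtr", alignment]

def raxml_command(alignment, start_tree, run_name, seed=1):
    """RAxML hill-climbing search (-f d) started from an existing tree (-t)."""
    return ["raxmlHPC", "-f", "d", "-m", "GTRCAT",
            "-s", alignment, "-t", start_tree,
            "-n", run_name, "-p", str(seed)]

def infer_tree(alignment, start_tree_path, run_name):
    """Run both stages; requires both programs to be installed."""
    with open(start_tree_path, "w") as out:
        subprocess.run(fasttree_command(alignment), stdout=out, check=True)
    subprocess.run(raxml_command(alignment, start_tree_path, run_name), check=True)
```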

Refine

Reroots the tree based on the outgroup strain, collapses nodes with zero-length branches, ladderizes the tree and collects strain metadata.
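Collapsing and ladderizing can be shown on a toy tree structure. The Node class and function names below are hypothetical stand-ins for whatever tree library the pipeline uses, and rerooting and metadata collection are left out.

```python
class Node:
    """Minimal tree node (a hypothetical stand-in for a real tree library)."""
    def __init__(self, name=None, branch_length=0.0, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = children or []

def collapse_zero_length(node, eps=1e-9):
    """Replace internal children on zero-length branches by their own children."""
    new_children = []
    for child in node.children:
        collapse_zero_length(child, eps)
        if child.children and child.branch_length <= eps:
            new_children.extend(child.children)  # promote the grandchildren
        else:
            new_children.append(child)
    node.children = new_children

def count_tips(node):
    """Number of leaves below (or at) this node."""
    if not node.children:
        return 1
    return sum(count_tips(c) for c in node.children)

def ladderize(node):
    """Order children so that smaller clades come first at every node."""
    for child in node.children:
        ladderize(child)
    node.children.sort(key=count_tips)
```

Collapsing turns an unsupported bifurcation into a polytomy, and ladderizing only changes the drawing order, not the topology.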

Frequency estimation

Estimates genotype and clade frequency trajectories using a Bernoulli observation model combined with a genetic drift model of process noise.
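To make the combination of the two models concrete, the sketch below scores a candidate frequency trajectory. It is an illustration under assumptions, not the pipeline's estimator: the piecewise-linear trajectory on pivot points, the Gaussian drift penalty with variance proportional to x(1-x)dt, the stiffness parameter and all names are hypothetical. A real estimator would maximize this score over the pivot frequencies.

```python
import math

def interpolate(pivots, freqs, t):
    """Piecewise-linear frequency at time t, held flat outside the pivots."""
    if t <= pivots[0]:
        return freqs[0]
    if t >= pivots[-1]:
        return freqs[-1]
    for i in range(1, len(pivots)):
        if t <= pivots[i]:
            w = (t - pivots[i - 1]) / (pivots[i] - pivots[i - 1])
            return freqs[i - 1] + w * (freqs[i] - freqs[i - 1])

def log_posterior(freqs, pivots, observations, stiffness=10.0, eps=1e-6):
    """Bernoulli log-likelihood plus a drift penalty on frequency changes.

    observations is a list of (time, present) pairs with present in {0, 1},
    one pair per sampled virus.
    """
    score = 0.0
    # Observation model: each virus carries the genotype with probability x(t).
    for t, present in observations:
        x = min(max(interpolate(pivots, freqs, t), eps), 1 - eps)
        score += math.log(x) if present else math.log(1 - x)
    # Process noise: changes between pivots are penalized as genetic drift,
    # with variance proportional to x * (1 - x) * dt (assumed form).
    for i in range(1, len(freqs)):
        dt = pivots[i] - pivots[i - 1]
        x_prev = min(max(freqs[i - 1], eps), 1 - eps)
        variance = x_prev * (1 - x_prev) * dt / stiffness
        dx = freqs[i] - freqs[i - 1]
        score -= dx * dx / (2 * variance)
    return score
```

A larger stiffness makes the trajectory smoother by penalizing jumps more strongly, while more observations let the data override that penalty.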

Streamline

Prepares data files for auspice visualization and removes cruft from them.