Pipeline components for real-time phylodynamic analysis

Augur and snakemake

The output of one augur command serves as input to the next and hence these commands need to be executed in order. This is a common pattern in bioinformatics and multiple tools -- so called workflow managers -- exist to facilitate this. Within nextstrain, we have relied on Snakemake.

Snakemake breaks a workflow into a set of rules that are specified in a file called Snakefile. Each rule takes a number of input files, specifies a few parameters, and produces output files. A simple rule would look like this: python rule filter: input: sequences = "data/sequences.fasta", metadata = "data/metadata.tsv" output: sequences = "results/filtered.fasta" params: min_date = 2012 shell: """ augur filter \ --sequences {input.sequences} \ --metadata {input.metadata} \ --output {output.sequences} \ --min-date {params.min_date} """ This rule would produce results/filtered.fasta from the input files data/sequences.fasta and data/metadata.tsv using the augur filter command. Note that we explicitly specify what is an input and what is an output file. To filter our data, we would now call snakemake as bash snakemake results/filtered.fasta and snakemake will run the same command as specified above.

So far, this is just a complicated reformulation of what we did above, but the benefit of workflow management becomes clear once we add more steps. The next natural step in our phylogenetic pipeline is aligning the filtered sequences and we define a rule align. bash rule align: input: sequences = rules.filter.output.sequences, reference = "config/zika_outgroup.gb" output: alignment = "results/aligned.fasta" shell: """ augur align \ --sequences {input.sequences} \ --reference-sequence {input.reference} \ --output {output.alignment} """ If you now want to generate the alignment, you can type bash snakemake results/aligned.fasta and snakemake will

  • determine that results/aligned.fasta is an output of rule align
  • check whether all required input files are in place and run the necessary rules if not
  • run rule align and check whether the file results/aligned.fasta appeared.

If you supply a reference sequence to augur align, augur will include that reference sequence in the alignment and strip all insertions relative to that reference. This will guarantee a consistent reference coordinate system and genome annotation later on.