Data integration, real-time pipelines and visualization strategies
Trevor Bedford (@trvrb)
4 Oct 2017
ARTIC Network Meeting
University of Edinburgh
Studies of Ebola phylogenetics published alongside outbreak
Carroll et al. 2015. Nature., Park et al. 2015. Cell.,
Arias et al. 2016. Virus Evol., Quick et al. 2016. Nature.
Difficult to extract a comprehensive picture from these studies
Comprehensive analysis published in Apr 2017 (on bioRxiv Sep 2016)
There have also been publications on the ongoing Zika epidemic
Metsky et al. 2017. Nature., Grubaugh et al. 2017. Nature.,
Faria et al. 2017. Nature.
No comprehensive Zika phylogeny currently exists
in the literature (or on bioRxiv)
- Trees in these papers used 174, 104 and 200 genomes, respectively
- There are now 542 genomes in GenBank
Even if these genomes did not change the story told by these papers,
they could tighten credible intervals and reveal connections that were
not available to the original papers
Outline
- What we're trying to do
- Brief overview of approach
- Design choices we've made
Key challenges
- Timely analysis and sharing of results are critical
- Dissemination must be scalable
- Integrate many data sources
- Results must be easily interpretable and queryable
RethinkDB database of virus and titer data
- Harmonizes data from different sources
- Integrates different types of data (serology, sequences, case details)
- Provides an interface for downstream analysis
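A minimal sketch of what harmonizing records from different sources could look like. The source names, field names and date formats below are hypothetical and chosen only for illustration; they are not the actual database schema.

```python
from datetime import datetime

# Hypothetical per-source field names mapped onto one shared schema.
FIELD_MAPS = {
    "genbank": {"strain": "strain", "collection_date": "date", "country": "country"},
    "lab_sheet": {"sample_id": "strain", "sampled": "date", "location": "country"},
}

# Date formats we assume the sources might use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(raw):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def harmonize(record, source):
    """Map one source-specific record onto the shared schema."""
    mapping = FIELD_MAPS[source]
    out = {target: record.get(field) for field, target in mapping.items()}
    if out.get("date") is not None:
        out["date"] = normalize_date(out["date"])
    if out.get("country") is not None:
        # Use one spelling convention for place names across sources.
        out["country"] = out["country"].strip().lower().replace(" ", "_")
    out["source"] = source
    return out


example = harmonize(
    {"sample_id": "EBOV_1", "sampled": "04/10/2015", "location": "Sierra Leone"},
    "lab_sheet",
)
```

Keeping the per-source mapping in data rather than in code means a new source only needs a new entry in `FIELD_MAPS`.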
Build scripts to align sequences, build trees and annotate
- Flexible build scripts to incorporate different viruses and analyses
- Constructs time-resolved phylogenies
- Annotates with geographic transitions and mutation events
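A minimal sketch of the geographic-transition annotation, assuming ancestral locations have already been inferred for every node. The nested-dict tree layout and the `location` / `children` keys are assumptions for illustration, not the pipeline's actual data format.

```python
def geographic_transitions(node, parent_location=None):
    """Yield (from, to, node_name) wherever a node's location differs from its parent's."""
    location = node["location"]
    if parent_location is not None and location != parent_location:
        yield (parent_location, location, node["name"])
    for child in node.get("children", []):
        yield from geographic_transitions(child, location)


# Toy tree with invented names and locations.
tree = {
    "name": "root", "location": "Guinea", "children": [
        {"name": "A", "location": "Guinea"},
        {"name": "N1", "location": "Sierra Leone", "children": [
            {"name": "B", "location": "Sierra Leone"},
            {"name": "C", "location": "Liberia"},
        ]},
    ],
}

transitions = list(geographic_transitions(tree))
```

Each transition is attached to the branch leading into the named node, which is what lets a map view and a tree view refer to the same event.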
Example augur pipeline for 1600 Ebola genomes
- Align with MAFFT (34 min)
- Build ML tree with RAxML (54 min)
- Temporally resolve tree and geographic ancestry with TreeTime (16 min)
- Total pipeline (1 hr 46 min)
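A quick check of the timings above. The three listed steps sum to 104 min, 2 min short of the reported total; the assumption here is that the remainder covers steps not itemized on the slide.

```python
# Step timings in minutes, as listed above.
step_minutes = {"MAFFT alignment": 34, "RAxML tree": 54, "TreeTime": 16}

listed_total = sum(step_minutes.values())
reported_total = 1 * 60 + 46  # "1 hr 46 min"

# Time not accounted for by the listed steps (assumed: unlisted steps).
unlisted = reported_total - listed_total

# Fraction of the reported total spent in each listed step.
shares = {step: minutes / reported_total for step, minutes in step_minutes.items()}
slowest = max(step_minutes, key=step_minutes.get)
```

Tree building is the largest single step, so it is the first place to look when the genome count grows.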
Web visualization of resulting trees
- Interactive data exploration and filtering
- Built on React and D3
- Connects phylogeny, geography and genotypes
Analysis targets
- Phylogeny
- Mutations present
- Geographic transitions
- Root-to-tip plot
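A minimal sketch of the regression behind a root-to-tip plot: ordinary least squares of root-to-tip divergence against sampling date, where the slope estimates the evolutionary rate and the x-intercept estimates the date of the root. The input values are invented for illustration, and this plain regression is a simplification, not necessarily the method used in the pipeline.

```python
def root_to_tip_regression(dates, divergences):
    """Least-squares fit of divergence on sampling date.

    Returns (rate, intercept, root_date); root_date is None if the rate is zero.
    """
    n = len(dates)
    if n < 2 or n != len(divergences):
        raise ValueError("need at least two (date, divergence) pairs")
    mean_x = sum(dates) / n
    mean_y = sum(divergences) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("all samples share the same date")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, divergences))
    rate = sxy / sxx                      # substitutions per site per year
    intercept = mean_y - rate * mean_x
    root_date = -intercept / rate if rate != 0 else None
    return rate, intercept, root_date


# Invented data: decimal sampling dates and divergence in subs/site.
rate, intercept, root_date = root_to_tip_regression(
    [2014.0, 2014.5, 2015.0],
    [0.0010, 0.0015, 0.0020],
)
```

Tips that fall far from the fitted line are candidates for mislabelled dates or sequencing problems, which is why this plot is a useful quality check.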
Example from Faria et al.
Example from Metsky et al.
Metsky et al. 2017. Nature.
Viz strategies
Core of visualization is the ability to make comparisons
Color to link attributes across panels
Transitions to maintain object constancy
Filtering time, space and other attributes
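A minimal sketch of filtering by time and space, assuming each tip carries an ISO-format `date` and a `country` field. Both field names and the sample records are hypothetical; the actual visualization runs in the browser rather than in Python.

```python
def filter_tips(tips, date_range=None, countries=None):
    """Return tips that pass every active filter; a filter left as None is ignored."""
    selected = []
    for tip in tips:
        if date_range is not None:
            start, end = date_range
            # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
            if not (start <= tip["date"] <= end):
                continue
        if countries is not None and tip["country"] not in countries:
            continue
        selected.append(tip)
    return selected


# Invented sample records.
tips = [
    {"name": "A", "date": "2015-11-02", "country": "brazil"},
    {"name": "B", "date": "2016-03-15", "country": "colombia"},
    {"name": "C", "date": "2016-07-01", "country": "brazil"},
]

brazil_2016 = filter_tips(tips, date_range=("2016-01-01", "2016-12-31"), countries={"brazil"})
```

Applying the same selection to every panel is what keeps tree, map and genotype views consistent with each other.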
Viz challenges
- Recombination / reassortment
- Combining fragments with full genomes
- Combining metadata of varying resolution
- Conveying uncertainty
Importance of curation by domain experts
Public dissemination vs on-site investigation
Acknowledgements
Nextstrain software development: Richard Neher, James Hadfield, Colin Megill, Sidney Bell,
Charlton Callender, Barney Potter, John Huddleston
Advice / support: Andrew Rambaut, Nick Loman, Ian Goodfellow, Matt Cotten, Paul Kellam,
Kristian Andersen, Nathan Grubaugh, Pardis Sabeti