Nextstrain build for novel coronavirus SARS-CoV-2

Compiled Nextstrain SARS-CoV-2 resources are available. Follow @nextstrain for updates.

This phylogeny shows evolutionary relationships of SARS-CoV-2 viruses from the ongoing COVID-19 pandemic. Although the genetic relationships among sampled viruses are generally quite clear, there is considerable uncertainty in estimates of specific transmission dates and in the reconstruction of geographic spread. Please be aware that specific inferred geographic transmission patterns and temporal estimates are only hypotheses.

There are millions of complete SARS-CoV-2 genomes available in open databases, and this number increases every day. For performance and legibility reasons, this visualization can only handle ~4000 genomes in a single view. Because of this, we subsample the available genome data for our analysis views. We provide multiple views that focus subsampling on different geographic regions and different time periods. These views are available through the “Dataset” dropdown on the left or by clicking on the following links:
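To illustrate how a large collection can be cut down to a viewable size, here is a minimal sketch of grouped subsampling. It is not the actual Nextstrain procedure: the record keys (`strain`, `region`, `date`), the round-robin strategy, and the function name `subsample` are all assumptions made for illustration. Only the ~4000 cap comes from the text above.

```python
import random
from collections import defaultdict


def subsample(records, max_total=4000, seed=0):
    """Pick at most max_total records, spread evenly across (region, month) groups.

    records: iterable of dicts with the hypothetical keys
    "strain", "region" and "date" (formatted YYYY-MM-DD).
    """
    rng = random.Random(seed)

    # Group records by region and calendar month (first 7 characters of the date).
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["region"], rec["date"][:7])].append(rec)

    # Shuffle within each group so the picks are random.
    for members in groups.values():
        rng.shuffle(members)

    picked = []
    # Round-robin: take one record from each group in turn until the cap
    # is reached or every group is exhausted.
    while len(picked) < max_total and any(groups.values()):
        for key in sorted(groups):
            if groups[key] and len(picked) < max_total:
                picked.append(groups[key].pop())
    return picked
```

Grouping before sampling keeps sparsely sampled regions or months from being swamped by heavily sampled ones, which is the general motivation for offering region-focused and time-focused views.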

| Region | past 1 month | past 2 months | past 6 months | all time |
| --- | --- | --- | --- | --- |
| global | global/1m | global/2m | global/6m | global/all-time |
| Africa | africa/1m | africa/2m | africa/6m | africa/all-time |
| Asia | asia/1m | asia/2m | asia/6m | asia/all-time |
| Europe | europe/1m | europe/2m | europe/6m | europe/all-time |
| North America | north-america/1m | north-america/2m | north-america/6m | north-america/all-time |
| Oceania | oceania/1m | oceania/2m | oceania/6m | oceania/all-time |
| South America | south-america/1m | south-america/2m | south-america/6m | south-america/all-time |
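The dataset names in the table above follow a regular `region/timespan` pattern. As a small sketch, they can be generated programmatically. The names `REGIONS`, `TIMESPANS`, and `dataset_names` are illustrative, and no base URL is assumed.

```python
# Region and timespan identifiers as they appear in the table above.
REGIONS = [
    "global", "africa", "asia", "europe",
    "north-america", "oceania", "south-america",
]
TIMESPANS = ["1m", "2m", "6m", "all-time"]


def dataset_names(regions=REGIONS, timespans=TIMESPANS):
    """Return every region/timespan combination, e.g. 'europe/6m'."""
    return [f"{region}/{timespan}" for region in regions for timespan in timespans]


names = dataset_names()
print(len(names))   # 28
print(names[:4])    # ['global/1m', 'global/2m', 'global/6m', 'global/all-time']
```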

Site numbering and genome structure use Wuhan-Hu-1/2019 as reference. The phylogeny is rooted relative to early samples from Wuhan. Temporal resolution assumes a nucleotide substitution rate of 8 × 10^-4 substitutions per site per year. Mutational fitness is calculated using results from Obermeyer et al. (under review). Full details on bioinformatic processing can be found here.
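To give a sense of what the assumed rate means in practice, the following sketch converts it to expected substitutions per genome. The rate is taken from the text. The genome length of 29,903 nucleotides for the Wuhan-Hu-1 reference is an assumption added here, and the function name is illustrative.

```python
RATE_PER_SITE_PER_YEAR = 8e-4  # substitution rate stated in the text
GENOME_LENGTH = 29_903         # assumption: length of the Wuhan-Hu-1 reference in nucleotides


def expected_substitutions(years, rate=RATE_PER_SITE_PER_YEAR, length=GENOME_LENGTH):
    """Expected number of substitutions across the whole genome over `years`."""
    return rate * length * years


per_year = expected_substitutions(1.0)
per_month = expected_substitutions(1.0 / 12)
print(f"{per_year:.1f} substitutions per genome per year")    # 23.9
print(f"{per_month:.1f} substitutions per genome per month")  # 2.0
```

At roughly two substitutions per month, viruses sampled only days or weeks apart may be genetically identical, which is one reason the inferred transmission dates carry considerable uncertainty.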

The analysis on this page uses data from NCBI GenBank as a source following Open Data principles, such that we can make input data and intermediate files available for further analysis. Open Data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike. But be aware that not all regions are well represented in open databases and some of the above trees might lack recent data from particular geographic regions.

We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work in open databases. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances. Please try to avoid scooping someone else’s work. Reach out if uncertain. An attribution table is available by clicking on “Download Data” at the bottom of the page and then clicking on “Strain Metadata” in the resulting dialog box.

To maximize the utility and visibility of these generously shared data, we provide preprocessed files that can serve as a starting point for additional analyses.

All sequences and metadata

Now also available with zstd compression allowing much faster decompression:

Subsampled sequences and intermediate files

The files below exist for every region (global, africa, asia, europe, north-america, oceania and south-america) and correspond to each region’s 6-month timespan build (e.g. global/6m, africa/6m, asia/6m). Files for the 2m and all-time builds (e.g. global/2m, global/all-time) are not yet available. The links below refer to the ${BUILD_PART_0} region; replace ${BUILD_PART_0} with another region name in the links if desired.