Nextstrain build for novel coronavirus SARS-CoV-2
Source code
Contributors
Latest commits
Pages

Overview of remote nCoV files (intermediate build assets)

This page provides an overview of intermediate files which Nextstrain produces. Where appropriate, these files can be starting points for the ncov pipeline (discussed below).

We have two GitHub repositories which routinely upload files to S3 buckets: ncov-ingest and ncov. Each of those runs separate pipelines for GISAID and GenBank (aka “open”) data sources; these pipelines start with data curation and QC steps and end with the phylogenetic analyses you can see on nextstrain.org

The GISAID data is stored at s3://nextstrain-ncov-private and is not publicly available, in line with the GISAID Terms of Use (this is used internally by Nextstrain).

The open (GenBank) data is publicly available at three endpoints:

  • https://data.nextstrain.org/files/ncov/open/
  • s3://nextstrain-data/files/ncov/open/
  • gs://nextstrain-data/files/ncov/open/ (mirrored daily from S3 by the Broad Institute)

Our intention is to make GenBank intermediate files open and available for everyone to use, and to keep these files up-to-date. The paths for specific files are the same under each endpoint, e.g. https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz, s3://nextstrain-data/files/ncov/open/metadata.tsv.gz, and gs://nextstrain-data/files/ncov/open/metadata.tsv.gz all exist. See below for a list of files that exist. If you’re running workflows on AWS or GCP compute that fetch this data, please use the S3 or GS URLs, respectively, for cheaper (for us) and faster (for you) data transfers. Otherwise, please use the https://data.nextstrain.org URLs.

Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter. In the rare case you need to access the S3 or GS buckets anonymously, the easiest way is to configure your inputs using https://nextstrain-data.s3.amazonaws.com/files/ncov/open/ or https://storage.googleapis.com/nextstrain-data/files/ncov/open/ URLs instead.

Depending on your execution environment, you may need to install additional Python dependencies for specific support of the different URL schemes (https, s3, gs). The workflow will produce an error at the start if additional dependencies are needed to fetch your configured inputs. Both https and s3 should work out of the box in the standard Nextstrain Conda and Docker execution environments.

All available genomes and metadata

Entire metadata & sequences data is uploaded from the ncov-ingest workflows for each of the gisaid and open sources:

  • metadata.tsv.gz
  • sequences.fasta.xz
  • nextclade.tsv.gz
  • additional_info.tsv.gz (GISAID only)
  • flagged_metadata.txt.gz (GISAID only)

The ncov repository contains two Nextstrain-only workflows which use the above data to generate our analyses visible on nextstrain.org. A side-effect of this is the creation and upload of processed versions of the entire dataset:

  • aligned.fasta.xz alignment via nextalign. The default reference genome is MN908947 (Wuhan-Hu-1).
  • mutation-summary.tsv.xz A summary of the data in aligned.fasta.xz.
  • masked.fasta.xz Masked alignment
  • filtered.fasta.xz The masked alignment excluding data with incomplete / invalid dates, unexpected genome lengths, missing metadata etc. We also maintain a list of sequences to exclude which are removed at this step. These sequences represent duplicates, outliers in terms of divergence or sequences with faulty metadata.

Subsampled datasets

Our GISAID and GenBank (open) profiles each define 7 builds (a Global build and one build per region: Africa, Asia, Europe, Oceania, North and South America). Each of these is a different subsample of the entire dataset, and each will result in the following intermediates uploaded:

  • {build_name}/sequences.fasta.xz
  • {build_name}/metadata.tsv.xz
  • {build_name}/aligned.fasta.xz
  • {build_name}/{build_name}.json (the main Auspice dataset file)
  • {build_name}/{build_name}_tip-frequencies.json
  • {build_name}/{build_name}_root-sequence.json

Summary of available GenBank (open) files

Each regional build (global, africa, asia, europe, north-america, oceania and south-america) contains a subsampled set of approximately 4000 sequences. They are a good starting point if you are seeking a representative sample of data. Where available, this table also provides the URL for the resulting Auspice visualisation of the data.

Please note that these files are uploaded in two batches (see above for details). This means that the full GenBank metadata and sequences are typically updated a couple of hours before the more processed files.

description type address
Full GenBank data metadata https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz
  sequences https://data.nextstrain.org/files/ncov/open/sequences.fasta.xz
  aligned https://data.nextstrain.org/files/ncov/open/aligned.fasta.xz
  masked https://data.nextstrain.org/files/ncov/open/masked.fasta.xz
  filtered https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz
Global sample metadata https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.xz
  sequences https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.xz
  aligned https://data.nextstrain.org/files/ncov/open/global/aligned.fasta.xz
  auspice https://nextstrain.org/ncov/open/global
Africa sample metadata https://data.nextstrain.org/files/ncov/open/africa/metadata.tsv.xz
  sequences https://data.nextstrain.org/files/ncov/open/africa/sequences.fasta.xz
  aligned https://data.nextstrain.org/files/ncov/open/africa/aligned.fasta.xz
  auspice https://nextstrain.org/ncov/open/africa
Asia sample metadata https://data.nextstrain.org/files/ncov/open/asia/metadata.tsv.xz
  sequences https://data.nextstrain.org/files/ncov/open/asia/sequences.fasta.xz
  aligned https://data.nextstrain.org/files/ncov/open/asia/aligned.fasta.xz
  auspice https://nextstrain.org/ncov/open/asia
Europe sample metadata https://data.nextstrain.org/files/ncov/open/europe/metadata.tsv.xz
  sequences https://data.nextstrain.org/files/ncov/open/europe/sequences.fasta.xz
  aligned https://data.nextstrain.org/files/ncov/open/europe/aligned.fasta.xz
  auspice https://nextstrain.org/ncov/open/europe
North America sample metadata https://data.nextstrain.org/files/ncov/open/north-america/metadata.tsv.xz
  sequences https://data.nextstrain.org/files/ncov/open/north-america/sequences.fasta.xz
  aligned https://data.nextstrain.org/files/ncov/open/north-america/aligned.fasta.xz
  auspice https://nextstrain.org/ncov/open/north-america
Oceania sample metadata https://data.nextstrain.org/files/ncov/open/oceania/metadata.tsv.xz
  sequences https://data.nextstrain.org/files/ncov/open/oceania/sequences.fasta.xz
  aligned https://data.nextstrain.org/files/ncov/open/oceania/aligned.fasta.xz
  auspice https://nextstrain.org/ncov/open/oceania
South America sample metadata https://data.nextstrain.org/files/ncov/open/south-america/metadata.tsv.xz
  sequences https://data.nextstrain.org/files/ncov/open/south-america/sequences.fasta.xz
  aligned https://data.nextstrain.org/files/ncov/open/south-america/aligned.fasta.xz
  auspice https://nextstrain.org/ncov/open/south-america

Starting your build from these intermediates

Each workflow defines one or more inputs in the builds.yaml file.

In the simplest form, an input specifies a local path to some metadata and sequences, like so:

inputs:
  - name: example-data
    metadata: data/example_metadata.tsv
    sequences: data/example_sequences.fasta

Using the above table, we can easily modify this to create a build which uses the global subsample of GenBank data:

inputs:
  - name: global-representative-genbank-sample
    metadata: https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.gz
    sequences: https://data.nextstrain.org/files/ncov/open/global/sequences.fasta.gz

To avoid unnecessarily aligning these sequences, we can instead start from the aligned sequences, like so:

inputs:
  - name: global-representative-genbank-sample
    metadata: https://data.nextstrain.org/files/ncov/open/global/metadata.tsv.gz
    aligned: https://data.nextstrain.org/files/ncov/open/global/aligned.fasta.gz

The following starting points are available:

  • replace sequences with aligned (skips alignment 😉)
  • replace sequences with masked (skips alignment and masking steps)
  • replace sequences with filtered (skips alignment, masking and basic filtering steps)

Compressed vs uncompressed starting points

The workflow supports compressed metadata and sequences for any input stage. Files may be compressed using xz (.xz) or gzip (.gz) compression.