Clade Naming & Definitions
The nomenclature used by Nextstrain to designate clades for SARS-CoV-2 is driven by the following objectives:
- label genetically well defined clades that have reached significant frequency and geographic spread,
- allow for transient clade designations that are elevated to major clades if they persist and rise in frequency,
- provide memorable but informative names,
- gracefully handle clade naming in the upcoming years as SARS-CoV-2 becomes a seasonal virus.
We name a new major clade when it reaches a frequency of 20% globally. When calculating these frequencies, care must be taken to sample sequences approximately evenly in time and space, since sequencing effort varies strongly between countries. A clade name consists of the year it emerged and the next available letter in the alphabet. A new clade should be at least 2 mutations away from its parent major clade.
We name major clades by the year they are estimated to have emerged and a letter, e.g. 19A, 19B, 20A. The yearly reset of letters will ensure that we don't progress too far into the alphabet, while the year-prefix provides immediate context on the origin of the clade that will become increasingly important going forward. These are meant as major genetic groupings and not intended to completely resolve genetic diversity.
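The naming rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the Nextstrain tooling: the function names `next_clade_name` and `qualifies_as_major_clade` are hypothetical, and the thresholds are simply the 20% global frequency and 2-mutation distance stated in the text.

```python
import string


def next_clade_name(year, existing_names):
    """Return the next free major-clade name for an emergence year.

    year: four-digit year the clade is estimated to have emerged, e.g. 2020
    existing_names: names already in use, e.g. {"19A", "19B", "20A"}
    """
    prefix = f"{year % 100:02d}"
    # Letters already taken for this year; the letter sequence restarts each year.
    used = {name[2:] for name in existing_names if name.startswith(prefix)}
    for letter in string.ascii_uppercase:
        if letter not in used:
            return prefix + letter
    raise ValueError(f"no letters left for year {year}")


def qualifies_as_major_clade(global_frequency, mutations_from_parent,
                             min_frequency=0.20, min_mutations=2):
    """Check the two numeric criteria from the text (hypothetical helper)."""
    return (global_frequency >= min_frequency
            and mutations_from_parent >= min_mutations)


existing = {"19A", "19B", "20A", "20B", "20C"}
print(next_clade_name(2020, existing))  # next free letter for 2020
print(next_clade_name(2021, existing))  # letters restart in a new year
```

In practice the criteria are applied with judgment (the frequency estimate depends on how evenly the sequences are sampled), so a hard threshold check like this is only a first approximation.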
The hierarchical structure of clades is sometimes of interest. Here, the “derivation” of a major clade can be labeled with the familiar “.” notation as in 19A.20A.20C for the major clade 20C.
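As a minimal sketch of how such a derivation string could be built, the example below walks a parent map from a clade up to the root. The `PARENT_OF` dictionary and the `derivation` function are hypothetical names for illustration; the map contains only the relationships implied by the 19A.20A.20C example.

```python
# Hypothetical parent map: each clade points to its parent major clade,
# and the root clade points to None.
PARENT_OF = {
    "19A": None,
    "20A": "19A",
    "20C": "20A",
}


def derivation(clade, parent_of):
    """Return the dotted derivation of a clade, root first."""
    path = []
    seen = set()
    while clade is not None:
        if clade in seen:
            raise ValueError(f"cycle detected at {clade}")
        seen.add(clade)
        path.append(clade)
        clade = parent_of[clade]
    return ".".join(reversed(path))


print(derivation("20C", PARENT_OF))
```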
Within these major clades, we monitor potential ‘emerging clades’, which we label by their parent clade and the nucleotide mutation(s) that define them (e.g. 19A/28688C). Note, however, that these mutations are meaningful only in that they define the clade. Once a subclade meets the (soft) criteria on frequency, spread, and genetic distinctiveness, it will be renamed as a major clade (hypothetically, 19A/28688C would become 20D).
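For scripts that need to handle these labels, a small parser can split an emerging-clade label into its parent clade and defining mutation. The sketch below is an assumption-laden illustration: the function name `parse_emerging_label` is hypothetical, and it handles only the single-mutation form shown in the example (19A/28688C), since the text does not specify how several mutations would be written.

```python
import re

# Parent major clade (two-digit year + letter), a slash, then a genome
# position followed by the derived nucleotide, e.g. "19A/28688C".
_LABEL = re.compile(r"^(\d{2}[A-Z])/(\d+)([ACGT])$")


def parse_emerging_label(label):
    """Split an emerging-clade label into (parent, position, nucleotide)."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not a single-mutation emerging-clade label: {label!r}")
    parent, position, nucleotide = match.groups()
    return parent, int(position), nucleotide


print(parse_emerging_label("19A/28688C"))
```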
| Clade | Primary Countries | Mutations | Max Frequency |
|---|---|---|---|
| 19A | Asia: China/Thailand | Root clade | 65-47% Globally in Jan |
| 19B | Asia: China | C8782T T28144C | 28-33% Globally in Jan |
| 20A | N America/Europe/Asia: USA, Belgium, India | C14408T A23403G | 41-46% Globally Apr-May |
| 20B | Europe: UK, Belgium, Sweden | G28881A G28882A G28883C | 19-20% Globally Mar-Apr |
| 20C | N America: USA | C1059T G25563T | 19-21% Globally Apr |
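To illustrate how the defining mutations in the table relate to clade membership, here is a minimal sketch that assigns a clade from a set of observed nucleotide mutations. It is not the script Nextstrain provides; the names `CLADE_MUTATIONS` and `assign_clade` are hypothetical. It also assumes that 20B and 20C are nested within 20A (the text confirms this only for 20C), so the more derived clades are checked first.

```python
# Defining mutations copied from the table above. Order matters: clades
# assumed to be nested inside 20A are listed before 20A itself.
CLADE_MUTATIONS = [
    ("20C", {"C1059T", "G25563T"}),
    ("20B", {"G28881A", "G28882A", "G28883C"}),
    ("20A", {"C14408T", "A23403G"}),
    ("19B", {"C8782T", "T28144C"}),
]
ROOT_CLADE = "19A"


def assign_clade(observed_mutations):
    """Return the first clade whose defining mutations are all observed.

    observed_mutations: mutations relative to the reference, written as
    reference base, position, derived base (e.g. "C14408T").
    """
    observed = set(observed_mutations)
    for clade, defining in CLADE_MUTATIONS:
        if defining <= observed:  # every defining mutation is present
            return clade
    return ROOT_CLADE  # nothing matched, so fall back to the root clade


sample = {"C241T", "C14408T", "A23403G", "C1059T", "G25563T"}
print(assign_clade(sample))
```

A real assignment also has to deal with missing data and reversions, which this set-based check ignores.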
You can view the current clades on the global SARS-CoV-2 Nextstrain tree.
Identifying Nextstrain Clades
To make it easy for users to identify the Nextstrain clade of their own sequences, we provide a simple Python script that can be run on any FASTA file to assign appropriate clades.
This script is part of the ‘ncov’ GitHub repository but does not require running any other part of the pipeline. However, ‘augur’ must be installed to run the script. This can be done in a number of different ways, but is often most easily done using ‘pip’.
Note that when running this script you can supply --sequences if your sequences require aligning first. If you have already aligned your sequences to the ncov repository reference (for example, by running the repository's pipeline), you can supply --alignment. If you supply sequences that are not aligned to the ncov reference, you may get bad results!