SARS-CoV-2 genomic nomenclature


 

Trevor Bedford (@trvrb)
Associate Professor, Fred Hutchinson Cancer Research Center
2 Feb 2021
SARS-CoV-2 Evolution Working Group
WHO
 
Slides at: bedford.io/talks

Overview of clade/lineage assignments

Distribution of clade/lineage assignments

  • 5 Nextstrain clades (20A, 20B, 20C, 20G, 20I) comprise ~80% of circulating viruses
  • 2 GISAID clades (GH, GR) comprise ~80% of circulating viruses
  • 5 PANGO lineages (B.1, B.1.1.7, B.1.2, B.1.177, B.1.351) comprise ~50% of circulating viruses

Nextstrain clade nomenclature

  • Naming proceeds in hurricane-style year-letter format, ie 21A, 21B, 21C, etc.
  • A clade is labeled when it reaches >20% global or >30% regional frequency for >2 months
  • A named clade must be at least 2 nucleotide mutations distant from its parent clade
  • A named clade is immediately labeled when a "variant of concern" (VOC) is recognized; these clades are dual-labeled with clade and variant name, eg 20H/501Y.V2
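
The naming rules above can be sketched in code. This is a minimal illustration, not the Nextstrain implementation. The function names, the input layout (a date-sorted list of observations), and the reading of "2 months" as 60 days are all assumptions. It also assumes that VOC recognition waives only the frequency criterion and leaves the mutation-distance criterion in place.

```python
from datetime import date, timedelta
from string import ascii_uppercase

# Thresholds from the slide. Reading "2 months" as 60 days is an assumption.
GLOBAL_FREQ = 0.20
REGIONAL_FREQ = 0.30
MIN_DURATION = timedelta(days=60)
MIN_MUTATIONS = 2


def next_clade_name(year, existing):
    """Return the next unused year-letter name for `year`, eg 21A, 21B, 21C."""
    prefix = f"{year % 100:02d}"
    for letter in ascii_uppercase:
        name = prefix + letter
        if name not in existing:
            return name
    raise ValueError(f"no letters left for year {year}")


def sustained_above_threshold(observations):
    """Check the frequency criterion.

    `observations` is a hypothetical input: a date-sorted list of
    (day, global_freq, max_regional_freq) tuples. Returns True if an
    unbroken run of observations above either threshold spans more
    than MIN_DURATION.
    """
    run_start = None
    for day, global_freq, regional_freq in observations:
        if global_freq > GLOBAL_FREQ or regional_freq > REGIONAL_FREQ:
            if run_start is None:
                run_start = day
            if day - run_start > MIN_DURATION:
                return True
        else:
            run_start = None  # the run is broken, start over
    return False


def qualifies_for_name(observations, mutations_from_parent, is_voc=False):
    """Apply the distance rule, then either the VOC rule or the frequency rule."""
    if mutations_from_parent < MIN_MUTATIONS:
        return False
    # Assumption: VOC recognition bypasses the frequency/duration wait only.
    return is_voc or sustained_above_threshold(observations)
```

For example, a clade observed weekly at 25% global frequency for ten weeks (a 63-day span) and sitting 2 mutations from its parent would qualify, and `next_clade_name(2021, {"21A", "21B"})` would give it the name `21C`.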

Label subclades by mutations

"California variant" as case study

  • Groups notice a clade bearing spike mutations S13I, W152C and L452R that has been rising in frequency in Southern California and want to discuss it publicly, referring to it as "CAL.20C"
  • At time of recognition, this clade is scattered across multiple PANGO lineages and so there's no way for this "variant" to be announced with a matching PANGO designation
  • Subsequently, PANGO nomenclature has been updated and this clade is now distributed between sister lineages B.1.427 and B.1.429. Currently, to refer to the "California variant" in PANGO nomenclature, you'd say "a variant comprised of lineages B.1.427 and B.1.429".
  • In Nextstrain nomenclature, this variant can be labeled at time of initial recognition as 20C/S:452R without any registration or updating required on our part
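
To illustrate why a "clade/mutation" label needs no registration step, here is a rough sketch that parses a label such as 20C/S:452R and checks whether a genome carries it. The label grammar (two-digit year plus letter, slash, gene, colon, position, residue), the function names, and the representation of a genome as a clade plus a set of (gene, position, residue) tuples are assumptions made for illustration, not a Nextstrain API.

```python
import re
from typing import NamedTuple


class MutationLabel(NamedTuple):
    clade: str     # eg "20C"
    gene: str      # eg "S" for spike
    position: int  # amino acid position, eg 452
    residue: str   # derived amino acid, eg "R"


# Assumed grammar: <clade>/<gene>:<position><residue>, eg 20C/S:452R
_LABEL_RE = re.compile(r"(\d{2}[A-Z])/([A-Za-z0-9]+):(\d+)([A-Z*-])")


def parse_label(label):
    """Split a clade/mutation label into its parts, or raise ValueError."""
    match = _LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"not a clade/mutation label: {label!r}")
    clade, gene, position, residue = match.groups()
    return MutationLabel(clade, gene, int(position), residue)


def carries_label(label, clade, substitutions):
    """True if a genome in `clade` with the given substitutions fits `label`.

    `substitutions` is a hypothetical input: a set of
    (gene, position, residue) tuples called against the reference.
    """
    parsed = parse_label(label)
    key = (parsed.gene, parsed.position, parsed.residue)
    return clade == parsed.clade and key in substitutions
```

A genome assigned to 20C with spike substitutions 13I, 152C and 452R would satisfy `carries_label("20C/S:452R", "20C", {("S", 13, "I"), ("S", 152, "C"), ("S", 452, "R")})`. Dual clade/variant names such as 20H/501Y.V2 follow a different pattern and are deliberately rejected by this parser.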

Tension between few labels for vaccine strain selection and many labels for genomic epidemiology, but a middle ground is possible

  • Clades/lineages should be monophyletic and distinct from one another to encourage accurate recall when assigning new genomes
  • Demarcations should be coarse enough to facilitate accurate recall and parsable dynamics; I'd like fewer than 50 demarcations per year
  • I have no strong preference between hurricane-style 21A names and hierarchical B.1.427 names
  • Lean on "/ mutation" labeling to retain specificity without needing an overlong catalog of clades/lineages