Positive selection underlies repeated knockout of ORF8 in SARS-CoV-2 evolution
Cassia Wagner1,2,*, Kathryn E. Kistler 2,3, Garrett A. Perchetti 4, Noah Baker 4, Lauren A. Frisbie 5, Laura Marcela Torres 5, Frank Aragona 5, Cory Yun 5, Marlin Figgins 2,6, Alexander L. Greninger 2,4, Alex Cox 5, Hanna N. Oltean 5, Pavitra Roychoudhury 2,4, Trevor Bedford 1,2,3
1 Department of Genome Sciences, University of Washington, Seattle, WA, USA;
2 Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, WA, USA;
3 Howard Hughes Medical Institute, Seattle, WA, USA;
4 Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA;
5 Washington State Department of Health, Shoreline, Washington, USA;
6 Department of Applied Mathematics, University of Washington, Seattle, Washington, USA.
*Corresponding author: cassiasw@uw.edu
Knockout of the ORF8 protein has repeatedly spread through the global viral population during SARS-CoV-2 evolution. Here we use both regional and global pathogen sequencing to explore the selection pressures underlying its loss. In Washington State, we identified transmission clusters with ORF8 knockout throughout SARS-CoV-2 evolution, not just on novel, high fitness viral backbones. Indeed, ORF8 is truncated more frequently and knockouts circulate for longer than for any other gene. Using a global phylogeny, we find evidence of positive selection to explain this phenomenon: nonsense mutations resulting in shortened protein products occur more frequently and are associated with faster clade growth rates than synonymous mutations in ORF8. Loss of ORF8 is also associated with reduced clinical severity, highlighting the diverse clinical impacts of SARS-CoV-2 evolution.
Structure of this repository
This repository includes the code for the analyses and figures for the above manuscript.
Clinical data from the Washington State Disease Reporting System are not included because they are derived from confidential medical records.
GISAID metadata and sequences used in the analysis may be accessed at gisaid.org/EPI_SET_230921by.
The SARS-CoV-2 UShER phylogeny is available from UShER.
- code contains the scripts for all analyses.
- data contains simulated clinical data containing all variables used in severity analysis to check code quality. It also contains a subset of the clinical data variables, which we have permission to share. This folder also contains mutation annotations from Obermeyer et al. Please access the GISAID sequences and SARS-CoV-2 UShER phylogeny using the above links.
- nextstrain_build contains the identified clusters and the configurations for building the nextstrain trees to identify transmission clusters of gene knockouts in Washington State.
- envs contains the conda config files for python code & notebooks and for matUtils.
- notebooks contains jupyter notebooks for plotting results and initial analyses.
- params includes the SARS-CoV-2 reference genomes used in analyses & the config file for snakemake pipeline.
- usher contains results from analyses using the usher phylogeny.
- intrahost contains intrahost variants after filtering to remove samples that did not pass QC.
Setup & installation
Use mamba to install the matUtils and python environments; each install takes about 5 minutes. The environment for python scripts & notebooks can be set up & activated using:
# Install
mamba env create -f envs/orf8ko.yaml
# Activate
mamba activate orf8ko
The environment for matUtils can be set up & activated using:
# Install
mamba env create -f envs/usher-env.yaml
# Activate
mamba activate usher-env
R scripts were run in RStudio using R version 4.1.2. The R environment dependencies are listed in envs/renv.lock. To use this environment:
# Install renv
install.packages("renv")
# Create and activate renvironment
renv::restore(lockfile = 'envs/renv.lock')
This process should take a few minutes.
Running the analyses
- Run code/find_ko.py on a .fasta alignment of WA sequences to call potential gene knockouts. See above to access sequences and metadata from GISAID.
- Build and call transmission clusters using nextstrain_build.
- Run intrahost analysis using notebooks/intrahost_analysis.ipynb.
- Calculate dN/dS using the snakemake workflow: code/dNdS_snakefile. See above to download the UShER tree for this analysis.
- Call mutation clusters from the UShER tree using code/getMutationClusters.py.
- Model cluster growth rates using code/clusterSize_regression.R.
- Run clade-level analyses using the snakemake workflow: code/variant_snakefile. See above to download the UShER tree for this analysis.
- code/combineClinicalData.R is used to generate the dataframe for clinical analysis.
- Use code/Fig5.R to run the clinical severity analysis. Although we cannot share the full clinical data to protect patient privacy, we have provided data/clinical_example.tsv as a demo dataset. We have also provided a subset of clinical variables, which we are able to share while protecting patient privacy, at data/clinical_subset.tsv.
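To illustrate the idea behind calling a potential gene knockout from an aligned sequence, here is a minimal sketch in python. It is not the logic of code/find_ko.py: the function names, the coordinate convention (1-based, inclusive) and the rule used (any in-frame stop codon before the gene's final codon) are assumptions made for this example only. Deletions and frameshifts are not handled.

```python
# Hypothetical sketch: flag a gene as a potential knockout when its coding
# sequence carries a premature stop codon. Names and rules here are
# illustrative assumptions, not taken from code/find_ko.py.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the 1-based codon index of the first premature stop, or None.

    `cds` is the in-frame coding sequence including its terminal stop codon.
    Codons with gaps or ambiguous bases never match a stop codon, so they
    are skipped implicitly.
    """
    cds = cds.upper()
    n_codons = len(cds) // 3
    # The last codon is the expected stop, so it is excluded from the scan.
    for i in range(n_codons - 1):
        codon = cds[3 * i : 3 * i + 3]
        if codon in STOP_CODONS:
            return i + 1
    return None


def call_knockout(genome, start, end):
    """Scan one gene of a reference-aligned genome for a premature stop.

    `start` and `end` are assumed to be 1-based, inclusive coordinates of
    the gene in the alignment.
    """
    return first_premature_stop(genome[start - 1 : end])


if __name__ == "__main__":
    # Toy sequences, not real SARS-CoV-2 data.
    intact = "GG" + "ATGAAATTTTAA" + "CC"
    truncated = "GG" + "ATGTAATTTTAA" + "CC"
    print(call_knockout(intact, 3, 14))     # None
    print(call_knockout(truncated, 3, 14))  # 2
```

In this sketch a return value of None means no premature stop was found, and an integer gives the codon at which the protein would be truncated. A real pipeline would also need to decide how much truncation counts as a knockout and how to treat low-quality sequences.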