Positive selection underlies repeated knockout of ORF8 in SARS-CoV-2 evolution

Cassia Wagner^1,2,*, Kathryn E. Kistler ^2,3, Garrett A. Perchetti ⁴, Noah Baker ⁴, Lauren A. Frisbie ⁵, Laura Marcela Torres ⁵, Frank Aragona ⁵, Cory Yun ⁵, Marlin Figgins ^2,6, Alexander L. Greninger ^2,4, Alex Cox ⁵, Hanna N. Oltean ⁵, Pavitra Roychoudhury ^2,4, Trevor Bedford ^1,2,3

¹ Department of Genome Sciences, University of Washington, Seattle, WA, USA;
² Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Center, Seattle, WA, USA;
³ Howard Hughes Medical Institute, Seattle, WA, USA;
⁴ Department of Laboratory Medicine and Pathology, University of Washington, Seattle, Washington, USA;
⁵ Washington State Department of Health, Shoreline, Washington, USA;
⁶ Department of Applied Mathematics, University of Washington, Seattle, Washington, USA.
^{</sup> *Corresponding author: cassiasw@uw.edu}

Knockout of the ORF8 protein has repeatedly spread through the global viral population during SARS-CoV-2 evolution. Here we use both regional and global pathogen sequencing to explore the selection pressures underlying its loss. In Washington State, we identified transmission clusters with ORF8 knockout throughout SARS-CoV-2 evolution, not just on novel, high fitness viral backbones. Indeed, ORF8 is truncated more frequently and knockouts circulate for longer than for any other gene. Using a global phylogeny, we find evidence of positive selection to explain this phenomenon: nonsense mutations resulting in shortened protein products occur more frequently and are associated with faster clade growth rates than synonymous mutations in ORF8. Loss of ORF8 is also associated with reduced clinical severity, highlighting the diverse clinical impacts of SARS-CoV-2 evolution.

Structure of this repository

This repository includes the code for the analyses and figures for the above manuscript. Clinical data from Washington State Disease Reporting System is not included as this data is derived from confidential medical records. GISAID metadata and sequenced used in the analysis may be accessed at gisaid.org/EPI_SET_230921by.
The SARS-CoV-2 UShER phylogeny is available from UShER.

code contains the scripts for all analyses.
data contains simulated clinical data containing all variables used in severity analysis to check code quality. It also contains a subset of the clinical data variables, which we have permission to share. This folder also contains mutation annotations from Obermeyer et al. Please access the GISAID sequences and SARS-CoV-2 UShER phylogeny using the above links.
nextstrain_build contains the identified clusters and the configurations for building the nextstrain trees to identify transmission clusters of gene knockouts in Washington State.
envs contains the conda config files for python code & notebooks and for matUtils.
notebooks contains jupyter notebooks for plotting results and initial analyses.
params includes the SARS-CoV-2 reference genomes used in analyses & the config file for snakemake pipeline.
usher contains results from analyses using the usher phylogeny.
intrahost contains intrahost variants after filtering to remove samples that did not pass QC.

Setup & installation

Use mamba to quickly (~5 min) install matUtils & python notebooks environments. The environment for python scripts & notebooks can be set up & activated using:

# Install
mamba env create -f envs/orf8ko.yaml

# Activate
mamba activate orf8ko

The environment for matUtils can be set up & activated using:

# Install
mamba env create -f envs/usher-env.yaml

# Activate
mamba activate usher-env

Rscripts were run in RStudio using R version 4.1.2. The R environment dependencies are listed in envs/renv.lock. To use this environment:

# Install renv
install.packages("renv")

# Create and activate renvironment
renv::restore(lockfile = 'envs/renv.lock')

This process should take a few minutes.

Running the analyses

Run code/find_ko.py on .fasta alignment of WA sequences to call potential gene knockouts. See above to access sequences and metadata from GISAID.
Build and call transmission clusters using nextstrain_build
Run intrahost analysis using notebooks/intrahost_analysis.ipynb
Calculate dN/dS using the snakemake workflow: code/dNdS_snakefile. See above to download the UShER tree for this analysis.
Call mutation clusters from UShER tree using code/getMutationClusters.py
Model cluster growth rates using: code/clusterSize_regression.R
Run clade-level analyses using the snakemake workflow: code/variant_snakefile. See above to download the UshER tree for this analysis.
code/combineClinicalData.R is used to generate the dataframe for clinical analysis.
Use code/Fig5.R to run the clinical severity analysis. Although we cannot share the full clinical data to protect patient privacy, we have provided data/clinical_example.tsv as a demo dataset. We have also provided a subset of clinical variables, which we are able to share to while protecting patient privacy, at data/clinical_subset.tsv.