Assessing evolutionary pressures on the SARS-CoV-2 Mu (B.1.621) clade

Assessing evolutionary pressures on the SARS-CoV-2 μ (B.1.621) clade

Authors: Alexander G Lucaci1, Jordan D Zehr1, Stephen D Shank1, Darren Martin2, Anton Nekrutenko3, Sergei L Kosakovsky Pond1

  1. Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, Pennsylvania, USA
  2. Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA

Objectives

We identify genomic sites in μ (B.1.621, Latif et al. 2021) clade sequences that may be subject to selective forces and could be prioritized for further studies, but have not yet reached high frequencies.

Our approach

  1. We perform comprehensive selection analyses on a representative sample of sequences from this clade and other (background) clades using RASCL, see the Methods section below for additional details.
  2. We identify and rank sites (+1 for each category) according to the following protocol:
    1. Inferred to be under positive selective pressure.
    2. Are not clade-defining mutations.
    3. Contain mutations that are not predictable based on the evolution of Sarbecovirus sequences. (Lytras et al 2021)
    4. Contain mutations that occur in a large fraction of unique haplotypes. This was shown to be predictive of near term growth in a separate analysis from our global SARS-CoV-2 analysis (Maher et al 2021)
  3. Additionally, where available, we provide interpretation of our identified sites in terms of
    1. Known functional significance.
    2. Temporal growth trends.
    3. Location on 3D structure of the protein.

Results

Data analyzed

We present an analysis of the B.1.621 (μ) variant, performed on individual genes/protein products, using a median of 101 μ sequences downsampled from all available sequences in GISAID to represent the genomic diversity in this clade (see the Methods section below for details). A similarly downsampled global SARS-CoV-2 reference dataset from publicly available sequences via the Virus Pathogen Database and Analysis Resource (ViPR) (Pickett et al., 2012) database is used as background.

Sites identified by a RASCL analysis

Full analysis details can be found in this ObservableHQ notebook

Our analysis identified 67 (Table 1 in the ObservableHQ notebook) individual codon sites (among 1643 sites that are polymorphic in the amino-acid space) that showed evidence of episodic diversifying selection along internal branches of this clade using the MEME method at q ≤ 0.20 false discovery rate (FDR). A total of 5 sites (Table 3 in the ObservableHQ notebook) were found to be subject to directional selection using the FADE method.

We identify high-priority sites in SARS-CoV-2 μ B.1.621 sequences (Table 1). We assign a “Rank” to each site based on a point system described in the “Approach” section above. Briefly, each site is given a point for each of the follow requirements that are met: found to be positively selected, the site occurs outside of the clade defining site set, the site has any unexpected mutations as described in our Sarbecoviruses evolutionary analysis notebook, the site has a mutation present above the minimum threshold in the SARS-CoV-2 Global Haplotype analysis (for this we remove any mutation present in background clade, and remove gaps).

# Coordinate (SARS-CoV-2) Gene/ORF Codon (in gene/ORF) p-value q-value Rank Physiochemical Property
1 16075 RDRP/ORF1b 879/870Y 0.0452115 0.140312 4
2 28873 N 201G 0.0460285 0.138086 4
3 28253 ORF8 121VHF 0.0300385 0.180231 4
4 25336 S 1259HV 0.0157386 0.134903 4
5 25333 S 1258D 0.00852771 0.0959367 4
6 13516 RDRP/ORF1b 26/17I 0.0336153 0.168077 4
7 14122 RDRP/ORF1b 228/219D 0.036867 0.170155 4
8 14530 RDRP/ORF1b 364/355F 0.0416466 0.144161 4
9 14767 RDRP/ORF1b 443/434V 0.0384624 0.161005 4
10 14785 RDRP/ORF1b 449/440T 0.0383252 0.168257 4 charge
11 17976 Helicase/ORF1b 580/1503F 0.0113156 0.107201 3
12 20550 Endornase/ORF1b 310/2361I 0.00645087 0.0893198 3
13 21234 Methyltransferase/ORF1b 192/2589Y 0.049474 0.134929 3
14 21640 S 27LT 0.000824192 0.0247258 3
15 21997 S 146YNTPS 0.000598969 0.0215629 3
16 22003 S 148S 0.0221244 0.165933 3
17 22000 S 147HQNP 0.0110123 0.1166 3 Overall, secondary
18 19482 Exonuclease/ORF1b 481/2005V 0.0474395 0.133424 3
19 25707 ORF3a 106FPIL 0.0410847 0.145005 3
20 27210 ORF6 4PHI 7.98568e-06 0.000479141 3
21 29023 N 251S 0.0425228 0.141743 3
22 19548 Exonuclease/ORF1b 503/2027VS 0.0470548 0.134442 3
23 29443 N 391AN 0.0325738 0.167522 3
24 14470 RDRP/ORF1b 344/335I 0.0339005 0.164921 3
25 18327 Exonuclease/ORF1b 96/1620I 0.0395957 0.15494 3
26 2944 NSP3/ORF1a 633/894S 0.0246692 0.158588 3
27 17820 Helicase/ORF1b 528/1451S 0.0392635 0.160624 3
28 17025 Helicase/ORF1b 263/1186I 0.0241484 0.167182 3
29 15001 RDRP/ORF1b 521/512C 0.0457987 0.139725 3
30 9472 NSP4/ORF1a 307/3070T 0.040304 0.145094 3
31 9424 NSP4/ORF1a 291/3054T 0.0402514 0.147862 3
32 9139 NSP4/ORF1a 196/2959S 0.0111345 0.111345 3
33 8614 NSP4/ORF1a 21/2784VT 0.0400482 0.153376 2
34 6535 NSP3/ORF1a 1273/2091D 0.0246588 0.164392 2
35 8578 NSP4/ORF1a 9/2772H 0.0198022 0.162018 2 volume
36 19479 Exonuclease/ORF1b 480/2004CV 0.00358349 0.0586389 2 volume, charge
37 8659 NSP4/ORF1a 36/2799N 0.0441831 0.142017 2 charge
38 18960 Exonuclease/ORF1b 307/1831V 0.0462054 0.134145 2
39 16260 Helicase/ORF1b 8/931Y 0.0309811 0.17989 2
40 2230 NSP2/ORF1a 476/656SA 0.0316127 0.172433 2

Table 1. A table of high-priority sites in SARS-CoV-2 μ B.1.621 sequences. We assign a “Rank” to each site based on a point system described in the “Approach” section above.

Evidence of natural selection history operating on SARS-CoV-2 genomes

For the set of high-priority SARS-CoV-2 genomic sites (taken from Table 1), sites inferred from μ B.1.621 sequences, we observe when and how selection (positively or negatively) operated on them, through a series of 3-month overlapping intervals going back to the beginning of the pandemic. The earliest intervals ends in February 2020 and the latest - in September 2021.

Figure 1. Evolutionary trajectories of 40 high-priority selected sites (from Table 1). If a site was found to be positively (red) or negatively (blue) selected during a specific time period, a bubble will be drawn at a corresponding point on the plot. The area of the bubble is scaled as -log10 p, where p is the p-value of the FEL likelihood ratio test. Larger bubbles correspond to smaller p-values; p-values are not directly comparable between different time windows and different genes due to differences in sample sizes and other factors. The x-axis shows the endpoint of the time-window; e.g., Mar 30th 2021 will correspond to the analysis performed with the data from Jan 01 2021 to Mar 30th 2021. Figures like this can be generated with the Evidence of natural selection history operating on SARS-CoV-2 genomes ObservableHQ notebook.

Temporal trends of high-priority sites in Spike in μ B.1.621 sequences.

Figure 2. Temporal trends of the substitution combinations at all sites represented in Table 1 in the Spike gene for B.1.621 (μ) sequences in 2021 (from left to right: S/27, S/146,S/147, S/1258, S/1259). The symbol “.” denotes the reference residue at that site. Figures like this can be generated using Trends in mutational patterns across SARS-CoV-2 Spike enabled by data from / Sergei Pond / Observable. Additional search parameters include “B.1.621[pangolin] AND 20210101[after]”

Temporal trends of high-priority sites in RDRP in μ B.1.621 sequences.

Figure 3. Temporal trends of the substitution combinations at all sites represented in Table 1 in the RDRP (RNA-dependent RNA polymerase) gene for B.1.621 (μ) sequences in 2021 (from left to right: RDRP/26, RDRP/228, RDRP/344, RDRP/364, RDRP/443, RDRP/449, RDRP/521, RDRP/879). The symbol “.” denotes the reference residue at that site. Figures like this can be generated using Trends in mutational patterns across SARS-CoV-2 Spike enabled by data from / Sergei Pond / Observable. Additional search parameters include “B.1.621[pangolin] AND 20210101[after]”

Potential biological and clinical significance of mutations

The ongoing monitoring of emergent VOIs and VOCs can detect adaptive mutations before they rise to high frequency, and help establish their relationship to key clinical parameters including pathogenicity and transmissibility. Additionally, continued evolution within a particular clade may form the foundation for a subclade with further functional sites of interest. Based on current information for the Spike gene from Stanford Coronavirus Antiviral & Resistance Database (CoVDB) we include several annotations with clinical relevance.

SARS-CoV-2 B.1.621 (μ) Spike gene, sites of interest from Table 1 of our ObservableHQ notebook.

  • Spike site 144 epitope binding to mAbs
  • Spike site 145 epitope binding to mAbs
  • Spike site 146 epitope binding to mAbs
  • Spike site 147 epitope binding to mAbs
  • Spike site 148 epitope binding to mAbs
  • Spike site 417 epitope binding to mAbs
  • Spike site 501 epitope binding to mAbs

Location of sites inferred to be under under positive selective pressure in the Spike gene from B.1.621 (μ) sequences on the structure of the protein.


Figure 4. Spike protein crystal structure annotation 6CRZ with MEME sites, a measure of episodic selection (These sites are listed in Table 1 of the ObservableHQ notebook). The color legend for these figures is as follows: the NTD region is highlighted in Blue, the RBD region is highlighted in Green, The Heptad Repeat (HR) region is highlighted in Ruby, MEME (Positively selected) sites are highlighted in Orange. Figures like this can be generated using the Categorical site highlighting with NGL ObservableHQ notebook. To interact with the figure above visit: Categorical highlighting RASCL/Mu

Methods

The RASCL application is designed to accept two inputs: a “query” whole genome sequence dataset, and a “background” whole genome sequence dataset. Users are requested to input a whole genome dataset (fasta, unaligned), which will be defined as the “query” sequence dataset, and can be typically retrieved from data repositories such as GISAID. Typically, a background sequence dataset is assembled from publicly available sequences via the ViPR database, and is provided by default unless otherwise specified. However, our application also tolerates the usage of another clade of interest to be used as “background”, which allows one to directly interrogate between-clade evolutionary dynamics. Whole genome datasets are broken down into their respective coding sequences from the NCBI SARS-CoV-2 reference annotation. The gene set includes structural (Spike, Matrix, Nucleocapsid, Envelope) and non-structural (leader, nsp2, nsp3, nsp4, 3C, nsp6, nsp7, nsp8, nsp9, nsp10, helicase, exonuclease, endornase, ORF3a, ORF6, ORF7a, ORF8, RdRp, methyltransferase) genes. Individual genes from query and background datasets are processed from here onwards. Our procedure includes several quality control steps including sanitizing fasta input headers and striking ambiguous codons from an alignment. In order for our analyses to complete rapidly, alignments are subsampled using genetic distances provided by the TN93 (Tamura and Nei 1993) algorithm. A combined (query + background) alignment is created and with sequences kept from the background dataset divergent enough to be useful for subsequent selection analysis. Inference of a maximum likelihood phylogenetic tree (RAxML-NG, Kozlov et al., 2019) is performed on the combined dataset. Trees are labelled for “query” and “background” branches. Analyses SLAC (Pond and Frost, 2005), BGM (Poon et al., 2008), FEL (Pond and Frost, 2005), MEME (Murrell et al., 2012), aBSREL (Smith et al., 2015), BUSTEDS (Wisotsky et al., 2020), RELAX (Wertheim et al., 2015), CFEL (Pond et al., 2021) are performed with state of the art molecular evolution models from the HyPhy (Pond et al., 2020) suite of bioinformatics tools. We report on sites of interest which together may represent regions of ongoing evolution within a specific clade. Results are combined into easily readable and web-accessible JSON files used for web processing. Visualization of results is done with full-feature interactive notebooks through ObservableHQ.

Data description

Data retrieval was performed through GISAID on September 1st 2021 with 3,288 sequences downloaded for the B.1.621 (μ) variant of interest. Additional parameters for the search included: ‘low coverage excl.’ and ‘complete’.

Software availability

RASCL is freely available via a dedicated Github repository:

Acknowledgements

We thank the global community of health-care workers and scientists who work tirelessly to face the pandemic head-on. We are also thankful for the GISAID submitters throughout the world. Additionally, we thank members of the Datamonkey / HyPhy and Galaxy teams for their continued assistance in the development and application of our software.

References

  1. Alaa Abdel Latif, Julia L. Mullen, Manar Alkuzweny, Ginger Tsueng, Marco Cano, Emily Haag, Jerry Zhou, Mark Zeller, Emory Hufbauer, Nate Matteson, Chunlei Wu, Kristian G. Andersen, Andrew I. Su, Karthik Gangavarapu, Laura D. Hughes, and the Center for Viral Systems Biology. outbreak.info, (available at outbreak.info SARS-CoV-2 data explorer). Accessed 8 October 2021.
  2. Exploring the natural origins of SARS-CoV-2 in the light of recombination. Spyros Lytras, Joseph Hughes, Darren Martin, Arné de Klerk, Rentia Lourens, Sergei L Kosakovsky Pond, Wei Xia, Xiaowei Jiang, David L Robertson
    bioRxiv 2021.01.22.427830; doi: https://doi.org/10.1101/2021.01.22.427830
  3. Predicting the mutational drivers of future SARS-CoV-2 variants of concern. M. Cyrus Maher, Istvan Bartha, Steven Weaver, Julia di Iulio, Elena Ferri, Leah Soriaga, Florian A. Lempp, Brian L. Hie, Bryan Bryson, Bonnie Berger, David L. Robertson, Gyorgy Snell, Davide Corti, Herbert W. Virgin, Sergei L. Kosakovsky Pond, Amalio Telenti
    medRxiv 2021.06.21.21259286; doi: Predicting the mutational drivers of future SARS-CoV-2 variants of concern | medRxiv
  4. Elbe, S., and Buckland-Merrett, G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1:33-46. DOI:10.1002/gch2.1018 PMCID: 31565258
  5. Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S, Zaremba S, Gu Z, Zhou L, Larson CN, Dietrich J, Klem EB, Scheuermann RH. ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res. 2012 Jan;40(Database issue):D593-8. doi: 10.1093/nar/gkr859. Epub 2011 Oct 17. PMID: 22006842; PMCID: PMC3245011.
  6. Alexey M Kozlov, Diego Darriba, Tomáš Flouri, Benoit Morel, Alexandros Stamatakis, RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference, Bioinformatics, Volume 35, Issue 21, 1 November 2019, Pages 4453–4455, RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference | Bioinformatics | Oxford Academic
  7. Sergei L Kosakovsky Pond, Art FY Poon, Ryan Velazquez, Steven Weaver, N Lance Hepler, Ben Murrell, Stephen D Shank, Brittany Rife Magalis, Dave Bouvier, Anton Nekrutenko, Sadie Wisotsky, Stephanie J Spielman, Simon DW Frost, Spencer V Muse (2020) HyPhy 2.5—A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. 1 Molecular Biology and Evolution 37.1 (2020): 295-299.
  8. Sergei L. Kosakovsky Pond and Simon D. W. Frost (2005) Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection Molecular Biology and Evolution 22(5): 1208-1222
  9. Art F. Y. Poon, Fraser I. Lewis, Simon D. W. Frost, Sergei L. Kosakovsky Pond, Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models, Bioinformatics, Volume 24, Issue 17, 1 September 2008, Pages 1949–1950, Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models | Bioinformatics | Oxford Academic
  10. Ben Murrell, Joel O. Wertheim, Sasha Moola, Thomas Weighill, Konrad Scheffler and Sergei L. Kosakovsky Pond (2012) Detecting Individual Sites Subject to Episodic Diversifying Selection PLoS Genetics 8(7): e1002764
  11. M. D. Smith, J. O. Wertheim, S. Weaver, B. Murrell, K. Scheffler and S. L. Kosakovsky Pond Less Is More: An Adaptive Branch-Site Random Effects Model for Efficient Detection of Episodic Diversifying Selection Molecular Biology and Evolution 32: 1342-1353
  12. Wisotsky SR, Kosakovsky Pond SL, Shank SD, Muse SV. Synonymous Site-to-Site Substitution Rate Variation Dramatically Inflates False Positive Rates of Selection Analyses: Ignore at Your Own Peril. Mol Biol Evol. 2020 Aug 1;37(8):2430-2439. doi: 10.1093/molbev/msaa037. PMID: 32068869; PMCID: PMC7403620.
  13. Wertheim JO, Murrell B, Smith MD, Kosakovsky Pond SL, Scheffler K. RELAX: detecting relaxed selection in a phylogenetic framework. Mol Biol Evol. 2015 Mar;32(3):820-32. doi: 10.1093/molbev/msu400. Epub 2014 Dec 23. PMID: 25540451; PMCID: PMC4327161.
  14. Kosakovsky Pond SL, Wisotsky SR, Escalante A, Magalis BR, Weaver S. Contrast-FEL-A Test for Differences in Selective Pressures at Individual Sites among Clades and Sets of Branches. Mol Biol Evol. 2021 Mar 9;38(3):1184-1198. doi: 10.1093/molbev/msaa263. PMID: 33064823; PMCID: PMC7947784.
1 Like