A COMMON PALINDROMIC RNA SEQUENCE AS UNITARY CONTRIBUTOR TO COPY-CHOICE RECOMBINATION IN SARS-COV-2
This is intended to be in the same vein as my original post in this thread, just before midnight of Feb 6, evaluating similarities in Coronavirus sequences at the level of viral RNA. In this case, the subject is the observation that the receptor-binding domain of SARS-CoV-2 bears significantly high similarity to that of a Coronavirus recently obtained from a pangolin, namely Pan_SL-CoV_GD/P1L.
I have hesitated to comment, because an extensive analysis of this issue has been posted as a pre-review manuscript on the preprint website bioRxiv since March 24, and my preference would have been to wait for its publication, along with the posting of the Pan_SL-CoV_GD/P1L sequence on Genbank. It is now 7 weeks later, and neither the paper nor the sequence has been posted on PubMed or Genbank. In this era of rapid publication, especially for COVID-related work, this is highly unusual.
The current citation is:
Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection
Xiaojun Li, Elena E. Giorgi, Manukumar Honnayakanahalli Marichann, Brian Foley, Chuan Xiao, Xiang-Peng Kong, Yue Chen, Bette Korber, Feng Gao
bioRxiv 2020.03.20.000885; doi: https://doi.org/10.1101/2020.03.20.000885
I will begin by saying I concur completely with this paper, by a team of authors I hold in high regard. I also communicated directly with Brian Foley of the team several days ago. I wish only to add information consistent with my previous posts on this thread, and not to “republish” their work in any way.
Likewise, colleagues of mine in a global collaboration posted here that SARS-CoV-2 was not derived from any pangolin sequence, in:
I concur completely with their analysis, subsequently published in Nature Medicine, done largely at the amino acid level.
I wish here to compare Bat RaTG13 and Pan_SL-CoV_GD/P1L with SARS-CoV-2 principally at the level of viral RNA sequence, to show how a common palindromic RNA sequence may be the unitary contributor to several events of copy-choice recombination that gave rise to these viral sequences.
The Pan_SL-CoV_GD/P1L sequence I used is the original incomplete RNA sequence, first described in:
Lam, T.T., Shum, M.H., Zhu, H. et al. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature (2020).
While incomplete, the gaps do not affect the areas of sequence to be discussed here. I will use the DNA equivalent, derived by reverse transcription, as is common practice.
The central point about the relationship among the receptor-binding domains (RBD) of the three viral sequences can be seen in this amino acid alignment, derived from the much more complex Figure 2A of the Li et al. 2020 paper, to wit:
Amino acid changes are highlighted in blue. This makes their point that the RBD of SARS-CoV-2 is not derived from a virus similar to Bat RaTG13, but rather from one similar to a virus derived from pangolin.
The authors do briefly allude to a much greater difference in RNA sequence between SARS-CoV-2 and the pangolin virus, but I would submit that the nature of that difference deserves a closer look. What follows is an annotated alignment of the RNA sequences in this region, from each of the three viruses.
As I first posted concerning the relatedness at the RNA level between Bat RaTG13 and SARS-CoV-2, this alignment is replete with wobble base mutations (blue arrows), between SARS-CoV-2 and Pan_SL-CoV_GD/P1L. There are 28 in all over a span of 268 nt between apparent changes of track from RaTG13-like to Pan_SL-CoV_GD/P1L-like and back to RaTG13-like sequence.
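For readers who wish to repeat this kind of count on an alignment of their own, a minimal Python sketch follows. It assumes two aligned, gap-free, in-frame DNA sequences of equal length and uses the standard genetic code. The function name and the toy sequences are illustrative only; they are not the viral sequences discussed here, and this is not necessarily how the count of 28 was made.

```python
from itertools import product

# Standard genetic code, built in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def wobble_differences(seq_a, seq_b):
    """Count codons that differ only at the third (wobble) position.

    Returns (codons_compared, wobble_only_differences, synonymous_among_them).
    Both inputs must be aligned, in frame, and free of gaps (assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    seq_a = seq_a.upper().replace("U", "T")
    seq_b = seq_b.upper().replace("U", "T")
    n_codons = len(seq_a) // 3
    wobble_only = 0
    synonymous = 0
    for i in range(0, n_codons * 3, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        # Same first two bases, different third base.
        if codon_a[:2] == codon_b[:2] and codon_a[2] != codon_b[2]:
            wobble_only += 1
            aa_a, aa_b = CODON_TABLE.get(codon_a), CODON_TABLE.get(codon_b)
            if aa_a is not None and aa_a == aa_b:
                synonymous += 1
    return n_codons, wobble_only, synonymous


# Toy example (hypothetical sequences, for illustration only).
print(wobble_differences("GCTAAAGGATTT", "GCCAAGGGATTA"))
```

In the toy example, three of the four codons differ only at the third base, and two of those three changes leave the encoded amino acid unchanged.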
We know from many other virological examples that it takes several decades to accumulate this level of wobble-base mutagenesis, as I described Feb 6 in my first post to this thread. In this case, my estimate would be divergence over a span of 40 years.
So the recombination event that placed this RBD sequence in SARS-CoV-2 occurred in the decade around 1980. Not only could this not have occurred in a lab, but it is also unlikely to have occurred in a pangolin.
Pangolins are solitary animals, meeting only to mate. They are very unlikely to be capable of horizontal transmission of a virus, about as unlikely as hermits living in the wilderness. Rather, they reflect transmission from bats within their range of habitation. So the copy-choice recombination event that led to SARS-CoV-2 having an RBD sequence capable of binding to the ACE-2 receptor occurred in a bat cave four decades ago.
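The arithmetic behind these figures can be laid out explicitly. The sketch below uses only the numbers already given (28 wobble-base changes, 268 nt, an estimated 40 years); the reference year of 2020 and the function name are assumptions made for illustration. The linear rate it prints is simply what those numbers imply, not a calibrated molecular clock.

```python
def summarize_divergence(changes, span_nt, years, reference_year):
    """Restate the figures in the text as simple ratios (illustrative only)."""
    codons = span_nt // 3  # whole codons, hence wobble positions, in the span
    return {
        "codons": codons,
        "fraction_of_nt_changed": changes / span_nt,
        "fraction_of_wobble_positions_changed": changes / codons,
        "implied_changes_per_year_in_span": changes / years,
        "approx_year": reference_year - years,
    }


# Figures from the text; reference_year=2020 is an assumption.
summary = summarize_divergence(changes=28, span_nt=268, years=40, reference_year=2020)
for key, value in summary.items():
    print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

On these numbers, roughly one nucleotide in ten across the span, or close to a third of the wobble positions, differs between the two sequences.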
It is also worthy of note that the RNA pentanucleotide CAGAT, a variant of CAGAC that I highlighted in a recent post, lies directly before the likely area of crossover in the recombinant.
So, the two most unique peptide sequences of SARS-CoV-2 related to its ability to infect human beings and spread rapaciously, the RBD and furin cleavage site, are unified by being preceded by the CAGAC/CAGAT motif.
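Anyone wishing to check where these pentanucleotides fall relative to a suspected crossover can do so with a few lines of Python. This is a minimal sketch: the function names, the 0-based coordinates and the toy sequence are assumptions for illustration, and the crossover position must be supplied by the reader from their own alignment.

```python
def find_motifs(seq, motifs=("CAGAC", "CAGAT")):
    """Return sorted (start, motif) pairs for every occurrence, 0-based."""
    seq = seq.upper().replace("U", "T")  # accept RNA or its DNA equivalent
    hits = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            hits.append((start, motif))
            start = seq.find(motif, start + 1)  # allow overlapping matches
    return sorted(hits)


def nearest_upstream(hits, position):
    """Return the closest hit that ends at or before `position`, or None."""
    upstream = [h for h in hits if h[0] + len(h[1]) <= position]
    return max(upstream) if upstream else None


# Toy sequence (hypothetical, for illustration only).
toy = "AACAGACTTTCAGATGG"
hits = find_motifs(toy)
print(hits)
print(nearest_upstream(hits, 16))
```

The same two functions can be run over a full genome string, with `position` set to the first nucleotide of the region of interest.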
There are other nearly identical sequences, exceeding 99% at the RNA level, noted by the Li et al and Lam et al papers: the coding sequence for Membrane Protein E and for the 3’OH terminus downstream of the nucleocapsid (N) gene. In all three viral sequences for Membrane Protein E, as well as SARS of 2003 (that is identical in the first 50 amino acids with these other three), the palindrome TGAGT is found, which is the complement to CAGAC, just prior to the E gene. Finally, as shown below, the beginning of the 3’OH RNA sequence, identical in all three viruses, is replete with five nucleotide palindromes, including CAGAC and its variant CAGAT.
Even in SARS of 2003, the sequence divergence in the 3’OH region is 3%, far lower than the overall 20% disparity between SARS and SARS-CoV-2.
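To scan a sequence of one's own for such five-nucleotide palindromes, a short Python sketch follows. It takes "palindrome" in the literal sense that applies to CAGAC and TGAGT, a string that reads the same in both directions, and not in the reverse-complement sense used for restriction sites. The function name and toy input are illustrative assumptions.

```python
def literal_palindromes(seq, k=5):
    """Return (start, k-mer) for every k-mer that reads the same both ways.

    Coordinates are 0-based. Note that homopolymer runs such as AAAAA
    also satisfy this literal definition.
    """
    seq = seq.upper().replace("U", "T")
    found = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer == kmer[::-1]:
            found.append((i, kmer))
    return found


# Toy input (hypothetical, for illustration only).
print(literal_palindromes("GGCAGACTT"))
print(literal_palindromes("CAGAT"))
```

CAGAT itself does not pass this test, which is why it is treated here as a variant of CAGAC and has to be searched for by name.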
Therefore, as shown below, much of the critical evolutionary history of both SARS and SARS-CoV-2 can be associated with the proximity of copy-choice recombination sites to CAGAC, its complement, or a similar pentanucleotide.
Others have noted the profligacy of recombination sites within the coronavirus genome that have accumulated over their very long evolutionary history in bats. So there may well be other RNA sequence motifs that tend to facilitate copy-choice errors.
With respect to SARS-CoV-2 and the ancestral viruses that contributed critical regions to its RNA sequence for human pandemic potential, this was clearly a natural process. This reflects an evolution over decades, in bat caves long ago, facilitated by some mechanism, as shown above, whereby CAGAC disrupts the processivity of the viral RNA polymerase complex down its template, and facilitates, albeit rarely, copy-choice errors capable of creating recombinants potentially dangerous to humankind.
To date, no source of SARS-CoV-2 has been determined, and no bat or other mammal has been found to harbor it, apart from human beings during the pandemic.
All around the globe, those of us who have studied emerging viral pathogens at the molecular level for decades are united in our judgment, based on protein and RNA sequence analysis, that SARS-CoV-2 evolved by a series of recombination events in the wild. Sequence divergence shows that these events occurred through many decades of recombination among both similar and distantly-related bat Coronaviruses, potentially in multiple bat species co-habiting in the same limestone bat caves across a wide swath of southern China.
The reader will note that the sequences are labeled “pangolin”, in parentheses, Pre-SARS, Pre-SARS-CoV-2, Pre-Bat RaTG13, Pre-Pan_SL-CoV_GD/P1L, and Pre-HKU9. This is because the source viruses come from the past, and not the present; from past locations, and not the location or host species from which they happened to be much more recently isolated.
This judgment is based on facts and molecular evidence, independently judged by different analyses in the hands of eminently qualified scientists. A number of us have never worked together over long careers. Some of us do not even know of each other by reputation. Yet we have all come to the same conclusion, in China, in Scotland, in North Carolina, in Louisiana, in Texas, in New Mexico, in California and in Australia. The backbone of the virus sequence was derived from a common ancestor of Bat RaTG13 and SARS-CoV-2, most likely in Yunnan province from which Bat RaTG13 was isolated. Small segments of sequence, hundreds of nucleotides long in a genome of 30,000 nt, were derived from viruses ancestral to other viruses only recently isolated in Guangdong province.
The only laboratory in which SARS-CoV-2 was concocted was a natural one in a bat cave, in a process that took decades, an accident of nature waiting for human contact.
William R. Gallaher, Ph.D. (Harvard ’72)
Professor of Microbiology, Immunology and Parasitology, Emeritus
Louisiana State University School of Medicine, New Orleans