The fact that that area of the SARS-CoV-2 genome is a site of RNA polymerase stuttering is clear from the tandem repeat in the RNA sequence, that evidently occurred in the long ago ancestor of SARS-CoV-2 and Bat RaTG13 (tandem repeat mutated in latter) just upstream of the site of the insert, thusly,
SARS-CoV-2 tat cagact cagact tgctcctcggcgggcacgtagt
This is, after all, a looped out peptide sequence, and even when the furin site is common, as in other Coronaviruses, there is not strict alignment relative to neighboring, more constant, peptide regions. The irony of a functionally important loop that is also hypervariable.
This is a really nice finding, Bill. This sequence is in lots of things when you BLAST it but the fact it is in a bat CoV is very convincing to me that that is the source. Adds to the evidence that the polybasic insertion occurred in a bat and not some other animal.
I’m new to this forensic analysis, and not expert on protein design, so please forgive any naivety. But if you were doing something like this for evil purposes, wouldn’t you throw down a bunch of synonymous changes too, to make it look like something older (and thus less traceable)?
I didn’t really mean anything to do with labs. There is still a question as to whether this insertion arose in an intermediate host (or potentially in cryptic transmission in humans). This observation increases the likelihood that the insertion occurred in a bat and thus SARS-CoV-2 was a fully human transmissible virus in the bat species.
The sequence analysis all seems very cogent, especially the discussion on copy-choice errors. But I’m a bit confused about the geographical aspect of your supposition. You say that the HKU9 CoV was found in the Guangdong province, but your transport to Wuhan involves a train from Yunnan. Does that mean you suppose that the mixed infection was in the intersection of Rousettus and Rhinolophus ranges, and then the novel sequence was spread unidirectionally to Yunnan (e.g., Kunming)? Is it not possible that it was maintained in the macrobat, spreading to points in Guangdong that are accessible to/from Wuhan? That seems a little closer. As for the carrier being a human, it would seem surprising to me that this person could be the only one who handled the animal in question and that s/he became contagious only shortly after disembarking from the hypothetical train. I wonder if bats and other animals are themselves transported long distances in China to fill the demand in various markets. Are there bushmeat brokers who might engage in that sort of shipment?
You reference Occam’s Razor in some of your writing. I’m a fan too, so I’m trying to overcome the skepticism that bubbles up in my mind regarding this sole-traveler hypothesis…
I suppose one of the things that was always attractive about the pangolin hypothesis is that this animal is rare enough to be transported long distances. My intuition says that is probably not the case for bats.
We now have more than ample evidence that the animal that can transport SARS-CoV-2 from point A to point B, and spread it the fastest, is a human being. Caged animals of any kind need not apply, and no animal reservoir for the virus has been found, now several months into the investigation. It is not for lack of the Chinese looking; they are far more motivated to understand the genesis of this than anyone.
The “pangolin hypothesis” was based on faulty molecular analysis of sequence similarity, which turned out not to be that similar at all. That explanation was debunked 2.5 months ago.
I am not a fan of “gain of function” experiments at all, but the work of Zhengli and colleagues has shown that direct bat to human Coronavirus infection is possible. We are certain of it for Ebola.
The overall sequence of SARS-CoV-2 is closest to Yunnan isolates – only a bit appears to come from a virus that is HKU9-like. So it most likely came from Yunnan – but the outbreak occurred 700 miles away in Wuhan, not in Yunnan. Not along the way there either. The sole human traveler on high speed train while still in incubation is the best fit to the known information, in my view. Ockham’s razor (note spelling).
Microbats, Macrobats and Bat Coronaviruses have been circulating through the Karst limestone formations in the south of China, which stretch from southeast Yunnan to western Guangdong, far longer than the Chinese people have been in existence. That is saying a lot, but it is a truth attested to by the very wide varieties found there of both hosts and viruses. Human Coronavirus disease out of that environment has been an accident waiting to happen, and still is. Emergence happens.
About an hour from this writing, I can go on my porch and watch a flight of microbats rise from hollow trees in the wetland forest at the rear of my property, to skim the headwaters of Gum Creek for their nightly meal of insects. It is the peril of being a virologist that there are times I wonder about their virome.
But right now, the grocery store is a far more dangerous place than my forest. We would all instinctively like to blame that on someone, somewhere. The truth is that every virus of human beings represents a past accidental emergence, an innocent wrong place, wrong time, thing.
Regarding the overall similarity to Yunnan sequences: OK, fair enough. But frankly, this just further tips my intuition toward an accidental escape from the Wuhan lab. It seems there is a lot of research activity devoted to collecting viruses in Yunnan. And, in keeping with your personal account of the critters inhabiting your local environs (which I do not think of as replete with Karst geology; correct me if I’m wrong), I keep finding myself wondering… Why do the researchers’ collection sites emphasize locations so far from Wuhan? Surely there must be bats in and around Wuhan. Is it simply the case that most of the coronaviruses that can infect both bats and humans come from the Yunnan region?
Regarding the spelling of Fra. William of Ockham: I wonder why references to the “razor” so frequently use the “Occam” spelling? I suppose it has to do with the Latin phrase “novacula occami” recorded by Libert Froidmont. And there we arrive at a dilemma: If William merely expressed a philosophy without explicitly invoking the steely metaphor, perhaps the naming rights should go to Prof. Froidmont. (In fact, it seems there is little in William’s surviving writings that really justifies singling him out for edgy parsimony, related antecedents going back at least as far as Aristotle.) Nevertheless, if English-speakers are going to translate “novacula” to “razor,” it would only seem consistent to restore what is considered to be the English spelling of William’s birthplace. Thanks for prompting me to look into the matter.
A COMMON PALINDROMIC RNA SEQUENCE AS UNITARY CONTRIBUTOR TO COPY-CHOICE RECOMBINATION IN SARS-COV-2
This is intended to be in the same vein as my original post in this thread, just before midnight of Feb 6, evaluating similarities in Coronavirus sequences at the level of viral RNA. In this case, the subject is the observation that the receptor-binding domain of SARS-CoV-2 bears significantly high similarity to that of a Coronavirus recently obtained from a pangolin, namely Pan_SL-CoV_GD/P1L.
I have hesitated to comment, because an extensive analysis of this issue has been posted as a pre-review manuscript on the preprint website bioRxiv since March 24, and my preference would have been to wait for its publication, along with the posting of the Pan_SL-CoV_GD/P1La sequence on Genbank. It is now 7 weeks later, and neither the paper nor sequence have been posted on PubMed or Genbank. In this era of rapid publication, especially for COVID-related work, this is highly unusual.
The current citation is:
Emergence of SARS-CoV-2 through Recombination and Strong Purifying Selection
Xiaojun Li, Elena E. Giorgi, Manukumar Honnayakanahalli Marichann, Brian Foley, Chuan Xiao, Xiang-Peng Kong, Yue Chen, Bette Korber, Feng Gao
bioRxiv 2020.03.20.000885; doi: https://doi.org/10.1101/2020.03.20.000885
I will begin by saying I concur completely with this paper, by a team of authors I hold in high regard. I have also communicated directly with Brian Foley of the team several days ago. I wish only to add additional information consistent with my previous posts on this thread, and not “republish” their work in any way.
Likewise, colleagues of mine in a global collaboration posted here that SARS-CoV-2 was not derived from any pangolin sequence, in:
I concur completely with their analysis, subsequently published in Nature Medicine, done largely at the amino acid level.
I wish here to compare Bat RaTG13 and Pan_SL-CoV_GD/P1L with SARS-CoV-2 principally at the level of viral RNA sequence, to show how a common palindromic RNA sequence may be the unitary contributor to several events of copy-choice recombination that gave rise to these viral sequences.
The Pan_SL-CoV_GD/P1L sequence I used is the original incomplete RNA sequence, first described in:
Lam, T.T., Shum, M.H., Zhu, H. et al. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature (2020). https://doi.org/10.1038/s41586-020-2169-0
While incomplete, the gaps do not affect the areas of sequence to be discussed here. I will use the DNA equivalent, derived by reverse transcription, as is common practice.
The central point of the relationship between the receptor-binding domain (RBD) among the three viral sequences can be seen in this amino acid alignment, derived from the much more complex Figure 2A of the Li et al. 2020 paper, to wit:
Amino acid changes are highlighted in blue. This makes their point that the RBD of SARS-CoV-2 is not derived from a virus similar to Bat RaTG13, but rather from one similar to a virus derived from pangolin.
The authors do briefly allude to a much greater difference in RNA sequence between SARS-CoV-2 and the pangolin virus, but I would submit that the nature of that difference deserves a closer look. What follows is an annotated alignment of the RNA sequences in this region, from each of the three viruses.
As I first posted concerning the relatedness at the RNA level between Bat RaTG13 and SARS-CoV-2, this alignment is replete with wobble base mutations (blue arrows), between SARS-CoV-2 and Pan_SL-CoV_GD/P1L. There are 28 in all over a span of 268 nt between apparent changes of track from RaTG13-like to Pan_SL-CoV_GD/P1L-like and back to RaTG13-like sequence.
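For readers who want to repeat this kind of count on their own alignment, here is a minimal sketch in Python. It assumes two gap-free, in-frame, equal-length coding sequences; the function name and the toy sequences are hypothetical and are not taken from the viral alignment itself. Note that it only tallies mismatches by codon position; it does not check whether a third-position change is synonymous.

```python
def count_mismatches_by_codon_position(seq_a: str, seq_b: str, frame: int = 0) -> dict:
    """Tally mismatches at codon positions 1, 2 and 3 (3 = wobble base).

    Assumes both sequences are already aligned, gap-free and of equal length,
    and that the reading frame starts at index `frame`.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {1: 0, 2: 0, 3: 0}
    for i in range(frame, len(seq_a)):
        if seq_a[i].lower() != seq_b[i].lower():
            counts[(i - frame) % 3 + 1] += 1
    return counts


# Toy sequences for illustration only (not real viral data):
toy_a = "tatcagactaat"  # tat cag act aat
toy_b = "taccagacgaat"  # tac cag acg aat
toy_counts = count_mismatches_by_codon_position(toy_a, toy_b)
print(toy_counts)
```

Taking the figures quoted above at face value, 28 wobble changes over 268 nt (about 89 codons) would mean that roughly 31% of third positions differ in that stretch.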
We know from many other virological examples that it takes several decades to accumulate this level of wobble-base mutagenesis, as I described Feb 6 in my first post to this thread. In this case, my estimate would be divergence over a span of 40 years.
So the recombination event resulting in this RBD sequence being in SARS-CoV-2 occurred in a decade around 1980. Not only could this not have occurred in a lab, but it is also unlikely to have occurred in a pangolin.
Pangolins are solitary animals, meeting only to mate. They are very unlikely to be capable of horizontal transmission of a virus, about as unlikely as hermits living in the wilderness. Rather, they reflect transmission from bats within their range of habitation. So the copy-choice recombination event that led to SARS-CoV-2 having an RBD sequence capable of binding to the ACE-2 receptor occurred in a bat cave four decades ago.
It is also worthy of note that the RNA pentanucleotide CAGAT, a variant of CAGAC that I highlighted in a recent post, lies directly before the likely area of crossover in the recombinant.
So, the two most unique peptide sequences of SARS-CoV-2 related to its ability to infect human beings and spread rapaciously, the RBD and furin cleavage site, are unified by being preceded by the CAGAC/CAGAT motif.
There are other nearly identical sequences, exceeding 99% at the RNA level, noted by the Li et al. and Lam et al. papers: the coding sequence for Membrane Protein E and for the 3’OH terminus downstream of the nucleocapsid (N) gene. In all three viral sequences for Membrane Protein E, as well as SARS of 2003 (that is identical in the first 50 amino acids with these other three), the palindrome TGAGT is found, which is the complement to CAGAC, just prior to the E gene. Finally, as shown below, the beginning of the 3’OH RNA sequence, identical in all three viruses, is replete with five nucleotide palindromes, including CAGAC and its variant CAGAT.
Even in SARS of 2003, the sequence disparity in the 3’OH region is 3%, far lower than the overall 20% disparity between SARS and SARS-CoV-2.
Therefore, as shown below, much of the critical evolutionary history of both SARS and SARS-CoV-2 can be associated with the proximity of copy-choice recombination sites to CAGAC, its complement, or a similar pentanucleotide.
Others have noted the profligacy of recombination sites within the coronavirus genome that have accumulated over their very long evolutionary history in bats. So there may well be other RNA sequence motifs that tend to facilitate copy-choice errors.
With respect to SARS-CoV-2 and the ancestral viruses that contributed critical regions to its RNA sequence for human pandemic potential, this was clearly a natural process. This reflects an evolution over decades, in bat caves long ago, facilitated by some mechanism, as shown above, whereby CAGAC disrupts the processivity of the viral RNA polymerase complex down its template, and facilitates, albeit rarely, copy-choice errors capable of creating potentially dangerous recombinants to humankind.
To date, no source of SARS-CoV-2 has been determined, and neither bat, nor other mammal, has been found to harbor it except human beings in the pandemic.
All around the globe, those of us who have studied emerging viral pathogens at the molecular level for decades are united in our judgment, based on protein and RNA sequence analysis, that SARS-CoV-2 evolved by a series of recombination events in the wild. Sequence divergence shows that these events occurred through many decades of recombination among both similar and distantly-related bat Coronaviruses, potentially in multiple bat species co-habiting in the same limestone bat caves across a wide swath of southern China.
The reader will note that the sequences are labeled “pangolin”, in parentheses, Pre-SARS, Pre-SARS-CoV-2, Pre-Bat RATG13, Pre- Pan_SL-CoV_GD/P1L, and Pre-HKU9. This is because the source viruses come from the past, and not the present. From past locations, and not the location or host species from which they happened to be much more recently isolated.
This judgment is based on facts and molecular evidence, independently judged by different analyses in the hands of eminently qualified scientists. A number of us have never worked together over long careers. Some of us do not even know of each other by reputation. Yet we have all come to the same conclusion, in China, in Scotland, in North Carolina, in Louisiana, in Texas, in New Mexico, in California and in Australia. The backbone of the virus sequence was derived from a common ancestor of Bat RaTG13 and SARS-CoV-2, most likely in Yunnan province from which Bat RaTG13 was isolated. Small segments of sequence, hundreds of nucleotides long in a genome of 30,000 nt, were derived from viruses ancestral to other viruses only recently isolated in Guangdong province.
The only laboratory in which SARS-CoV-2 was concocted was a natural one in a bat cave, in a process that took decades, an accident of nature waiting for human contact.
William R. Gallaher, Ph.D. (Harvard ’72)
Professor of Microbiology, Immunology and Parasitology, Emeritus
Louisiana State University School of Medicine, New Orleans
I read with much interest this thoughtful thread; thanks to you all. I’ve been studying the evolution of the spike protein in detail as well.
Two issues are still open (not only) in my view: 1) the origin of nCoV-19 and 2) the origin of the furin cleavage site.
The origin of nCoV-19.
The physicist “John von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” (quoted by Enrico Fermi, in Dyson F. Nature 2004; 427:297). Now, given that analysing Chinese railway timetables is far beyond my skills, I’d go instead for something more traditional: get more data. Trivial? Maybe not. I am deeply astonished by how much has been said and published on the origin(s) of nCoV-19, and all of that mainly based on the comparisons with a single sequence: Bat-CoV-RaTG13. How could this have happened?
I am used to teaching students that for meaningful phylogenetic/evolutionary analyses, taxon sampling should be as balanced as possible. Here, bat relatives of Bat-CoV-RaTG13 are clearly underrepresented. Proposed solution: someone should ask Dr Zheng-Li Shi where she collected that precious sample, then send someone over there to collect some more bat samples and end this story once and for all, without the need for more just-so stories. Does anybody agree with my proposal?
The origin of the furin cleavage-site.
In this thread the main proposal about the origin of the furin cleavage-site invokes replication errors and the CAGAC palindrome. Is there a chance that Gallaher’s “RNA polymerase stuttering” hypothesis could be tested in the lab? That would be terrific.
Yet it is one thing to generate (synonymous) mutations, and quite another to get them fixed in the population.
Yes, the S1/S2 cleavage site is present in other Coronaviruses (e.g., ref to Table1-Fig2, in Coutard et al., 2020, Antiviral Res, https://doi.org/10.1016/j.antiviral.2020.104742). But it is not quite the same site as in 2019-nCoV, as pointed out in Walls et al. (Cell 2020, 180, 1–12). This cleavage site is “conserved among the 144 SARS CoV-2 isolates sequenced to date but not in the closely related RaTG13” (Walls 2020). The authors demonstrate that the acquisition of the RRAR recognition site in 2019-nCoV is “something special”. Indeed, the Western blot in Fig 1D shows that the S1/S2 cleavage takes place in 2019-nCoV but not in SARS-CoV: the two cleavage sites are therefore not “equivalent”. Nor does the cleavage take place in a mutated sequence, in which only the Arg685 is conserved “… thereby mimicking the S1/S2 cleavage site of the related SARSr-CoV S CZX21” (Walls 2020).
Establishing the origin of the S1/S2 cleavage site in 2019-nCoV is imperative for “… it is becoming increasingly apparent that proteolytic activation of spike by host cell proteases also plays a critical role … in cell and tissue tropism, host range, and pathogenesis” (Millet & Whittaker, 2015, Virus Res 202:120-134) … and “… S-protein “priming” … may provide a gain-of-function to the 2019-nCoV for efficient spreading in the human population compared to other lineage b beta-coronaviruses” (Coutard et al., 2020, Antiviral Res, https://doi.org/10.1016/j.antiviral.2020.104742).
Conclusion: To me, the origin of the furin cleavage-site in 2019-nCoV is not clear yet.
Multiple peer-reviewed articles have shown that cats are susceptible to SARS-CoV-2, which raises an interesting question - could a SARS-CoV-2 ancestor without PRRA infect cats and recombine with FIPV to acquire PRRA? Could this be tested experimentally?
I have submitted a paper entitled “A COMMON PALINDROMIC RNA SEQUENCE AS UNITARY BREAKPOINT CONTRIBUTOR TO COPY-CHOICE RECOMBINATION IN SARS-COV-2” and posted a preprint version (subject to revision post-review) on bioRxiv. I will post the doi when available.
Relevant to the breakpoint sequence hypothesis are the number and position of CAGAC and CAGAT on the plus and minus strands of SARS-CoV-2. A map of these is not included in the paper, due to its file size, but a pdf copy is made available here.
The map indicates not only the location of breakpoint sequences, but also their position relative to the RNA code for the known gene products of the virus. Navigating the 30K genome is challenging, and I hope this helps.
I have run a similar breakpoint analysis on the ref seq for Middle East Respiratory Syndrome (MERS) virus in the spike (S) protein region. This is the protein, especially the S1 attachment subunit, that is the principal determinant of host range, pathogenesis via cell fusion and communicability between human subjects.
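For anyone who would rather regenerate such a map than page through a pdf, the scan can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the toy string is not viral sequence, and a "minus strand" hit is taken to mean the reverse complement of the motif appearing on the plus strand (GTCTG for CAGAC, ATCTG for CAGAT). Positions are reported as 1-based plus-strand coordinates.

```python
import re

MOTIFS = ("CAGAC", "CAGAT")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA-equivalent sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_breakpoint_motifs(genome: str, motifs=MOTIFS):
    """List (position, strand, motif) hits, sorted by 1-based plus-strand position.

    A '-' hit means the reverse complement of the motif occurs on the plus
    strand at that position, i.e. the motif itself reads 5'->3' on the minus strand.
    """
    genome = genome.upper().replace("U", "T")  # accept RNA or DNA-equivalent input
    hits = []
    for motif in motifs:
        for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
            # A zero-width lookahead reports overlapping occurrences as well.
            for match in re.finditer(f"(?={pattern})", genome):
                hits.append((match.start() + 1, strand, motif))
    return sorted(hits)


# Toy sequence for illustration only (not real viral data):
toy_hits = find_breakpoint_motifs("TTCAGACAAGTCTGCCCAGATTT")
for position, strand, motif in toy_hits:
    print(position, strand, motif)
```

To apply it to a real genome, one would read the reference sequence from a FASTA file into a single string and pass that string to `find_breakpoint_motifs`.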
The specificity of S is responsible in large part for the widespread and rapid communicability of SARS-CoV-2, with relatively low virulence in most of those infected (mortality varies but much less than 10%); in contrast, the spread of MERS is more easily contained, but its virulence is much higher, estimated at 35%.
A virus with the high communicability of SARS-CoV-2 and the virulence of MERS would be potentially apocalyptic in its impact.
Against that prospect, below is an annotated breakpoint sequence map for the reference sequence of MERS, delineating each occurrence of CAGAC or CAGAT, or its equivalent on the “minus” template strand, GTCTG. There are many fewer instances of these sequences in the MERS genome than in SARS-CoV-2, and most do not occur in a comparable location between the two viral genomes.
First of all, as noted by Graham and Baric (2010), each of the independently transcribed mRNAs is preceded by an identical hexanucleotide ACGAAC, that serves as the transcriptional regulatory sequence (TRS). Each could theoretically serve as a breakpoint bracketing any of the ORFs from S onward.
Second, there is a breakpoint sequence in MERS S, 21564CAGAC, that is very close to the relative position of 21691CAGAT in SARS-CoV-2. Homologous recombination at this point would not be likely to have a significant effect on protein structure.
Third, skipping one for the moment, there is a breakpoint sequence in MERS S, 25215CAGAT, that is similarly very close in relative position to 25047CAGAT in SARS-CoV-2.
Most importantly, there is a fourth breakpoint sequence, between the receptor binding domain of MERS and the S1/S2 junction, that is precisely conserved in sequence and position with the corresponding location of SARS-CoV-2. This breakpoint, 23577CAGAC in MERS and 23300CAGAC in SARS-CoV-2 (double-underlined in the pdf) is identical in both viruses at the same relative position in the S1 protein sequence, where it defines an identical dipeptide, QT.
There is potential, therefore, based on closely apposed or identical breakpoint sequences, that the bulk of the S1A and S1B domains of the S1 attachment subunit could be exchanged between MERS and SARS-CoV-2 in any mixed infection. While there is no indication that this has occurred in the wild, during the SARS-CoV-2 pandemic we face the unique situation of SARS-CoV-2 being present simultaneously in a substantial number of human beings worldwide. Any simultaneous outbreak of MERS, within or without those areas where it has been previously found prevalent, could produce the kind of mixed infection in humans that we know has resulted in frequent recombination among coronaviruses in the wild or in captive populations of animals. To the viruses, there is no known theoretical difference.
Public health authorities should especially guard against simultaneous spread of more than one coronavirus in the human population at the same time and in the same locations.
I have posted a copy of my preprint entitled “A Palindromic RNA Sequence as Common Breakpoint Contributor to Copy-choice Recombination in SARS-CoV-2” to the ResearchSquare preprint site, as an interim measure while Archives of Virology finishes processing the paper for online publication.
The interim version may be accessed via:
Readers should be aware that I have already assigned copyright to the publisher once the online version appears.
In SARS-CoV-2, reference strain Hu-2, the nucleotide sequence, including the out-of-frame “12 nucleotide insert” encoding the furin site, is:
23582 tat cag act cag act aat t/ct cct cgg cgg g\ca cgt agt
encoding Y Q T Q T N S P R R A R S
The original form was presumably derived from a divergent relative of Bat RaTG13, specifically:
23582 tat cag act cag act aat tca cgt agt
encoding Y Q T Q T N S R S
The redundant breakpoint oligonucleotides CAGAC encoding QTQT (1) are key to this and all subsequent changes in this region over the last 16 months, creating a “hot spot” for genomic gymnastics at the S1/S2 interface of the spike protein.
I previously proposed that the bulk of the insert came from a downstream region of S in Bat CoV HKU9, involving an identical 10 nucleotides to the last part of the insert, leaving the first dinucleotide CT still orphan and unexplained (see 3 of 23 posts in this thread, from May 2020).
We have since seen, in the noncoding interface between orf 8 and N within the B lineage, additional evidence that the SARS-CoV-2 replicase is capable, even within the human population, of producing direct tandem repeats. This occurs just after the known splice acceptor breakpoint sequence ACGAAC. To wit:
This reinforces the two locations, comparing SARS-CoV-2 and Bat RaTG13, where a direct tandem repeat of three nucleotides in SARS-CoV-2 follows a CAGAC breakpoint location (1).
I now propose that the intermediate sequence in the insert involved only NINE identical nucleotides from the same region of HKU9 downstream in S encoding TSAG, but inserted here in a different frame, to yield the recombinant:
23582 tat cag act cag act aat t/ct cgg cgg g\ca cgt agt
Y Q T Q T N S R R A R S
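The three encodings quoted in this post can be checked mechanically. Below is a minimal Python sketch; the codon table is deliberately limited to the eleven codons that occur in these three sequences (standard genetic code), and the function and variable names are hypothetical. The `/` and `\` insert markers and the spaces are simply stripped before translation.

```python
# Only the codons that occur in the three quoted sequences (standard genetic code).
CODONS = {
    "tat": "Y", "cag": "Q", "act": "T", "aat": "N", "tca": "S", "tct": "S",
    "cct": "P", "cgg": "R", "gca": "A", "cgt": "R", "agt": "S",
}


def translate(annotated_seq: str) -> str:
    """Translate an in-frame sequence, ignoring spaces and the / \\ insert markers."""
    nt = "".join(ch for ch in annotated_seq.lower() if ch in "acgt")
    if len(nt) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return " ".join(CODONS[nt[i:i + 3]] for i in range(0, len(nt), 3))


# Sequences copied from the post (raw strings so the backslash marker is kept as typed).
OBSERVED = r"tat cag act cag act aat t/ct cct cgg cgg g\ca cgt agt"
PRESUMED_ORIGINAL = r"tat cag act cag act aat tca cgt agt"
PROPOSED_INTERMEDIATE = r"tat cag act cag act aat t/ct cgg cgg g\ca cgt agt"

for label, seq in (
    ("observed", OBSERVED),
    ("presumed original", PRESUMED_ORIGINAL),
    ("proposed intermediate", PROPOSED_INTERMEDIATE),
):
    print(f"{label}: {translate(seq)}")
```

This confirms only that the peptide lines match the nucleotide lines as written; it says nothing about how the sequences arose.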
While this insert created the furin site, it would be less accessible and inefficient without an additional amino acid in the peptide loop, particularly if that missing amino acid would be proline introducing a kink in the otherwise freely rotating peptide chain.
The KEY BREAKTHROUGH MUTATION fully enabling the furin site, and producing a SARS-CoV-2 with higher pathogenicity and transmissibility, would be the next step – a direct tandem repeat of ctc just downstream of the redundant CAGAC breakpoint sequence - finally yielding what was seen in the early clinical isolates of SARS-CoV-2 in Wuhan. The key was what I would call “the missing kink” in the direct precursor to the pandemic version of the virus.
That this region of sequence is a “hot spot” for mutation has been amply demonstrated by the multiple nucleotide and amino acid substitutions that have subsequently appeared independently in multiple sub-lineages of the virus while circulating in the human population.
Thus the 12 nucleotide insert occurred in TWO stages, a nine nucleotide recombinant followed later by a three nucleotide direct repeat. Each stage has thorough precedent in the genomic gymnastics of the coronavirus replicase, as well as sequential mutational events at this same site, as demonstrated in the known genetic rearrangements found among bat coronaviruses and even in SARS-CoV-2 while circulating in the human population.
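The two-stage proposal is at least internally consistent at the string level, which can be verified with a short Python sketch. The helper names are hypothetical; the insertion point (after the 19th nucleotide, i.e. after `aat t`) is read off the `/` marker in the sequences quoted above, and the 9-nt fragment is the text between the `/` and `\` markers of the proposed intermediate.

```python
def insert_at(seq: str, pos: int, fragment: str) -> str:
    """Insert `fragment` into `seq` after the first `pos` nucleotides."""
    return seq[:pos] + fragment + seq[pos:]


def tandem_duplicate(seq: str, start: int, length: int) -> str:
    """Directly repeat the `length` nucleotides that begin at 0-based index `start`."""
    unit = seq[start:start + length]
    return seq[:start + length] + unit + seq[start + length:]


# Sequences from the post, with spaces and insert markers removed.
PRESUMED_ORIGINAL = "tatcagactcagactaattcacgtagt"
OBSERVED = "tatcagactcagactaattctcctcggcgggcacgtagt"

# Stage 1: the proposed nine-nucleotide recombinant insert after 'aat t'.
stage1 = insert_at(PRESUMED_ORIGINAL, 19, "ctcggcggg")

# Stage 2: a direct tandem repeat of the 'ctc' that now begins the insert.
stage2 = tandem_duplicate(stage1, 19, 3)

print(stage1)
print(stage2)
print(stage2 == OBSERVED)
```

The net gain is 9 + 3 = 12 nucleotides, matching the length of the insert. The check shows that the proposed steps reproduce the observed sequence; it cannot by itself distinguish this route from other ways of arriving at the same 12 nucleotides.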
Combined with the earlier work of Boni et al (2) and my own earlier work (1), this scenario fully accounts for a natural origin for every single nucleotide in the SARS-CoV-2 genome, as well as for a breakthrough mutation that was the last step in enabling the pandemic potential of the virus.
(1) Gallaher WR. A palindromic RNA sequence as a common breakpoint contributor to copy-choice recombination in SARS-COV-2. Arch Virol. 2020 Oct;165(10):2341-2348. doi: 10.1007/s00705-020-04750-z. Epub 2020 Jul 31. PMID: 32737584; PMCID: PMC7394270.
(2) Boni MF, Lemey P, Jiang X, Lam TT, Perry BW, Castoe TA, Rambaut A, Robertson DL. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat Microbiol. 2020 Nov;5(11):1408-1417. doi: 10.1038/s41564-020-0771-4. Epub 2020 Jul 28. PMID: 32724171.