The comparative recency of the proximal ancestors of SARS-CoV-1 and SARS-CoV-2

Jonathan E. Pekar, Spyros Lytras, Andrew Magee, Jennifer L. Havens, Edyth Parker, Simon Dellicour, Joseph Hughes, Tetyana I. Vasylyeva, Philippe Lemey, David L. Robertson, Michael Worobey, Joel O. Wertheim


Horseshoe bats are the likely reservoir of sarbecoviruses (1), including SARS-CoV-1 and SARS-CoV-2 (2–4). Since the emergence of SARS-CoV-1 in 2002 and SARS-CoV-2 in 2019, there has been an increase in the sampling of sarbecoviruses in bats, which can reveal how recently SARS-CoV-1-like and SARS-CoV-2-like viruses sampled in bats shared a common ancestor with, respectively, SARS-CoV-1 and SARS-CoV-2 (jointly referred to as SARS-CoVs). Genome-wide sequence identity is typically used to compare sarbecoviruses to the SARS-CoVs, but, because sarbecoviruses frequently recombine (5, 6), whole-genome identity is likely insufficient to determine their recent evolutionary histories. Rather, it is the non-recombinant regions (NRRs) of SARS-CoVs that are most informative about the origins of these viruses. Here, we analyze the recombination patterns of the clade of SARS-CoV-1-like viruses and the clade of SARS-CoV-2-like viruses and their respective evolutionary histories. We find that the available sequence data provide evidence of an ancestor of each SARS-CoV circulating in bats only a few years prior to its emergence.

Recombination patterns of SARS-CoVs

To understand the recombination patterns of the SARS-CoVs, we first aligned 167 sarbecovirus genomes, including SARS-CoV-1 and SARS-CoV-2, using MAFFT (7). We next used GARD (8) on SARS-CoV-2 and the 18 bat and pangolin virus genomes closely related to SARS-CoV-2 (hereafter the “SARS-CoV-2-like clade”), inferring 27 putative NRRs, with a median length of 970 nucleotides (nt) and a range of 252 to 2836 nt. We also performed a separate recombination analysis using GARD on SARS-CoV-1 and a subset of 37 representative bat virus genomes closely related to SARS-CoV-1 (hereafter the “SARS-CoV-1-like clade”), resulting in 31 putative NRRs, with a median length of 866 nt and a range of 309 to 2195 nt.
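As an illustration of what partitioning an alignment into putative NRRs involves, the following is a minimal sketch that slices an aligned set of sequences at a list of breakpoints and summarizes the region lengths. The function names, the breakpoint convention (0-based column at which a new region starts), and the toy alignment are assumptions made for illustration; they do not reproduce GARD's output format or the pipeline used here.

```python
from statistics import median
from typing import Dict, List, Tuple

Region = Tuple[int, int, Dict[str, str]]


def split_into_nrrs(alignment: Dict[str, str], breakpoints: List[int]) -> List[Region]:
    """Slice an alignment into consecutive regions at the given breakpoints.

    `breakpoints` are 0-based column indices at which a new region starts
    (a hypothetical convention chosen for this sketch).
    """
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    # Keep only breakpoints that fall strictly inside the alignment.
    inner = sorted(b for b in set(breakpoints) if 0 < b < length)
    edges = [0] + inner + [length]
    regions = []
    for start, end in zip(edges[:-1], edges[1:]):
        sliced = {name: seq[start:end] for name, seq in alignment.items()}
        regions.append((start, end, sliced))
    return regions


def median_region_length(regions: List[Region]) -> float:
    """Median length, in alignment columns, of the regions."""
    return median(end - start for start, end, _ in regions)


# Toy alignment with made-up sequences and breakpoints.
toy_alignment = {"virus_a": "ACGTACGTAC", "virus_b": "ACGTTCGTAA"}
toy_regions = split_into_nrrs(toy_alignment, [4, 7])
for start, end, _ in toy_regions:
    print(f"region {start}-{end}: {end - start} columns")
```

Each resulting region can then be analyzed as its own alignment, which is what allows a separate tree and clock rate per NRR.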

Clock rate varies across NRRs

To estimate the date that human SARS-CoVs last shared a common ancestor with bat sarbecoviruses, we need to identify a suitable rate prior for the molecular clock. Because there is insufficient signal when using tip dating with sarbecoviruses isolated from bats and pangolins (6), we calibrated the molecular clock using SARS-CoVs isolated from humans.

We inferred the substitution rate of SARS-CoV-2 across the 27 NRRs of SARS-CoV-2-like viruses by using an empirical tree distribution (n=1000) from a previously published Bayesian phylogenetic analysis of 787 early pandemic genomes (9), sharing the topologies and substitution models across all the NRRs, using a prior of 9.2×10⁻⁴ substitutions per site per year, and allowing the clock rate to vary. We used the resulting substitution rates as region-specific rate priors for a subsequent Bayesian phylogenetic inference of SARS-CoV-2 and 26 SARS-CoV-2-like viruses. We inferred substitution rates that varied more than three-fold across the genome in this latter inference (Fig. 1A), with median substitution rates as slow as 5.3×10⁻⁴ (95% HPD: 3.3×10⁻⁴–7.4×10⁻⁴; NRR 12) and as fast as 2.2×10⁻³ (95% HPD: 8.0×10⁻⁴–3.5×10⁻³; NRR 3).

Because there are only 82 complete sequenced SARS-CoV-1 genomes, there is insufficient signal to properly calibrate the substitution rate of the SARS-CoV-1-like clade. We therefore inferred substitution rates from the SARS-CoV-2 phylogeny across the 31 NRRs of SARS-CoV-1-like viruses and then used them as region-specific rate priors for analyses of SARS-CoV-1 and 139 SARS-CoV-1-like viruses. The substitution rates of the SARS-CoV-1-like clade varied up to six-fold across the genome (Fig. 1B), with median rates as slow as 4.0×10⁻⁴ (1.4×10⁻⁴–7.6×10⁻⁴; NRR 11) and as fast as 1.9×10⁻³ (1.4×10⁻³–2.5×10⁻³; NRR 23), and they were, on average, slightly slower than those of the SARS-CoV-2-like clade.

Figure 1. Substitution rates for the two SARS-CoV-like clades. Substitution rates (substitutions/site/year) across the (A) 27 NRRs for the SARS-CoV-2-like clade and (B) 31 NRRs for the SARS-CoV-1-like clade. The dashed line in each panel is the respective median substitution rate. The dots and thick lines within the violins indicate the median and interquartile range.

Most recent NRR

Here, we are interested in the proximal ancestor of SARS-CoV-2: the point where SARS-CoV-2 attaches to the rest of the clade comprising the known SARS-CoV-2-like viruses, which can also be understood as the internal node of the SARS-CoV-2-like phylogeny that is immediately ancestral to SARS-CoV-2. Across the 27 NRRs, the time of the proximal ancestor of SARS-CoV-2 ranged from several years to several decades before late 2019, when SARS-CoV-2 was introduced into humans (Fig. 2A). Although the median time of the proximal ancestor of SARS-CoV-2 was 2007, the most recent time of the proximal ancestor was 2016 in NRR 3 (95% HPD: 2009–2019), only three years prior to the introduction of SARS-CoV-2 into humans (9). We note that although NRR 3 is only 259 nt, NRR 6 is 573 nt and has a similar time of the proximal ancestor of 2015 (95% HPD: 2008–2019).

As with SARS-CoV-2, the median time of the proximal ancestor of SARS-CoV-1 within each NRR varied across several decades (Fig. 2B), with an average of 1993. However, the most recent time of the proximal ancestor of SARS-CoV-1 was 2001 in NRR 14 (95% HPD: 1998–2002; 1065 nt), only one year before the emergence of SARS-CoV-1 in 2002.

Although accurate estimation of the ages of deeper nodes in the SARS-CoV phylogenies would require more sophisticated models accommodating saturation (10), we restrict our analyses to the most recent nodes of the tree. Our inferred tMRCAs of the NRRs indicate that a few of the published sarbecovirus genomes include non-recombinant fragments that are descended from viruses that circulated only a few years before the emergence of SARS-CoV-1 and SARS-CoV-2.
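For a sense of scale, the short back-of-the-envelope calculation below estimates how many substitutions would be expected along a single lineage in a short region over a few years. It assumes a simple strict clock (expected substitutions = rate × length × time) and is not the Bayesian model used in the analyses above; the function name is hypothetical. The inputs are the values reported above for NRR 3: a median rate of 2.2×10⁻³ substitutions per site per year, a length of 259 nt, and the three years between 2016 and 2019.

```python
def expected_substitutions(rate_per_site_per_year: float, length_nt: int, years: float) -> float:
    """Expected substitutions along one lineage under a strict clock.

    This is a simplifying assumption for illustration, not the
    region-specific Bayesian inference described in the text.
    """
    return rate_per_site_per_year * length_nt * years


# NRR 3 of the SARS-CoV-2-like clade, using values reported in the text.
nrr3 = expected_substitutions(2.2e-3, 259, 2019 - 2016)
print(f"NRR 3: about {nrr3:.1f} expected substitutions over 3 years")
```

With fewer than two expected substitutions in a region this short, wide 95% HPD intervals on the time of the proximal ancestor are unsurprising.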

Figure 2. Time of the proximal ancestor for the SARS-CoVs. Time of the proximal ancestor of (A) SARS-CoV-2 across the 27 NRRs of the SARS-CoV-2-like clade and (B) SARS-CoV-1 across the 31 NRRs of the SARS-CoV-1-like clade. The dashed line in each panel is the date of the earliest sampled respective SARS-CoV (24 Dec 2019 for SARS-CoV-2; 16 Nov 2002 for SARS-CoV-1). The lower panels of (A) and (B) are zoomed-in panels of approximately the twenty years before the emergence of the SARS-CoV-2 and SARS-CoV-1, respectively. The violins are the 95% HPD and the dots within the violins indicate the median.

The recCA becomes more similar to SARS-CoV-1 and SARS-CoV-2 with increased sampling

The ancestor of SARS-CoV-2 can be understood as the aggregate of the proximal ancestors of SARS-CoV-2 across each of its NRRs (9). This ancestor—here referred to as the recombinant common ancestor (“recCA”)—accounts for the closest relative(s) across all non-recombinant segments.

We reconstructed the recCA of SARS-CoV-2 as the genomes of closely related sarbecoviruses were chronologically published. That is, we reconstructed the recCA with just the SARS-CoV-2 reference genome and the earliest published genome of a non-SARS-CoV-2 virus from the SARS-CoV-2-like clade, and then we successively reconstructed the recCA while progressively adding genomes to the dataset based on the dates the genomes were published. We then examined the similarity of the recCA to SARS-CoV-2 as a function of the publication date, rather than sampling date (Fig. 3).

As more genomes were published, the recCA of SARS-CoV-2 became more similar to SARS-CoV-2 (Fig. 3A). Even once the recCA was more than 96.8% identical to SARS-CoV-2, exceeding the genetic identity shared between SARS-CoV-2 and BANAL-20-52 (the genome sampled in Laos in 2021 that is the most genetically similar virus to the SARS-CoV-2 genome), additional sarbecovirus genomes continued to increase the similarity of the recCA to SARS-CoV-2. After the publication of RaTG13 in January 2020 and before that of BANAL-20-52 in August 2021, the similarity of the recCA to SARS-CoV-2 continued to increase even though the genomes published in that time share less overall genetic identity with SARS-CoV-2 than RaTG13 does. These genomes were therefore able to provide a more closely related fragment, or at least a fragment descended from a more closely related ancestor, in a genome that is not, in aggregate, more closely related to SARS-CoV-2 than either RaTG13 or BANAL-20-52.
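The effect described above, in which a genome with lower overall identity still improves the mosaic, can be illustrated with a deliberately simplified sketch. It takes, for each region, the highest identity to the focal virus among the genomes published so far and combines the regions weighted by length. This is only a proxy under stated assumptions: the recCA itself is a reconstructed ancestor obtained from phylogenetic inference, not the best sampled fragment, and all genome names, dates, region lengths, and identities below are made up for illustration.

```python
from datetime import date
from typing import Dict, List, Tuple

# Hypothetical region lengths (nt) and per-region identities to the focal virus.
REGION_LENGTHS = [1000, 500, 1500]
GENOMES: Dict[str, Tuple[date, List[float]]] = {
    "genome_A": (date(2020, 1, 1), [0.97, 0.90, 0.96]),
    "genome_B": (date(2020, 6, 1), [0.92, 0.99, 0.93]),
}


def weighted_identity(identities: List[float], lengths: List[int]) -> float:
    """Length-weighted identity across regions."""
    return sum(i * n for i, n in zip(identities, lengths)) / sum(lengths)


def mosaic_identity_over_time(
    genomes: Dict[str, Tuple[date, List[float]]], lengths: List[int]
) -> List[Tuple[date, float, float]]:
    """Return (publication date, best whole-genome identity, mosaic identity).

    Genomes are added in order of publication date. The mosaic keeps the
    best identity seen so far in each region, independently of which
    genome supplied it.
    """
    best_per_region = [0.0] * len(lengths)
    best_whole = 0.0
    results = []
    for published, identities in sorted(genomes.values(), key=lambda g: g[0]):
        best_per_region = [max(b, i) for b, i in zip(best_per_region, identities)]
        best_whole = max(best_whole, weighted_identity(identities, lengths))
        results.append((published, best_whole, weighted_identity(best_per_region, lengths)))
    return results


for published, whole, mosaic in mosaic_identity_over_time(GENOMES, REGION_LENGTHS):
    print(f"{published}: best genome {whole:.4f}, mosaic {mosaic:.4f}")
```

In this toy example, genome_B is less similar overall than genome_A, so the best whole-genome identity does not change when it is added, yet the mosaic identity rises because genome_B contributes a closer second region.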

We observed similar patterns in our SARS-CoV-1 analysis (Fig. 3B). However, the sampling and publication of dozens of genomes of SARS-CoV-1-like viruses after 2015 have had a negligible effect on reconstructing a more genetically similar ancestor of SARS-CoV-1, whereas the largest increase in recCA similarity to SARS-CoV-2 happened in 2020. Regardless, our results indicate that the recCA of each SARS-CoV is very similar to the given SARS-CoV when accounting for all closely related sarbecoviruses, with the recCA of SARS-CoV-2 sharing 98.8% genetic identity with SARS-CoV-2 and the recCA of SARS-CoV-1 sharing 98.6% genetic identity with SARS-CoV-1.

Figure 3. The recCA (recombinant common ancestor) and the most closely related sarbecovirus over time. (A) The similarity of the recCA of SARS-CoV-2 and the most closely related sarbecovirus in the SARS-CoV-2-like clade to SARS-CoV-2 as a function of time. (B) The similarity of the recCA of SARS-CoV-1 and the most closely related sarbecovirus in the SARS-CoV-1-like clade to SARS-CoV-1 as a function of time. The right panels for (A) and (B) are zoomed-in panels of the dashed box in the left panels. The dashed vertical line in (A) is the sampling date of the earliest sampled genome of SARS-CoV-2.


Although whole-genome genetic similarity is frequently used as a proxy for evolutionary recency, it is the non-recombinant fragments that must be analyzed to properly understand the emergence of the SARS-CoVs. Whole-genome genetic distance misleadingly suggests decades of separation between the proximal ancestor and each of the SARS-CoVs (11). However, we show that non-recombinant fragments from published sarbecovirus genomes are descended from viruses that were ancestral to the SARS-CoVs and circulated as recently as 1–3 years before their emergence.

The viruses that gave rise to the SARS-CoVs have since experienced years of evolution and almost certainly recombined with yet unsampled sub-lineages of the sarbecovirus tree. The search for sarbecoviruses related to the SARS-CoVs should therefore be focused on detecting and characterizing similar genomic fragments that descend from a closely related ancestor, rather than viruses that are genetically similar across the entire genome.

By reconstructing the ancestor of the SARS-CoVs across these non-recombinant fragments, we show not only that the recCA of SARS-CoV-1 is as closely related to SARS-CoV-1 as the recCA of SARS-CoV-2 is to SARS-CoV-2, but also that increased sampling and publication of sarbecovirus genomes has allowed us to reconstruct more closely related ancestors over time. However, although the recCAs share similar genetic identities with their respective SARS-CoV descendants, the recCA of SARS-CoV-1 comprises a greater number of relatively recent fragments.


  1. L.-F. Wang, Z. Shi, S. Zhang, H. Field, P. Daszak, B. T. Eaton, Review of bats and SARS. Emerg. Infect. Dis. 12, 1834–1840 (2006).

  2. W. Ren, W. Li, M. Yu, P. Hao, Y. Zhang, P. Zhou, S. Zhang, G. Zhao, Y. Zhong, S. Wang, L.-F. Wang, Z. Shi, Full-length genome sequences of two SARS-like coronaviruses in horseshoe bats and genetic variation analysis. J. Gen. Virol. 87, 3355–3359 (2006).

  3. S. K. P. Lau, P. C. Y. Woo, K. S. M. Li, Y. Huang, H.-W. Tsoi, B. H. L. Wong, S. S. Y. Wong, S.-Y. Leung, K.-H. Chan, K.-Y. Yuen, Severe acute respiratory syndrome coronavirus-like virus in Chinese horseshoe bats. Proc. Natl. Acad. Sci. U. S. A. 102, 14040–14045 (2005).

  4. S. Lytras, J. Hughes, D. Martin, P. Swanepoel, A. de Klerk, R. Lourens, S. L. Kosakovsky Pond, W. Xia, X. Jiang, D. L. Robertson, Exploring the Natural Origins of SARS-CoV-2 in the Light of Recombination. Genome Biol. Evol. 14 (2022).

  5. X.-D. Lin, W. Wang, Z.-Y. Hao, Z.-X. Wang, W.-P. Guo, X.-Q. Guan, M.-R. Wang, H.-W. Wang, R.-H. Zhou, M.-H. Li, G.-P. Tang, J. Wu, E. C. Holmes, Y.-Z. Zhang, Extensive diversity of coronaviruses in bats from China. Virology. 507, 1–10 (2017).

  6. M. F. Boni, P. Lemey, X. Jiang, T. T.-Y. Lam, B. W. Perry, T. A. Castoe, A. Rambaut, D. L. Robertson, Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat. Microbiol. 5, 1408–1417 (2020).

  7. K. Katoh, D. M. Standley, MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

  8. S. L. Kosakovsky Pond, D. Posada, M. B. Gravenor, C. H. Woelk, S. D. W. Frost, GARD: a genetic algorithm for recombination detection. Bioinformatics. 22, 3096–3098 (2006).

  9. J. E. Pekar, A. Magee, E. Parker, N. Moshiri, K. Izhikevich, J. L. Havens, K. Gangavarapu, L. M. Malpica Serrano, A. Crits-Christoph, N. L. Matteson, M. Zeller, J. I. Levy, J. C. Wang, S. Hughes, J. Lee, H. Park, M.-S. Park, K. Ching Zi Yan, R. T. P. Lin, M. N. Mat Isa, Y. M. Noor, T. I. Vasylyeva, R. F. Garry, E. C. Holmes, A. Rambaut, M. A. Suchard, K. G. Andersen, M. Worobey, J. O. Wertheim, The molecular epidemiology of multiple zoonotic origins of SARS-CoV-2. Science. 377, 960–966 (2022).

  10. J. O. Wertheim, D. K. W. Chu, J. S. M. Peiris, S. L. Kosakovsky Pond, L. L. M. Poon, A case for the ancient origin of coronaviruses. J. Virol. 87, 7039–7045 (2013).

  11. WHO Headquarters, WHO-convened global study of origins of SARS-CoV-2: China Part (2021).


Hello, I’m looking at figure 3, and I was wondering if you could clarify what the dates are.

  1. The date of sampling?
  2. The date of sequencing?
  3. The date of publication?

I ask because RaTG13 was sampled in 2013 (and sequenced in 2018), so I would expect the “Genetic Identity” of the “most similar sampled sarbecovirus” in 2013 to be ~0.961, yet that is not what is shown (nor for 2018). Is the RaTG13 datapoint only shown in 2020?

If sampling date were used, then the patterns for SARS-CoV-2 and SARS-CoV-1 would look much more similar, no?