nCoV's relationship to bat coronaviruses & recombination signals (no snakes) - no evidence the 2019-nCoV lineage is recombinant

With Xiaowei Jiang at XJTLU we’ve carried out a preliminary analysis to characterise the evolutionary origins of the Wuhan virus, nCoV. The focus of our analysis is on the Wuhan-Hu-1 virus (accession no. MN908947, released on GenBank by Shanghai Public Health Clinical Center and School of Public Health, Fudan University, Shanghai, China), as all nCoV genomes cluster together and so share the same evolutionary ancestry. It’s clear from phylogenetic analysis that the new human virus is most closely related to bat coronaviruses in the genus Betacoronavirus. While this is apparent from both the previously reported BLAST and full-genome phylogenetic analyses, the closest related bat viruses (CoVZC45 and CoVZXC21) are in fact recombinants with shared breakpoints either side of ORF1b:

The Wuhan-Hu-1 virus consistently clusters as a sister group to the SARS-related bat coronaviruses. Interestingly, a third bat coronavirus (Longquan_140) is a recombinant involving the Wuhan virus lineage in part of ORF1a.

This analysis has detected three bat coronavirus recombinants (two with shared breakpoints) involving the nCoV lineage, indicating greater diversity in the Chinese Sarbecovirus group than previously appreciated. The clustering of the related Sarbecovirus viruses from Kenya and Europe suggests the Wuhan virus is still part of the Sarbecovirus subgenus, and these recombination events probably occurred in bats, although, given the propensity of coronaviruses to switch hosts, involvement of another species cannot be discounted. There is also a very good chance that a non-bat intermediate species is responsible for the beginning of the current outbreak in Wuhan. Given the tight clustering of the nCoV viruses in phylogenetic trees, it seems most likely that a single event has occurred.

Several of these bat coronaviruses have previously been detected to be recombinant, underscoring the importance of doing appropriate analysis when studying these viruses using phylogenetic methods. Recombination, in this case between divergent coronaviruses circulating in bats, violates our assumption of a single evolutionary tree and so needs to be considered carefully when inferring coronavirus evolution from complete genome alignments. We’re looking into the patterns of breakpoints to see if there are any clues to the significance (or not) of these recombination events.

We’d like to thank the researchers and health professionals for making the nCoV data available. Credit also needs to be given to the surveillance projects for generating the data that are now available for comparison, and to the software developers for making the tools we’ve used freely available: FigTree, GARD, MAFFT, PhyML and RDP4.


Note, Spike is at positions 21717 to 25693 in our diversity plot and recombination analysis, so it lies to the right of the recombination breakpoint in the bat viruses CoVZC45 and CoVZXC21. In a Spike phylogeny nCoV clusters with these bat viruses. There is no evidence of snakes being involved as incorrectly reported here!

Hi David
Thanks for sharing this. Interesting dive into the hidden world of these viruses in their reservoir (presumably). I guess there will be insufficient sampling of bat viruses to dabble at when this may have occurred?

Would also like to hear your opinion on the "snake" paper. I see it criticised but am not familiar enough with the specific analyses to make a real assessment.


Yes, there’s a reasonable set of bat viruses relatively closely related to SARS. Interestingly, the nCoV lineage appears to be a sister group to these, and it’s the SARS-related bat viruses that have picked up part of this lineage (on at least two occasions) in our analysis. This would suggest there’s more diversity than currently detected and these viruses are most probably in bats. We can’t, however, conclude anything about an intermediate host from this analysis.

On the ‘snake paper’: they’ve correctly detected that there’s recombination in the data set but then get the breakpoint wrong. If you look at our diversity plot across the Spike region (21717 to 25693), although it’s harder to call which virus is closest due to the increase in diversity, it’s still clear that there’s been a switch back to the bat coronavirus CoVZXC21 being the closest. However, in their figure they’ve concluded from this that it’s the Spike that’s part of the recombination region, i.e., the breakpoint being to the right, which is not the case and not where the breakpoint is detected by the more sophisticated maximum likelihood method GARD (or other methods in RDP). The bootstrapping method they mention but don’t show is notorious for not detecting breakpoints reliably. Worse, they’ve concluded it’s the Wuhan lineage that’s recombinant when in fact it’s the bat viruses that are changing their phylogenetic position and no longer clustering with Wuhan over ORF1b. On their linking the virus codon usage to snakes: if you look at their figure 3A, it shows that nCoV and the bat virus cluster together but not with snakes, so their own analysis doesn’t support this conclusion. There is thus no evidence that snakes are involved.


Sincere apologies, there was an error in the top figure. The tree on the very right, for the region 20954 - 29903, was showing the tree for the region 1680 - 3014 from further down in the post. This has been updated.

Here’s a CSV file of the “codon usage” table from the JMV paper, in case anyone wants to check it out. Data entry is very calming, I can recommend it. Please let me know if you see any errors.

amino acid,codon,beta CoV Wuhan WIV04,bat SL CoV ZC45,Bungarus multicinctus,Naja atra,Marmota,Erinaceus europaeus,Manis javanica,Rhinolophus sinicus,Gallus gallus,Homo sapiens
Met,AUG,1,1,1,1,1,1,1,1,1,1

Hi David - do you have your alignment posted somewhere? I will run this through the new 3SEQ (not in RDP4 yet) which can compute exact breakpoints for alignments with >5000 polymorphic sites. New version is here: and it has recombination results for MERS-CoV.


If anyone is interested in doing it, the obvious thing to do with respect to the ‘snake’ analysis is to look at a few more coronaviruses that have known hosts, such as MERS (camel), bovine coronavirus (bovines), bat coronaviruses (bats) etc. My guess is that snakes have a particular codon usage bias that just happens to be in the same direction as coronaviruses (in general).


How conserved is codon usage in vertebrates? Perhaps there’s plenty of codon usage biases to pick from within snakes (or other vertebrates).

I did a quick comparison using coding sequences from Ensembl genomes, so the snakes there are Eastern Brown Snake and Mainland Tiger Snake. Using the same Euclidean distance on RSCU values as in the paper, it does look like snakes are closer to the Wuhan coronavirus than human, bats, cow, pig, cat, dog or chicken. But I get the same result for MERS and SARS and, as Andy pointed out, also Ebola. Just trying to automate it a bit to share the results tables.
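For anyone who wants to repeat this kind of comparison, here is a minimal sketch of RSCU (relative synonymous codon usage) and the Euclidean distance between two RSCU vectors. It assumes the standard genetic code and plain in-frame CDS strings as input; the function names `rscu` and `rscu_distance` are just illustrative, and this is not the exact pipeline used for the numbers above or in the paper.

```python
from collections import Counter, defaultdict
from itertools import product
from math import sqrt

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> amino acid ('*' marks stop codons).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """RSCU per sense codon, pooled over a list of in-frame coding sequences.

    RSCU = observed codon count / mean count over its synonymous codons.
    Codons of an amino acid that never occurs get 0.0 (an assumption of
    this sketch; other tools may handle that case differently).
    """
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, codons in families.items():
        if aa == "*":
            continue  # stop codons are left out
        total = sum(counts[c] for c in codons)
        for c in codons:
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values


def rscu_distance(a, b):
    """Euclidean distance between two RSCU dictionaries over the same codons."""
    return sqrt(sum((a[c] - b[c]) ** 2 for c in a))
```

Calling `rscu_distance(rscu(virus_cds), rscu(host_cds))` for each candidate host gives the kind of ranking discussed here. With only a few dozen CDSs per species the RSCU values will be noisy, so the ranking should be read with caution.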


I have a pipeline set up for CAICal - I’ll run some quick analyses looking at MERS, SARS, and nCoV against a bunch of species. Will post in a few hours.

As I’m setting up these analyses, I have already identified the main problem - the codon tables they used for the snakes (Naja atra and Bungarus multicinctus) are highly biased. Specifically, these codon tables are built on only 57 and 59 CDSs, which typically leads to completely wrong estimates of codon adaptation (which is how they link nCoV to snakes). Compare this to the human codon table (93487 CDSs) or Chicken (6017 CDSs - a notoriously badly undersampled genome).

Given how many genes snake genomes have, one can’t possibly create a representative codon table from fewer than 60 genes.

Also, remember there is variation in base composition across any one genome, and also between genomes.

Highly correlated (as in almost perfectly so). E.g., see Fig S5 from our Lassa paper:

We looked at many others not in the paper too.

I attempted to replicate the recombination analysis on an alignment provided by Trevor Bedford (from this nextstrain tree). The alignment length is 29874 nt and the number of polymorphic sites is 11586.
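As a rough illustration of what "polymorphic sites" means here, the sketch below counts alignment columns that contain more than one nucleotide state. It assumes the alignment is already loaded as a list of equal-length strings and that gaps and ambiguous characters (`-`, `N`, `?`) are ignored; 3SEQ's own counting rules may differ, so the number need not match its output exactly.

```python
def count_polymorphic_sites(aligned_seqs, ignore="-N?"):
    """Return (alignment_length, number_of_polymorphic_columns).

    A column counts as polymorphic when it contains at least two distinct
    characters after dropping those listed in `ignore`.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")

    skip = set(ignore)
    polymorphic = 0
    for column in zip(*aligned_seqs):
        states = {base.upper() for base in column} - skip
        if len(states) > 1:
            polymorphic += 1
    return length, polymorphic
```

Running this on the full alignment should give numbers in the same ballpark as the ones quoted above, depending on how gaps are treated.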

I used the new 3SEQ (v1.7) algorithm, which can handle alignments with thousands of polymorphic sites. Source code for 3SEQ is here. RDP4 has the previous version of 3SEQ (v1.1), which can handle alignments with up to about 1000 polymorphic sites; this depends on how many of these polymorphic sites are classified as recombination-informative.

I have uploaded the key analysis files to this public folder:

3SEQ inferred breakpoints in the same general regions but a little bit off from David and Xiaowei’s analysis above:

Breakpoint 1: 13280 or 13570
Breakpoint 2: 18635

and these are really reported as breakpoint ranges, with all detail shown in the “3s.rec” files that I posted at the link above. I generated trees for these sub-regions (link to PDF) and it looks like the outer part of the nCoV2019 genome has ancestry/origins in the bat viruses included in this alignment, but there is no clear ancestry signal for the middle part (13500-18500) of the alignment. This suggests that there are more coronaviruses to be added to our alignments and phylogenies.

However, just because these recombination analyses give us breakpoints, this does not mean that the inferred sub-regions are free of recombination. I went deeper into the recombination analysis by running 3SEQ iteratively on the three inferred subregions (1-13280, 13281-18634, and 18635-29874), and then additionally on sub-sub-regions of these sub-regions, etc. It was very hard to find a subsegment of the genome that did not have a recombination signal, suggesting that the coronaviruses as a family are highly recombinant. Even when sectioning the genome down to 2kb or 3kb regions (using breakpoints given by 3SEQ sub-analyses) all of these small regions had evidence of mosaic recombination signals (the signals detected by 3SEQ). I did not test for phylogenetic recombination signals in these subsegments simply due to time constraints.
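The sub-region step can be sketched as a simple slicing of the alignment at the breakpoints, with each slice then written out and passed to the recombination tool again. This is only an illustration: the alignment is assumed to be a dict of name to aligned sequence, breakpoints are given as the 1-based first position of each new segment, and `split_by_breakpoints` is a hypothetical helper, not part of 3SEQ.

```python
def split_by_breakpoints(alignment, breakpoints):
    """Cut an alignment into consecutive sub-alignments.

    alignment   : dict mapping sequence name -> aligned sequence
    breakpoints : 1-based positions at which a new segment starts
    Returns a dict mapping (start, end), 1-based inclusive, to the
    sub-alignment covering those columns.
    """
    length = len(next(iter(alignment.values())))
    cuts = sorted(breakpoints)
    if cuts and not (1 < cuts[0] and cuts[-1] <= length):
        raise ValueError("breakpoints must lie within 2..alignment length")

    starts = [1] + cuts
    ends = [c - 1 for c in cuts] + [length]
    segments = {}
    for start, end in zip(starts, ends):
        segments[(start, end)] = {
            name: seq[start - 1:end] for name, seq in alignment.items()
        }
    return segments
```

With breakpoints `[13281, 18635]` on a 29874-column alignment this yields the three sub-regions listed above (1-13280, 13281-18634 and 18635-29874); applying it again to each piece with that piece's own breakpoints gives the sub-sub-regions.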

A high level of recombination in coronaviruses is consistent with a past analysis that we did on a MERS alignment of 164 genomes, showing that the majority of these genomes show evidence of recombination. If the recombination rate really is this high, then we need to move to the family of inferential methods that calculate recombination rate rather than breakpoints; it may also mean that the origins will be really hard to pin down unless we find a clade of coronaviruses in GenBank that has a relatively weak recombination signal when compared to nCoV2019 (indicating recent ancestry).

I’m traveling right now in Vietnam, and a bit hampered with websites being blocked and VPN issues, but will try to keep up if anyone has questions.

Will put online soon. We hadn’t yet as the alignment would be a secondary data release and it’s our understanding, despite the data being made public, there was some sensitivity about this.

The recombination breakpoints in our analysis are from an analysis with GARD in the HyPhy package using the 11 sequences listed in the diversity plot. In recombination analysis the exact locations of the breakpoints detected will differ depending on the reference sets and the method/software used. There are also other recombination events in the bat coronaviruses (widely reported) that could be causing issues, and we detected these with both GARD and RDP. This post’s purpose was to focus on the nCoV lineage. We will provide full details of our analysis in a pre-print very soon.

Very relevant pre-print ‘Discovery of a novel coronavirus associated with the recent pneumonia outbreak in humans and its potential bat origin’: Zhou et al. report a closer bat betacoronavirus, RaTG13, that is similar to nCoV across the region where CoVZC45 and CoVZXC21 are recombinant/not clustering with nCoV. Relatively high sequence identity in Spike is reported too.

Motivated by David’s post and Maciej’s post on the extent of recombination, I explored what could potentially still be done in a phylogenetic framework. In line with the recombination results posted, a Neighbor-Net with splits filtered to keep only those with > 95% bootstrap support indicates reticulate evolution:

and the PHI-test provides significant evidence for recombination. I attempted to heuristically filter the recombination signal out of the alignment by performing an RDP4 analysis (based on the 5 recombination methods selected by default plus 3Seq, requiring evidence from 3 methods to call recombination), and keeping only the major non-recombinant stretches in each genome (the remaining parts are masked with N’s). Some additional manual editing was done to remove regions that were left with only very little sequence information. Applying the same network reconstruction procedure to this filtered alignment now provides a tree:

And also the PHI-test does not find significant evidence for recombination anymore. So, despite the extent of recombination, there may still be a prospect for removing the major recombination signal in order to perform phylogenetics.
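The masking step itself is simple to reproduce once the recombinant regions are known. Below is a minimal sketch, assuming the regions for each genome are available as 1-based inclusive alignment coordinates; the `mask_regions` helper and its input format are hypothetical and do not correspond to RDP4's output files.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Replace the given regions of an aligned sequence with `mask_char`.

    regions : iterable of (start, end) pairs, 1-based and inclusive.
    Gap characters ('-') are left untouched so the alignment columns
    and total length stay exactly as they were.
    """
    chars = list(sequence)
    for start, end in regions:
        if not 1 <= start <= end <= len(chars):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        for i in range(start - 1, end):
            if chars[i] != "-":
                chars[i] = mask_char
    return "".join(chars)
```

For example, `mask_regions("ACGTACGTAC", [(2, 4), (9, 10)])` returns `"ANNNACGTNN"`. Because the sequence length is preserved, the masked genomes can go straight back into the same alignment for network or tree reconstruction.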


Thanks Philippe. I took your new alignment and ran it through 3SEQ. It has a much weaker recombination signal. 3SEQ detected 38 recombination events in the alignment with the blanked out or removed sections, but in David’s original alignment around 2700 were detected.

The breakpoints detected (that are relevant for the Wuhan nCoV) were around positions 900, 1650, 4800, 6420, and 9600. I built trees for the sub-regions defined by these breakpoints and they do not show any phylogenetic recombination signal that would be relevant to nCoV. They just show nCoV’s relatedness to two bat coronaviruses in the tree: CoVZC45 and CoVZXC21.

Trees are located here: nCoV2019_TreesOnSixCandidateRecombinantSegments.pdf (20.9 KB)

I think we’ll need an alignment of a larger sequence set, but this is already a good start for reconstructing origins.
