I attempted to replicate the recombination analysis on an alignment provided by Trevor Bedford (from this Nextstrain tree). The alignment length is 29874 nt and the number of polymorphic sites is 11586.
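For anyone who wants to check the polymorphic-site count on their own copy of the alignment, here is a minimal sketch. It assumes a site is polymorphic when at least two distinct states remain after ignoring gaps and `N`; 3SEQ's own definition may differ, so the numbers need not match exactly. The function name and the `ignore` parameter are just illustrative.

```python
def count_polymorphic_sites(seqs, ignore="-N?"):
    """Count alignment columns with more than one distinct state.

    Assumption: gaps and ambiguity characters listed in `ignore` are
    skipped; this is not necessarily how 3SEQ counts polymorphic sites.
    """
    if not seqs:
        return 0
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    ignored = set(ignore)
    count = 0
    # zip(*seqs) walks the alignment one column at a time
    for column in zip(*seqs):
        states = {c.upper() for c in column} - ignored
        if len(states) > 1:
            count += 1
    return count
```

The input is a plain list of equal-length strings, so any FASTA reader can be used to load the alignment first.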
I used the new 3SEQ (v1.7) algorithm, which can handle alignments with thousands of polymorphic sites. Source code for 3SEQ is available at http://mol.ax/3seq. RDP4 includes the previous version of 3SEQ (v1.1), which can handle alignments with up to about 1000 polymorphic sites; the exact limit depends on how many of these polymorphic sites are classified as recombination-informative.
I have uploaded the key analysis files to this public folder:
https://filedn.com/lexUxgNbsuUbyRIygA2lmeR/nCoV2019/
3SEQ inferred breakpoints in the same general regions as David and Xiaowei’s analysis above, though slightly offset:
Breakpoint 1: 13280 or 13570
Breakpoint 2: 18635
These are really reported as breakpoint ranges, with all detail shown in the “3s.rec” files that I posted at the filedn.com link above. I generated trees for these sub-regions (link to PDF) and it looks like the outer parts of the nCoV2019 genome have ancestry/origins in the bat viruses included in this alignment, but there is no clear ancestry signal for the middle part (13500-18500) of the alignment. This suggests that there are more coronaviruses to be added to our alignments and phylogenies.
However, just because these recombination analyses give us breakpoints, this does not mean that the inferred sub-regions are free of recombination. I went deeper into the recombination analysis by running 3SEQ iteratively on the three inferred subregions (1-13280, 13281-18634, and 18635-29874), and then additionally on sub-sub-regions of these sub-regions, etc. It was very hard to find a subsegment of the genome that did not have a recombination signal, suggesting that the coronaviruses as a family are highly recombinant. Even when sectioning the genome down to 2kb or 3kb regions (using breakpoints given by 3SEQ sub-analyses) all of these small regions had evidence of mosaic recombination signals (the signals detected by 3SEQ). I did not test for phylogenetic recombination signals in these subsegments simply due to time constraints.
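The sub-region bookkeeping for this kind of iterative run can be sketched roughly as follows. This is only an illustration of the coordinate handling, assuming 1-based inclusive positions and an alignment held as a list of equal-length strings; the helper names are made up, and running 3SEQ on each slice is left out.

```python
def subregions(length, starts):
    """Split positions 1..length into 1-based inclusive (start, end) regions.

    `starts` holds the first position of every region after the first,
    e.g. [13281, 18635] for the three regions described above.
    """
    starts = sorted(set(starts))
    if any(s <= 1 or s > length for s in starts):
        raise ValueError("region starts must lie in 2..length")
    bounds = [1] + starts + [length + 1]
    return [(a, b - 1) for a, b in zip(bounds, bounds[1:])]


def slice_alignment(seqs, start, end):
    """Return the columns start..end (1-based, inclusive) of each sequence."""
    return [s[start - 1:end] for s in seqs]


# The three regions from the first 3SEQ pass
regions = subregions(29874, [13281, 18635])
```

Each slice would then be written out and passed to 3SEQ again, with any new breakpoints fed back into `subregions` using the slice's own coordinates.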
A high level of recombination in coronaviruses is consistent with a past analysis that we did on a MERS alignment of 164 genomes (http://mol.ax/pdf/lam18a.pdf), which found that the majority of these genomes show evidence of recombination. If the recombination rate really is this high, then we need to move to the family of inferential methods that estimate recombination rates rather than breakpoints. It may also mean that the origins will be really hard to pin down unless we find a clade of coronaviruses in GenBank that has a relatively weak recombination signal when compared with nCoV2019 (indicating recent ancestry).
I’m traveling right now in Vietnam, and a bit hampered with websites being blocked and VPN issues, but will try to keep up if anyone has questions.