nCoV-2019 Spike Protein Receptor Binding Domain Shares High Amino Acid Identity With a Coronavirus Recovered from a Pangolin Viral Metagenomic Dataset

Fastas need to be gzipped to be uploaded.

_SARSlike_PlusWuhan_YunnanSPIKEPlusPangolin_CodonAlignedPROT.fasta.gz (9.1 KB)

The sequence names are not great in this file, but I included accession numbers so you can get more information on any sequence. The PDF image of the tree shows how the spike genes are related to each other.
BetaCoronaviruses_114_WuhanCladeHandAlignedPlusPangolin2_IQtreePDF.pdf (7.2 KB)
SASR_SARSlikePlusPangolinCodonAligned.FASTA.gz (638.1 KB)
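For anyone who wants to load the gzipped alignments without unpacking them first, here is a minimal sketch using only the Python standard library. The function name `read_fasta_gz` is hypothetical, and it assumes plain FASTA text inside the gzip file, with the whole header line used as the record name.

```python
import gzip

def read_fasta_gz(path):
    """Parse a gzipped FASTA file into a dict of {header: sequence}."""
    records = {}
    name = None
    # "rt" opens the gzip stream in text mode, so lines are str, not bytes.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep the full header, including any accession number.
                name = line[1:]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    # Join wrapped sequence lines into one string per record.
    return {key: "".join(parts) for key, parts in records.items()}
```

Gap characters are kept as they are, so the aligned columns stay in register across records.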

The codon-aligned complete genomes have a few small regions in individual sequences that are not “optimally” aligned. But overall, having a DNA alignment that translates to amino acids in one frame is useful for studying selection pressure and other things.
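To illustrate what “translates to amino acids in one frame” buys you, here is a rough sketch that translates a codon-aligned sequence three columns at a time. It assumes the standard genetic code and that gaps fall in whole-codon blocks; the function name is made up for this example, and codons with partial gaps or ambiguous bases are simply reported as `X`.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_codon_aligned(seq):
    """Translate an aligned DNA string codon by codon, starting at column 0."""
    seq = seq.upper().replace("U", "T")
    protein = []
    # Ignore any trailing one or two columns that do not form a full codon.
    for pos in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[pos:pos + 3]
        if codon == "---":
            protein.append("-")  # whole-codon gap stays a gap
        else:
            # Partial gaps or ambiguity codes are not in the table: emit X.
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)
```

Because every sequence shares the same frame, the translated rows line up as a protein alignment without realigning.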

The spike protein from the Wuhan strain is closer to RaTG13 overall, probably due to the S1-NTD subdomain being so different in the pangolin coronavirus. From the S1-CTD section on, the Wuhan strain and the pangolin strain are pretty similar (~97%), except for the furin cleavage site insertion.
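A figure like the ~97% above can be checked on any two rows of the alignment with something like the sketch below. It is only an illustration: the function name and the choice to skip columns where either sequence has a gap are my assumptions, and the column coordinates for S1-CTD would have to be looked up in the alignment itself.

```python
def percent_identity(seq_a, seq_b, start=0, end=None):
    """Percent identity over aligned columns [start, end), skipping gapped columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a[start:end].upper(), seq_b[start:end].upper()):
        if a == "-" or b == "-":
            continue  # a gap in either row is not counted as a comparison
        compared += 1
        if a == b:
            matches += 1
    # No comparable columns means identity is undefined.
    return 100.0 * matches / compared if compared else float("nan")
```

Whether gapped columns are skipped or counted as mismatches changes the number, which matters for a region with an insertion such as the furin cleavage site.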

It seems unlikely that the receptor binding domain–and especially the receptor binding motif–would be nearly identical to one found in pangolin through random chance.

Incidentally, pangolins were sold in the market at the center of all of this.

The 2013 Yunnan bat virus genome sequence is now available from GenBank with accession number MN996532.


I am Matthew Wong in Joe Petrosino’s lab at Baylor College of Medicine.

Many news outlets are now reporting pangolins as a potential source. I saw it first on Reuters, citing work from South China Agricultural University and claiming 99% genome similarity, but I have not seen a preprint or the genome. Has anyone been able to track this down?

There is a report in Nature today:

which links to the same paper, and short read data, that torptube already posted. So apparently they are planning to publish what Matthew Wong already put up here. But maybe they will have a slightly different assembly.

This is new data that’s not been released yet. Xiaowei Jiang @john_jxw posted a translation on Twitter from the South China Agricultural University press release: “The team analyzed more than 1,000 metagenomic samples and identified pangolins as potential intermediate hosts of the new coronavirus; then, through molecular biological testing, it revealed that the positive rate of beta coronavirus in pangolins was 70%. … By analyzing the genome of the virus, they found the sequence similarity between the isolated virus strain and the currently infected human strain is as high as 99%.”


The title of this post ideally needs updating, as Matthew Wong’s reconstructed virus from the dead pangolin metagenomic data from Liu et al 2019 is relatively close to the 2019-nCoV lineage throughout its genome, not just in Spike. It’s not, however, closer than the RaTG13 genome from bats.

Thank you, Brian! I saw that but was left curious to see the data. Matt and I worked on this together with a couple more people from Baylor College of Medicine. The analyses were all based on available datasets, and we wish we had found more datasets from pangolins. Maybe the SCAU investigators have more data. Let’s see!

Dear Najami,

My post from 6 days ago has my alignment of the pangolin virus genome to over 100 other Sarbecovirus genomes (SASR_SARSlikePlusPangolinCodonAligned.FASTA.gz above). Sorry for the typos and funny file name, it was supposed to be SARS and SARS-like viruses. Anyway, I have tweaked that alignment a bit more in the past 6 days, so if you want a very slightly improved version, write to me. I also have similarity plots and trees and other stuff like that.

Here’s a link to our recent preprint on this topic - Matt Wong (torptube) ran this analysis.

There are a couple more preprints that assemble a partial genome from the Liu et al 2019 pangolin coronavirus data: Wahba et al. and Lam et al. The Lam paper also reports a 2nd lineage of pangolin sarbecoviruses:

The important question would seem to be whether the RBD similarity to SARS-CoV-2 is due to recombination or convergent evolution.

I think it is recombination because there is a background of other synonymous mutations.
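One way to make this argument concrete: convergent evolution would be expected to act on the amino acids, whereas a tract inherited through recombination should also match at silent sites. A rough sketch of tallying synonymous versus nonsynonymous codon differences between two codon-aligned sequences follows. The function name is hypothetical, it uses the standard genetic code, and it counts whole codons that differ rather than estimating dN/dS properly.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons."""
    syn = 0
    nonsyn = 0
    usable = min(len(seq_a), len(seq_b))
    for pos in range(0, usable - 2, 3):
        codon_a = seq_a[pos:pos + 3].upper()
        codon_b = seq_b[pos:pos + 3].upper()
        # Skip identical codons and any codon with gaps or ambiguity codes.
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

Running this over the RBD columns for each pair of genomes gives a crude picture of how much of the difference between them is silent.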

You can see the bit in RaTG13 in the region by the key sites of the RBD (top row of each sequence is nucleotides, bottom amino acids). Also there is a recombinant region in the pangolin at the 5’ end of spike (it doesn’t extend into ORF1b).

So my reading of this is that the common ancestor of RaTG13, the pangolin and the lineage that leads to SARS-CoV-2 might have had those RBD residues but the bat got a recombination in and lost them.

Thanks @david.l.robertson -
Matt also pointed out this one by the group who generated the pangolin datasets:


I think it is important to keep in mind that the distances, and thus the inferred dates of the common ancestors of these “closely related” viruses (the RaTG13 Yunnan bat virus and the pangolin viruses), are quite a way back in the past, as has been discussed on another thread here. These viruses are known to recombine, but I don’t think we have a good idea yet what type of timescale is involved. Do they recombine on average once every 40 years, or once every 400 years? The bioRxiv paper by Lam et al has trees built from subgenomic regions indicating that in many parts of the genome there seem to be two major clades. But recombination between those clades has happened, as is illustrated by the similarity plot of the YN2018A.MK211375 virus attached here.

In MERS the evidence is that recombination is frequent, often involves relatively small tracts, and occurs over the course of the few years we have been observing it in camels:

Recombination does look like something coronaviruses do in the extreme! The new, more complete pangolin virus genome from Xiao et al of the South China Agricultural University seems to confirm this. Note that the “as close as 1%” figure (that is, 99% similarity) they reported in their press release for the pangolin virus was misleading, as other viruses are also very close in the region they reported on; see their Table 1:

(annotations in red added by Xiaowei Jiang). It is still an interesting paper. The evidence is definitely accumulating that the pangolin is the intermediate host.

Does SimPlot ignore gaps in alignments?

I used a gap-stripped alignment for my similarity plots. So yes, the simplot I posted ignored gaps, because they were already stripped out.
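For anyone wanting to reproduce this kind of plot outside SimPlot, here is a rough sketch of the two steps: stripping gapped columns, then computing similarity in a sliding window. It is not what SimPlot does internally; it uses plain percent identity rather than a distance model, and the function names, window size and step size are illustrative assumptions.

```python
def strip_gap_columns(seqs):
    """Remove every alignment column in which any sequence has a gap."""
    keep = [i for i in range(len(seqs[0])) if all(s[i] != "-" for s in seqs)]
    return ["".join(s[i] for i in keep) for s in seqs]

def window_similarity(query, ref, window=200, step=20):
    """Return (window midpoint, percent identity) pairs for two gap-free rows."""
    points = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        r = ref[start:start + window]
        identical = sum(a == b for a, b in zip(q, r))
        points.append((start + window // 2, 100.0 * identical / window))
    return points
```

Note that stripping is done across all sequences at once, so the coordinates on the x-axis are positions in the stripped alignment, not in any one genome.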