Yeah, we can’t a priori assume that a divergent sequence is wrong, but for all sequences in this set where there were multiple mutations, many (all) of them were clearly the result of sequencing errors. Even after taking out the most ‘offensive’ sequences, I’m still fairly certain ~50% of SNPs are sequencing errors.
Doing some quick calculations on a SARS alignment from the 2002 epidemic, I get an N/S rate of ~0.5 and a Ts/Tv of ~ 3.5 - meaning that while synonymous and/or transitions are more frequent, but maybe not as much as we would typically see with other RNA viruses (e.g., Ts/Tv typically ~8). Just as a reference for trying to weed out sequencing errors and bad sequences.
https://nextstrain.org/ncov has been updated with 6 new genomes sampled from Guangdong shared by the Guangdong Provincial Center for Diseases Control and 3 new genomes sampled from Wuhan shared by Institute of Pathogen Biology, Chinese Academy of Medical Sciences.
We now observe clustering of related infections. These are a cluster of two infections in Zhuhai (Guangdong/20SF028/2020 and Guangdong/20SF040/2020) and a cluster of three infections in Shenzhen (Guangdong/20SF013/2020, Guangdong/20SF025/2020, Guangdong/20SF012/2020). These are noted in GISAID as “family cluster infection” and reflect two separate family clusters (https://twitter.com/JingLu_LuJing/status/1220143773532880896). This almost certainly represents human-to-human transmission.
Hopefully this isn’t too off-topic but I thought I’d put it here in case others are also interested in this application of sequences - we are looking into starting animal model work and are not sure what viruses we could get or whether we’d need to try and make our own via reverse genetics from synthetic gene constructs. Based on what we know now, is there a sense of which sequence or sequences would be best to use?
A 2013 bat Betacoronavirus (GISAID accession EPI_ISL_402131) was recently uploaded to GISAID. It is much more closely related to the 2019-2020 human outbreak viruses than the previously available bat virus genomes.
The publication of this sequence is available at the BioRxiv:
There is recombination between many of these viruses, so the phylogenetic trees built from complete genomes is not an accurate reflection of the evolutionary history of the various lineages.
A coronavirus was recently found in short read sequence data from a pangolin viromics data set. I am attaching an alignment of the complete genomes, a maximum likelihood tree built from just the spike glycoprotein gene, that does not include the pangolin sequence and a tree built from complete genomes that does include the pangolin sequence. The 2013 Yunnan bat virus sequence is still closer to nCoV than the pangolin virus sequence, but the pangolin virus is quite close. The 2013 Yunnan bat virus is removed from the alignment uploaded here, because it is GISAID data.
The 2013 Yunnan bat Coronavirus genome sequence is now available from GenBank, with accession number MN996532. So I am now uploading an alignment with this sequence included.