Novel 2019 coronavirus genome

edward_holmes · January 11, 2020, 1:05am

10th January 2020
This posting is communicated by Edward C. Holmes, University of Sydney on behalf of the consortium led by Professor Yong-Zhen Zhang, Fudan University, Shanghai

The Shanghai Public Health Clinical Center & School of Public Health, in collaboration with the Central Hospital of Wuhan, Huazhong University of Science and Technology, the Wuhan Center for Disease Control and Prevention, the National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control, and the University of Sydney, Sydney, Australia is releasing a coronavirus genome from a case of a respiratory disease from the Wuhan outbreak. The sequence has also been deposited on GenBank (accession MN908947) and will be released as soon as possible.

Update: This genome is now available on GenBank and an updated version has been posted.

Disclaimer:
Please feel free to download, share, use, and analyze this data¹. We ask that you communicate with us if you wish to publish results that use these data in a journal. If you have any other questions –then please also contact us directly.

Professor Yong-Zhen Zhang,
Shanghai Public Health Clinical Center & School of Public Health,
Fudan University,
Shanghai, China.

email: [email protected]

¹ We know that “data” is plural but we were in a hurry.

arambaut · January 12, 2020, 12:14pm

Five new genomes have been deposited in the GISAID platform:
https://gisaid.org/CoV2020

trvrb · January 14, 2020, 12:36am

A preliminary phylogenetic analysis of these 6 genomes is available at https://nextstrain.org/groups/blab/sars-like-cov. The pipeline to generate this analysis is openly available at https://github.com/blab/sars-like-cov.

arambaut · January 14, 2020, 11:52am

Thanks @trvrb. A word of caution about interpreting this tree. I am almost certain that the divergent sequence IVDC-HB-05/2019 is divergent because of sequencing and assembly artefacts. I strongly suggest not making any epidemic inferences from the 6 genomes available at the moment.

I have contacted the authors of this sequence but have not had a reply yet.

arambaut · January 14, 2020, 11:55am

The SNPs in IVDC-HB-05/2019 are majority non-synonymous and non-sense:

IVDC-HB-04/2020 is also suspect - it has 5 non-synonymous mutations and 3 synonymous:

arambaut · January 14, 2020, 12:02pm

IVDC-HB-01/2019 has been cell cultured with one round of passaging. This should be considered the most reliable. It may could have cell adaptations but it is identical to WIV04/2019 which is direct sequenced so if independent, suggests there are no cell adaptations.

The first genome WH-Human_1 has one SNP difference from all the others which may mean it is real. However it is not known if this genome is from a sample from one of the same patients as the other 5.

richard.neher · January 14, 2020, 1:14pm

IVDC-HB-04/2020 is also suspect - it has 5 non-synonymous mutations and 3 synonymous

The nextstrain tool-tip is misleading here. The reference used has over-lapping annotations ORF1a and ORF1ab. There is a total of 3 mutations inferred for this branch. C1023T, C1025T, A18460G
The first two change the aa sequence of ORF1a in coding 253 and 245 (and these are the same as the mutations listed in ORF1ab).

I was mistaken. This is wrong:

The last mutation is synonymous also in ORF1ab after the slippage site.
So: 2 adjacent non-synonymous, 1 synonymous.

Correction:
The last mutation at A18460G is also non-synonymous. All three mutations are non-synonymous

cupton · January 14, 2020, 7:51pm

HB-04 has a bunch of indels.

Use Base-By-Base to view the alignment (visual sumary shows all diffs)

trvrb · January 14, 2020, 9:02pm

Thanks for the feedback Andrew and Richard. I’ve updated https://nextstrain.org/groups/blab/sars-like-cov to split ORF1a and ORF1b. This makes it clearer how nucleotide mutations map to amino acid substitutions.

Ignoring the divergent BetaCoV/Wuhan/IVDC-HB-05/2019 sequence and masking the initial 11 bases of the alignment, we have the following 5 strains and their mutations relative to the base of the outbreak clade:

WIV04/2019 - no mutations
IVDC-HB-01/2019 - no mutations
IPBCAMS-WH-01/2019 - 3 nucleotide mutations / 2 AA changes
IVDC-HB-04/2020 - 3 nucleotide mutations / 3 AA changes (includes C1023T and C1025T which are suspect being so close together)
WH-Human_1 - 2 nucleotide mutations / 1 AA change

This alignment was stripped to map to reference https://github.com/blab/sars-like-cov/blob/master/config/sars-like-cov_reference.gb and so lacks indels.

arambaut · January 15, 2020, 12:13am

The single basepair gaps are in homopolymeric runs suggesting a sequencing platform that maybe has problems with those. There are 2 larger deletions which I assume are missing read coverage.

Kristian_Andersen · January 15, 2020, 12:16am

I get slightly different stats on the # mutations - HB-04 has some indels that need corrections. Keeping HB-01 as the reference (should maybe be WH-01 though, as that’s the oldest sequence):

IVDC-HB-01/2019: [ref]
IPBCAMS-WH-01/2019: 3 mutations (2 non-syn / 1 syn)
WIV04/2019: 0 mutations
Hu-1/2019: 1 mutation (1 non-syn)
IVDC-HB-04/2020: 2 mutations (2 non-syn) (however, I don’t believe these, so I think this should also be 0 mutations)

I agree with Trevor that the mutations in HB-04 are suspect - right next to each other, non-synonymous, close to a poly-T stretch, and this sequence also needed some manual editing for indels. I think these are probably not correct and that sequence would then also be identical.

As for IVDC-HB-05, I agree with everybody that this sequence is definitely wrong (clustering of mutations, wacky ts/tv ratio, etc). If I do my very best to eliminate sequencing errors that I have commonly observed over the years, then I get a maximum of 7 mutations in this sequence, 4 of which are non-synonymous. These 7 can’t be excluded as likely errors (unlike the other 46 mutations in this sequence), but I think they still represent a (substantial) over-estimation.

Kristian_Andersen · January 15, 2020, 12:19am

Taking what I said above at face value and making WH-01 the reference (being the earliest sequence), we have three substitutions early in the tree (2 non-synonymous, 1 synonymous), followed by one additional substitution in Hu-1 (non-synonymous).

trvrb · January 15, 2020, 6:23am

Thanks for checking this Kristian. I agree with you that the SNP at the end of Hu-1/2019 is spurious as is the SNP at 18460 in BetaCoV/Wuhan/IVDC-HB-04/2020 that appears directly adjacent to a long stretch of ambiguous bases. I’ve masked site 18460 as well as both ends of the alignment. With these changes our counts agree. Additionally, I thought it prudent to drop IVDC-HB-05 from the tree as its divergence is almost certainly spurious.

I’ve updated https://nextstrain.org/groups/blab/sars-like-cov accordingly.

cupton · January 16, 2020, 7:15pm

Or poor assembly.
Think the coverage is so low that regions are missing?

arambaut · January 17, 2020, 12:56pm

I think it would be unlikely that you would get zero coverage just for 1 basepair. The fact that they are in homopolymeric runs suggests systematic run-length errors. This is probably not Illumina data.

kihohong · January 20, 2020, 1:42am

Several sequences including Thailand cases has been added to GISAID today.
Although GISAID announced their whole genome analysis result, I am wondering if you would update your analysis, since your analysis provide much more information including diversities.
Sincerely.

trvrb · January 20, 2020, 3:18am

https://nextstrain.org/ncov has been updated with all genomes currently in GISAID.

trvrb · January 21, 2020, 6:40pm

The Zhejiang Provincial Center for Disease Control and Prevention has shared two new genomes via gisaid.org. We’ve updated https://nextstrain.org/ncov to include them in our analysis bringing total up to 15 highly related samples.

Kristian_Andersen · January 21, 2020, 10:18pm

Four more genomes were released, bringing the total to 19. Note that a couple of these look suspicious:

EPI_ISL_403928: A lot of mutations - can’t be trusted at this stage
EPI_ISL_403931: Mutations in the 5’ end that are wrong

Still not a lot of diversity.

Based on this dataset I count 17 SNPs that appear to be real and 35 that do not (this is not including indels in 402120). All SNPs are private - none of them transmitted.

trvrb · January 22, 2020, 12:17am

Thanks @Kristian_Andersen. We’ve updated https://nextstrain.org/ncov accordingly, but have also left out Wuhan/IPBCAMS-WH-05/2020 / EPI_ISL_403928 due to appearance of spurious mutations.

However, I’d note that if there’s been multiple spillover events from the animal reservoir, I would really expect to be seeing clusters of genetically distinct human cases. So, a divergent sequence by itself is an expected thing. What threw me here however, was strange clustering of mutations in Wuhan/IPBCAMS-WH-05/2020.

Kristian_Andersen · January 22, 2020, 4:31am

Yeah, we can’t a priori assume that a divergent sequence is wrong, but for all sequences in this set where there were multiple mutations, many (all) of them were clearly the result of sequencing errors. Even after taking out the most ‘offensive’ sequences, I’m still fairly certain ~50% of SNPs are sequencing errors.

Doing some quick calculations on a SARS alignment from the 2002 epidemic, I get an N/S rate of ~0.5 and a Ts/Tv of ~ 3.5 - meaning that while synonymous and/or transitions are more frequent, but maybe not as much as we would typically see with other RNA viruses (e.g., Ts/Tv typically ~8). Just as a reference for trying to weed out sequencing errors and bad sequences.

trvrb · January 23, 2020, 7:46am

https://nextstrain.org/ncov has been updated with 6 new genomes sampled from Guangdong shared by the Guangdong Provincial Center for Diseases Control and 3 new genomes sampled from Wuhan shared by Institute of Pathogen Biology, Chinese Academy of Medical Sciences.

We now observe clustering of related infections. These are a cluster of two infections in Zhuhai (Guangdong/20SF028/2020 and Guangdong/20SF040/2020) and a cluster of three infections in Shenzhen (Guangdong/20SF013/2020, Guangdong/20SF025/2020, Guangdong/20SF012/2020). These are noted in GISAID as “family cluster infection” and reflect two separate family clusters (https://twitter.com/JingLu_LuJing/status/1220143773532880896). This almost certainly represents human-to-human transmission.

david_h_oconnor · January 23, 2020, 4:12pm

Hi all,

Hopefully this isn’t too off-topic but I thought I’d put it here in case others are also interested in this application of sequences - we are looking into starting animal model work and are not sure what viruses we could get or whether we’d need to try and make our own via reverse genetics from synthetic gene constructs. Based on what we know now, is there a sense of which sequence or sequences would be best to use?

Thanks,

dave

Kristian_Andersen · January 23, 2020, 5:45pm

Hey Dave - they’re basically all identical except for a handful of SNPs, so you can pretty much pick any. Take a look at Andrew’s overall analysis here: Phylogenetic analysis of 23 nCoV-2019 genomes, 2020-01-23.

I might suggest Wuhan-Hu-1, which is on Genbank: https://www.ncbi.nlm.nih.gov/nuccore/MN908947

david_h_oconnor · January 23, 2020, 9:10pm

Great, many thanks Kristian.

dave

Brian_Foley · January 29, 2020, 2:00am

A 2013 bat Betacoronavirus (GISAID accession EPI_ISL_402131) was recently uploaded to GISAID. It is much more closely related to the 2019-2020 human outbreak viruses than the previously available bat virus genomes.

The publication of this sequence is available at the BioRxiv:

BetaCoronaviruses_56_WuhanCladePlus2013Bat_TreePDF.pdf (4.4 KB)

There is recombination between many of these viruses, so the phylogenetic trees built from complete genomes is not an accurate reflection of the evolutionary history of the various lineages.

Brian_Foley · February 2, 2020, 7:17am

A coronavirus was recently found in short read sequence data from a pangolin viromics data set. I am attaching an alignment of the complete genomes, a maximum likelihood tree built from just the spike glycoprotein gene, that does not include the pangolin sequence and a tree built from complete genomes that does include the pangolin sequence. The 2013 Yunnan bat virus sequence is still closer to nCoV than the pangolin virus sequence, but the pangolin virus is quite close. The 2013 Yunnan bat virus is removed from the alignment uploaded here, because it is GISAID data.

https://www.ncbi.nlm.nih.gov/sra/SRR10168377

SARS_SARSlikeCodonAligned.FASTA.gz (630.7 KB)

_SARSlike_PlusWuhan_YunnanSPIKE_CodonAlignedTreePDF.pdf (7.1 KB)

BetaCoronaviruses_114_WuhanCladeHandAlignedPlusPangolin2_IQtreePDF.pdf (7.2 KB)

Credit for the discovery of the Pangolin virus sequence:

Brian_Foley · February 5, 2020, 4:42pm

The 2013 Yunnan bat Coronavirus genome sequence is now available from GenBank, with accession number MN996532. So I am now uploading an alignment with this sequence included.

113_SARS_CodonAlignedRenamed_IQtreeOrder.FASTA.gz (597.5 KB)