Preliminary phylogenetic analysis of 11 nCoV2019 genomes, 2020-01-19

This is a brief report outlining some phylogenetic analysis of the initial genome sequences. It gives some preliminary findings for information purposes and is not intended for publication as an academic work. All the data used here were provided by the laboratories listed below through NCBI or GISAID.

Available genome data

One annotated genome has been released on GenBank by Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China:

This was the first genome released but it has been updated a few times as resequencing was performed, particularly focusing on the start and end of the genome. It is likely that this is a reliable genome sequence but there is insufficient epidemiological information for it to be useful here (there is no exact date of sample collection and it is unclear if the sample is from the same patient as one of the other genomes).

As of 19-Jan-2020, 13 other genome sequences have been released on GISAID, originating from 6 different labs.

Accession       Strain                            Location                             Collection date  Lab
EPI_ISL_402132  BetaCoV/Wuhan/HBCDC-HB-01/2019    China / Hubei Province / Wuhan City  2019-12-30       [1]
EPI_ISL_402127  BetaCoV/Wuhan/WIV02/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402128  BetaCoV/Wuhan/WIV05/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402129  BetaCoV/Wuhan/WIV06/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402130  BetaCoV/Wuhan/WIV07/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402126  BetaCoV/Kanagawa/1/2020           Kanagawa Prefecture, Japan           2020-01-14       [3]
EPI_ISL_403963  BetaCoV/Nonthaburi/74/2020        Thailand / Nonthaburi Province       2020-01-13       [4]
EPI_ISL_403962  BetaCoV/Nonthaburi/61/2020        Thailand / Nonthaburi Province       2020-01-08       [4]
EPI_ISL_402120  BetaCoV/Wuhan/IVDC-HB-04/2020     China / Hubei Province / Wuhan City  2020-01-01       [5]
EPI_ISL_402119  BetaCoV/Wuhan/IVDC-HB-01/2019     China / Hubei Province / Wuhan City  2019-12-30       [5]
EPI_ISL_402121  BetaCoV/Wuhan/IVDC-HB-05/2019     China / Hubei Province / Wuhan City  2019-12-30       [5]
EPI_ISL_402124  BetaCoV/Wuhan/WIV04/2019          China / Hubei Province / Wuhan City  2019-12-30       [2]
EPI_ISL_402123  BetaCoV/Wuhan/IPBCAMS-WH-01/2019  China / Hubei Province / Wuhan City  2019-12-24       [6]

[1] Wuhan Jinyintan Hospital & Hubei Provincial Center for Disease Control and Prevention, China

[2] Wuhan Jinyintan Hospital & Wuhan Institute of Virology, Chinese Academy of Sciences, China

[3] Dept. of Virology III, National Institute of Infectious Diseases, Japan

[4] Bamrasnaradura Hospital & Department of Medical Sciences, Ministry of Public Health, Thailand

[5] National Institute for Viral Disease Control and Prevention, China CDC, China

[6] Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, China

Table 1 | Available nCoV2019 genome sequences

Ten genomes are from Wuhan City in Hubei Province, China, with samples collected between 24-Dec-2019 and 01-Jan-2020. Two genomes are from patients in Thailand who had recently travelled from Wuhan. One sequence is from a patient in Japan who had also travelled from Wuhan, but this is a short fragment of the genome (369 nucleotides long) and is not included in this analysis. One genome, ‘BetaCoV/Wuhan/IVDC-HB-04/2020’, has evidence of sequencing artefacts and is excluded from the analysis.
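The selection of genomes described above can be written down as a small filter over Table 1. This is only an illustrative sketch: the record layout, the `is_fragment` and `has_artefacts` flags and the function name are assumptions made for this example and are not part of the GISAID metadata.

```python
from collections import namedtuple

# Hypothetical record layout; the two boolean flags are assumptions of this sketch.
Genome = namedtuple("Genome", "accession strain collected is_fragment has_artefacts")

# Rows copied from Table 1; flags set from the exclusions described in the text.
TABLE_1 = [
    Genome("EPI_ISL_402132", "BetaCoV/Wuhan/HBCDC-HB-01/2019", "2019-12-30", False, False),
    Genome("EPI_ISL_402127", "BetaCoV/Wuhan/WIV02/2019", "2019-12-30", False, False),
    Genome("EPI_ISL_402128", "BetaCoV/Wuhan/WIV05/2019", "2019-12-30", False, False),
    Genome("EPI_ISL_402129", "BetaCoV/Wuhan/WIV06/2019", "2019-12-30", False, False),
    Genome("EPI_ISL_402130", "BetaCoV/Wuhan/WIV07/2019", "2019-12-30", False, False),
    Genome("EPI_ISL_402126", "BetaCoV/Kanagawa/1/2020", "2020-01-14", True, False),
    Genome("EPI_ISL_403963", "BetaCoV/Nonthaburi/74/2020", "2020-01-13", False, False),
    Genome("EPI_ISL_403962", "BetaCoV/Nonthaburi/61/2020", "2020-01-08", False, False),
    Genome("EPI_ISL_402120", "BetaCoV/Wuhan/IVDC-HB-04/2020", "2020-01-01", False, True),
    Genome("EPI_ISL_402119", "BetaCoV/Wuhan/IVDC-HB-01/2019", "2019-12-30", False, False),
    Genome("EPI_ISL_402121", "BetaCoV/Wuhan/IVDC-HB-05/2019", "2019-12-30", False, False),
    Genome("EPI_ISL_402124", "BetaCoV/Wuhan/WIV04/2019", "2019-12-30", False, False),
    Genome("EPI_ISL_402123", "BetaCoV/Wuhan/IPBCAMS-WH-01/2019", "2019-12-24", False, False),
]


def select_for_analysis(genomes):
    """Drop partial fragments and genomes flagged for sequencing artefacts."""
    return [g for g in genomes if not (g.is_fragment or g.has_artefacts)]


selected = select_for_analysis(TABLE_1)
print(len(TABLE_1), "released,", len(selected), "analysed")  # 13 released, 11 analysed
```

Keeping the exclusion reasons as explicit flags, rather than deleting rows, makes it easy to re-run the analysis if a fragment is later replaced by a full genome.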

Phylogenetic analysis

The phylogenetic tree of the remaining 11 complete genomes is given in Figure 1. This shows that there is very limited genetic variation in the currently sampled viruses in Wuhan (3 are identical, the others have 1, 2 or 3 differences from these). This is indicative of a relatively recent common ancestor for all these viruses.
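Counts of differences like those quoted above can be obtained by comparing aligned genomes site by site. The sketch below is a minimal illustration, assuming the sequences are already aligned to equal length; the toy sequences and names are invented for the example and are not the real genome data.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Count sites that differ between two aligned sequences of equal length.

    Sites where either sequence has a gap ('-') or an ambiguous base ('N')
    are skipped, so missing data is not counted as a difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"-", "N"}
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in skip and b not in skip
    )


# Toy alignment, invented for illustration only (not real genome data).
alignment = {
    "genome_A": "ACGTACGTAC",
    "genome_B": "ACGTACGTAC",
    "genome_C": "ACGTACGAAC",
    "genome_D": "ACNTACGAAT",
}

# Number of differences for every pair of genomes in the alignment.
pairwise = {
    (x, y): count_differences(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}
for pair, n in pairwise.items():
    print(pair, n)
```

Skipping gaps and ambiguous bases matters here because, with so few real differences, a handful of low-coverage sites could otherwise dominate the counts.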

Figure 1 | Maximum likelihood tree of 11 nCoV2019 genomes. Genomes in blue are from Thailand. The tree is rooted using the oldest sequence, but this is an arbitrary choice. Constructed using PhyML [1].

Thailand announced positive tests for two apparently independent travellers from Wuhan. They were reported as not having visited the seafood market that had been associated with some of the early cases in Wuhan and as having no reported epidemiological links with any of the known cases. We might therefore expect these individuals to have been infected with a random representative of the diversity of viruses circulating in Wuhan (either through exposure to a non-human source or through human-to-human transmission from other infected people). Considering how similar these two virus genomes are to the samples from Wuhan may be informative about how diverse the population of viruses is.

The two genomes sampled in Thailand are genetically identical to three of the genomes sampled in Wuhan on 30-Dec-2019. This suggests that the (very limited) diversity present amongst the sampled and sequenced Wuhan cases is representative of the overall diversity of the outbreak.

Virus      Estimated rate (x10^-3 subst/site/year)  Reference
SARS-CoV   0.80 – 2.38                              Zhao et al. 2004 [2]
MERS-CoV   0.63 [0.14 – 1.1]                        Cotten et al. 2013 [3]
MERS-CoV   1.12 [0.88 – 1.37]                       Cotten et al. 2014 [4]
MERS-CoV   0.96 [0.83 – 1.09]                       Dudas et al. 2018 [5]
HCoV-OC43  0.43 [0.27 – 0.60]                       Vijgen et al. 2005 [6]

Table 2 | Evolutionary rate estimates of human coronaviruses

To estimate the time of the most recent common ancestor (TMRCA) of the currently sampled viruses (including the ones from Thailand), I used the Bayesian phylogenetic software package BEAST [7]. With the available data it is not possible to estimate the rate of evolution of the virus, so I used two assumed values: 1x10^-3 substitutions per site per year (a reasonable expected rate of evolution for an acute RNA virus) and 0.5x10^-3. These values approximately span the range of rate estimates for other human coronaviruses shown in Table 2.
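To give a feel for what these assumed rates mean at the genome scale, the following back-of-envelope sketch converts a per-site rate into expected substitutions per genome. It is not the BEAST analysis itself. The genome length of about 29,900 nucleotides is an assumption made for this example (it is not stated in the report), and the function names are hypothetical.

```python
# Assumed approximate genome length in nucleotides (not stated in the report).
GENOME_LENGTH = 29_900
DAYS_PER_YEAR = 365.25


def expected_substitutions(rate_per_site_per_year, days, genome_length=GENOME_LENGTH):
    """Expected substitutions along a single lineage under a strict clock."""
    return rate_per_site_per_year * genome_length * days / DAYS_PER_YEAR


def days_per_substitution(rate_per_site_per_year, genome_length=GENOME_LENGTH):
    """Average waiting time, in days, between substitutions on one lineage."""
    return DAYS_PER_YEAR / (rate_per_site_per_year * genome_length)


for rate in (1e-3, 0.5e-3):
    print(
        f"rate {rate:g}: "
        f"{expected_substitutions(rate, 30):.2f} substitutions in 30 days, "
        f"one every {days_per_substitution(rate):.1f} days"
    )
```

Under these assumptions the two rates correspond to roughly one substitution per lineage every 12 days and every 24 days respectively, which is why genomes that differ at only 0 to 3 sites point to a common ancestor only weeks to a few months before sampling.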

The estimated dates for the most recent common ancestor (and the 95% credible intervals) are:

Assumed rate  Estimated date of MRCA  95% interval
1x10^-3       27-Nov-2019             06-Nov-2019 – 16-Dec-2019
0.5x10^-3     28-Oct-2019             13-Sep-2019 – 03-Dec-2019

Both 95% intervals include the first days of December, so both estimates are compatible with a TMRCA at the beginning of December.
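The compatibility check can be made explicit with a few lines of date arithmetic. In this sketch the point estimates and intervals are copied from the table above and the earliest collection date from Table 1. Taking 01-Dec-2019 as "the beginning of December" is an assumption made for the example, as are the variable and function names.

```python
from datetime import date

# Point estimate, lower bound and upper bound of the 95% interval, per assumed rate.
ESTIMATES = {
    "1x10^-3": (date(2019, 11, 27), date(2019, 11, 6), date(2019, 12, 16)),
    "0.5x10^-3": (date(2019, 10, 28), date(2019, 9, 13), date(2019, 12, 3)),
}
EARLIEST_SAMPLE = date(2019, 12, 24)  # earliest collection date in Table 1
CANDIDATE = date(2019, 12, 1)  # assumed stand-in for "the beginning of December"


def within(lower, upper, day):
    """True if `day` falls inside the closed interval [lower, upper]."""
    return lower <= day <= upper


for rate, (point, lower, upper) in ESTIMATES.items():
    gap = (EARLIEST_SAMPLE - point).days
    print(rate, "point estimate is", gap, "days before the earliest sample;",
          "interval includes 01-Dec-2019:", within(lower, upper, CANDIDATE))
```

The point estimates fall 27 and 57 days before the earliest sampled genome, and both intervals contain 01-Dec-2019, although for the slower rate that date sits close to the upper bound.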


Interpretation

From the available data it is not possible to tell whether the TMRCA of the sampled cases was in a human or a non-human animal (the reservoir).

The sampled human viruses may be the result of multiple independent zoonotic introductions from a non-human animal source, of a few introductions followed by limited human-to-human transmission, or of a single introduction into the human population followed by spread. Determining which of these scenarios is more likely will depend on the assessment of other information (dates of onset, locations of likely non-human animal sources, epidemiological links between cases).

However, the phylogenetic data thus far suggests that the jump or jumps from non-human animals occurred relatively soon before the earliest identified cases. If multiple zoonotic jumps occurred, these did not come from a virus reservoir that was genetically diverse. That, in turn, would suggest that the virus had only recently become established in the direct non-human source or that the initial human patients had been exposed to a non-human animal source that had a genetically limited population of viruses. This might be the case if one or a group of infected animals had been brought into Wuhan city from elsewhere and was in a position to expose multiple individuals.


Caveats

The number of genetic differences in the genomes is close to the error rate of the sequencing process. Some of the observed differences may be artefacts of this process.

The evolutionary rates used to estimate the TMRCA are supposed to represent a plausible range based on previous estimates for other human coronaviruses.

The samples from Wuhan were likely collected as part of the initial investigation of the outbreak centred on the seafood market. This may have resulted in sampling of epidemiologically linked cases that are not representative of the outbreak within the human population. But the high degree of similarity with the two Thailand cases, and the absence of any reported link between these cases and the Wuhan cases, suggest that this is not so, for the reasons outlined above.

The date estimates for the TMRCA are averaged over many plausible phylogenetic reconstructions of the genome data, as there is insufficient information in the data to reconstruct any single time-calibrated tree.


References

  1. Guindon S, Gascuel O. A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Syst Biol. 2003;52: 696–704.

  2. Zhao Z, Li H, Wu X, Zhong Y, Zhang K, Zhang Y-P, et al. Moderate mutation rate in the SARS coronavirus genome and its implications. BMC Evol Biol. 2004;4: 21.

  3. Cotten M, Watson SJ, Kellam P, Al-Rabeeah AA, Makhdoom HQ, Assiri A, et al. Transmission and evolution of the Middle East Respiratory Syndrome Coronavirus in Saudi Arabia: a descriptive genomic study. Lancet. 2013;382: 1993–2002.

  4. Cotten M, Watson SJ, Zumla AI, Makhdoom HQ, Palser AL, Ong SH, et al. Spread, Circulation, and Evolution of the Middle East Respiratory Syndrome Coronavirus. mBio. 2014;5: e01062–13.

  5. Dudas G, Carvalho LM, Rambaut A, Bedford T. MERS-CoV spillover at the camel-human interface. eLife. 2018;7. doi:10.7554/eLife.31257

  6. Vijgen L, Keyaerts E, Moës E, Thoelen I, Wollants E, Lemey P, et al. Complete genomic sequence of human coronavirus OC43: molecular clock analysis suggests a relatively recent zoonotic coronavirus transmission event. J Virol. 2005;79: 1595–1604.

  7. Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 2018;4: vey016.


Thanks @arambaut - this is great. Since I believe ~ 50% of the diversity in the tree comes from sequencing errors, the TMRCAs would likely be even more recent - possibly pushing the interval towards the end of December. That means that the outbreak was detected almost immediately after the first case, which - given that this is flu season in China - is just amazing. Detecting an outbreak of pneumonia (similar to flu) of a novel coronavirus that fast is truly impressive.