Phylodynamic Analysis | 115 genomes | 20 Feb 2020

Phylogenetic analysis of nCoV-2019 genomes

Andrew Rambaut, University of Edinburgh, Edinburgh UK

This is a brief report outlining a simple phylogenetic analysis of publicly shared genome sequences. It gives some preliminary findings for information purposes is not intended as an academic work. All the data used here is provided by the laboratories listed below through NCBI or GISAID.

Phylogenetic analysis

This analyses uses 115 full-length genomes kindly made available on the GISAID and NCBI Genbank platforms. Some sequences were omitted as they were too short, contain sequencing artefacts, were resequencing of the same sample or had insufficient associated information. By ‘sequencing artefacts’ I mean apparent nucleotide variation that is probably due to the process of sequencing or bioinformatics pipelines - most likely due to the difficulty of sequencing samples with low concentrations of virus. Removing these sequences (or in some cases masking out the artefacts) is to avoid potential biases these may cause.

Acknowledgements and details of the genome sequences used in this analysis are given in Table 6 at the end of this document.

The phylogenetic tree of the currently available complete genomes is given in Figure 1. This shows that there is limited genetic variation in the currently sampled viruses but more recent ones are showing more divergence as is expected for fast evolving RNA viruses. But the lack of diversity is indicative of a relatively recent common ancestor for all these viruses.

Figure 1 | Maximum likelihood tree of nCoV2019 genomes constructed using PhyML [1]. The tree is rooted using the oldest sequence but this is an arbitrary choice. Interactive tree figure by @john.mccrone using figtree.js.

Download the tree file from here.

Estimating the date of origin

The software package BEAST [2,3] was used to estimate the date of the most recent common ancestor (MRCA) of the currently available genomes. The MRCA represents the point where the ancestral virus of all the sampled cases was in the same host (whether this was a non-human animal or a human). The rate of evolution of the virus is estimated from the data.

A coalescent model is used as the prior on the tree with two different variants being used - the constant size and exponential growth models. Because the coalescent model assumes that we have a small random sample from a large population, only a single representative genome of any known epidemiologically-linked transmission clusters was included. This leaves 75 genomes in the analysis.

Coalescent model Estimated rate 95% interval
Constant size NA
Exponential growth 0.92x10-3 0.33x10-3 – 1.46x10-3

Table 1 | The estimated rate of evolution (substitutions per site per year) of the sampled nCoV-2019 genomes. Note that recent data strongly supports a model of growth so in the interest of brevity I am no longer reporting the constant size model.

Coalescent model Estimated TMRCA 95% interval
Constant size NA
Exponential growth 29-Nov-2019 28-Oct-2019 – 20-Dec-2019

Table 2 | The estimated date of the MRCA of the sampled nCoV-2019 genomes.

The estimated dates for the most recent common ancestor (and the 95% credible interval) are compatible with the TMRCA from the beginning to middle of December (Table 2). The earliest reported date of symptom onset for the initial cluster of pneumonia cases was 1st December 2019 [4].

Exponential Growth

The exponential growth coalescent model can also infer a growth rate from the genome data, independent of any epidemiological information and from this calculate the doubling time (Table 3).

mean 95% interval
growth rate (/year) 41.03 20.56 – 62.17
doubling time (days) 6.2 4.1 – 12.3

Table 3 | The estimated growth rate and doubling time.


Most estimates since the first data was produced have produced a date that is consistent with the epidemiological reports of a first cluster of cases in December at a seafood market in Wuhan City. As more genomes have been generated, this is looking increasingly supported by the timing of the MRCA of the sampled genomes.

Although this is increasingly obvious from the reports of spread and numbers of confirmed cases and definitive evidence of human to human transmission, the genome sequence data shows no evidence that any non-human animal reservoir has been involved in generating new cases after the initial zoonotic event. If cases in January had been the result of new zoonotic jumps from a reservoir, we would expect those genomes to be more divergent and lie outside the observed diversity to that point.

We can conclude therefore, based on available genome sequence data that the current epidemic has been driven entirely by human to human transmission at least since December. Although this virus may still exist in one or more non-human animal species this will not be of major consequence until the current human epidemic is brought under control.

Caveats for the analysis

The number of genetic differences in the genomes is close to the error rate of the sequencing process. Some of the observed differences may be artefacts of this process in which case the genomes are more similar to each other. Some certain errors have been removed but others may still exist. As more data becomes available, these will increasingly become irrelevant.

The date estimates for the TMRCA is averaged over many plausible phylogenetic reconstructions of the genome data. So the reported credible intervals account for uncertainty in many aspects of the model and are, as a consequence, quite broad. We expect these to narrow as data is added.

Related analyses

@louis.duplessis has also done some careful analysis of these data under constant size, exponential growth and a non-parametric skyline model. He declines to report a growth rate at this stage, but does some useful tests of temporal signal in the data.

@trvrb has taken this analysis a stage further and used the growth rate and N0 parameters to estimate incidence and prevalence:


Priors used on parameters in analysis

parameter prior
evolutionary rate CTMC rate reference prior
kappa lognormal[1, 1.25]
frequencies Dirichlet[1,1]
alpha exponential[0.5]
popsize lognormal[1,2]
growth rate Laplace[100]

Epidemiologically-linked clusters

These are the known clusters of epidemiologically-linked cases. The * denotes the sequence used to represent the cluster in the coalescent analysis.

Previous estimates of evolutionary rates for human coronaviruses

The time of the most recent common ancestor depends on the rate of evolution. There is now sufficient information in the sampled data to infer this. The estimate reported is generally compatible with estimates made for other coronaviruses in the literature (Table 2).

Virus Estimated rate Reference
SARS-CoV 0.80 – 2.38 Zhao et al. 2004 [5]
MERS-CoV 0.63 [0.14 – 1·1] Cotten et al. 2013 [6]
1.12 [0.88 – 1.37] Cotten et al. 2014 [7]
0.96 [0.83 − 1.09] Dudas et al. 2018 [8]
HCoV-OC43 0.43 [0.27 – 0.60] Vijgen et al. 2005 [9]

Table 5 | Evolutionary rate estimates (x10-3 subst/site/year) of human coronaviruses


  1. Guindon S, Gascuel O. A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood. Syst Biol. 2003;52: 696–704.

  2. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian Phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol. 2012;29: 1969–1973.

  3. Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ, Rambaut A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 2018;4: vey016.

  4. Huang C et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020;

  5. Zhao Z, Li H, Wu X, Zhong Y, Zhang K, Zhang Y-P, et al. Moderate mutation rate in the SARS coronavirus genome and its implications. BMC Evol Biol. 2004;4: 21.

  6. Cotten M, Watson SJ, Kellam P, Al-Rabeeah AA, Makhdoom HQ, Assiri A, et al. Transmission and evolution of the Middle East Respiratory Syndrome Coronavirus in Saudi Arabia: a descriptive genomic study. Lancet. 2013;382: 1993–2002.

  7. Cotten M, Watson SJ, Zumla AI, Makhdoom HQ, Palser AL, Ong SH, et al. Spread, Circulation, and Evolution of the Middle East Respiratory Syndrome Coronavirus. MBio. 2014;5: e01062–13.

  8. Dudas G, Carvalho LM, Rambaut A, Bedford T. MERS-CoV spillover at the camel-human interface. Elife. 2018;7. doi:(

  9. Vijgen L, Keyaerts E, Moës E, Thoelen I, Wollants E, Lemey P, et al. Complete genomic sequence of human coronavirus OC43: molecular clock analysis suggests a relatively recent zoonotic coronavirus transmission event. J Virol. 2005;79: 1595–1604.

Genome Data Acknowledgements

Table 6 | nCoV2019 genome sequences used in this analysis, the GISAID accession numbers and submitting labs.


As of Feb 1st (AEST), using the 51 genomes from GISAID the regression looks somewhat clearer, albeit with two outliers that I have flagged here. regression_31_01_2020.pdf (39.2 KB)

There are two genomes with substantial sequencing errors and are omitted in my analysis: EPI_ISL_406592, EPI_ISL_406595
One is your extreme outlier.

Great. Thanks for flagging this. I just realigned and made the tree again:
regression_31_01_2020.pdf (8.9 KB)

Hey Sebastian, I think there may be a mistake in your alignment. You’re rooting on EPI_ISL_403931 and on your regression the root-to-tip distance to EPI_ISL_40931 is a lot closer than any other tip on your tree (it’s one of your two outliers). On my old analysis on 30 genomes, EPI_ISL_403931 is identical to 11 other genomes (attached), so your tree should actually be rooted on a polytomy.

In the alignment I used (prepared by Kristian Andersen) there are 6 probable sequencing errors on EPI_ISL_403931 that are masked (marked as ambiguous sites).

The placement of EPI_ISL_402120 is also different on your tree - I also had it on the big polytomy, you have it on a relatively long branch with several unique mutations. On EPI_ISL_402120 Kristian’s alignment had 96 ambiguous sites (probable sequencing errors, on EPI_ISL_402120 many of these were indels that broke ORFs).

Hello Louis,
Yes, I was just going through it now. It is quite difficult to tease out sequencing errors. When I remove EPI_ISL_403931 and EPI_ISL_402120 I get a tree that is rooted similar to yours:

. There are still some long branches that I am not quite sure about. Note that EPI_ISL_406034 is the Californian one that Andrew has in his regression sitting at the top.

You can click on the branches in the tree at the top of the page to try different rootings.

The Los Angeles one is an outlier but there is nothing suspicious about its mutations. It is not effecting the rate estimates too much.

Thanks Andrew and Louis, very helpful! Several branch lengths are quite sensitive to a few potential sequencing errors in the tails of the alignment. I masked those now and get similar estimates to what you have above (there is just one more sequence here):

I’m worried about the effects of non-random sampling on estimated growth rates.
The coalescent will account for variation in sampling rates through time, effectively conditioning on time of sampling, and the distribution of samples is indeed weighted towards the beginning of the epidemic relative to numbers currently infected. The coalescent will not adjust oversampling of households, transmission pairs, travelers etc. and other groups that are identified by contact tracing rather than random sampling. Inclusion of the French pair is one obvious example.

What bias should you expect? A transmission pair sampled close to the present will coalesce in the recent past, reducing effective population size estimates and decreasing estimated growth rates.

I have tried fitting exponential growth coalescent with and without removing ‘phylodynamically redundant’ (just coined that) samples. Removing some samples does indeed seem to increase estimated growth rates. Unfortunately selecting samples to remove is ad-hoc. Here are some sets of redundant samples that I have identified, but there is a lot of guesswork involved. Feedback would be welcome:

 c( 'Guangdong/20SF040/2020', 'Guangdong/20SF028/2020' , 'Guangdong/20SF174/2020') # family , identical 
, c( 'HKU-SZ-002a_2020', 'HKU-SZ-005b_2020' ) # family but _not identical_
, c('France/IDF0373/2020', 'France/IDF0372/2020') #identical
, c('Foshan/20SF210/2020', 'Foshan/20SF211/2020' ) # guessing based on location, identical
, c( 'Nonthaburi/74/2020', 'Nonthaburi/61/2020' ) #thailand  identical 
, c( 'WHU01', 'WHU02' ) #identical, same time 
, c('Wuhan/WIV04/2019', 'Wuhan/WIV06/2019' ) # identical,  guessing based on time and id
, c( 'Guangdong/20SF012/2020', 'Guangdong/20SF025/2020', 'Guangdong/20SF013/2020' )

I have tried to identify all the known transmission clusters outside of Wuhan and only include a representative one. See the updated posting for results.