Phylodynamic estimation of incidence and prevalence of novel coronavirus (nCoV) infections through time

This post is mirrored from I was liking GitHub for this as it allows analysis to be versioned. But would love to have discussion in this thread.

Phylodynamic estimation of incidence and prevalence of novel coronavirus (nCoV) infections through time

Trevor Bedford1

1Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA


Here, we use a phylodynamic approach incorporating 53 publicly available novel coronavirus (nCoV) genomes to the estimate underlying incidence and prevalence of the epidemic. This approach uses estimates of the rate of coalescence through time to infer underlying viral population size and then uses assumptions of serial interval and heterogeneity of transmission to provide estimates of incidence and prevalence. We estimate an exponential doubling time of 7.2 (95% CI 5.0-12.9) days. We arrive at a median estimate of the total cumulative number of worldwide infections as of Feb 8, 2020, of 55,800 with a 95% uncertainty interval of 17,500 to 194,400. Importantly, this approach uses genome data from local and international cases and does not rely on case reporting within China.


Here, we use 53 publicly available nCoV genomes collected between 24 Dec, 2019 and 4 Feb, 2020. These represent cases sequenced from all over the world. There are genomes available from clusters that we did not incorporate here as they are expected to interfere with phylodynamic analysis (coalescent models assume that infections are sampled randomly from the infected population). For this analysis, we removed all but one of each reported cluster. In this case, each international sample can be considered a direct export from the epidemic within China and we can ignore most spatial considerations.


Here, we followed Andrew Rambaut’s work on and use the software package BEAST v1.10.4 to infer viral evolutionary dynamics. We began by running the Nextstrain nCov pipeline to align sequences and mask spurious SNPs. We took the output file masked.fasta as the starting point for this analysis. We loaded this alignment into BEAST and specified an evolutionary model to estimate:

  • strict molecular clock (CTMC rate reference prior)
  • exponential growth rate (Laplace prior with scale 100)
  • effective population size at time of most recent sampled tip (uniform prior between 0 and 10)

We followed Andrew in using a gamma distributed HKY nucleotide substitution model. MCMC was run for 50M steps, discarding the first 10M as burnin and sampling every 30,000 steps after this to give a dataset of 1335 MCMC samples.

The file ncov.xml contains the entire BEAST model specification. To run it will require filling in sequence data; we are not allowed to reshare this data according to GISAID Terms and Conditions. The Mathematica notebook ncov-phylodynamics.nb contains code to analyze resulting BEAST output in ncov.log and plot figures.


Rate and TMRCA

We find substitution rate consistent with previous work of 0.9 × 10-3 (95% CI 0.5-1.4 × 10-3) substitutions per site per year. We find a median TMRCA of 3 Dec (95% CI 30 Oct to 17 Dec).

Effective population size and exponential growth rate

These phylodynamic approaches can estimate effective size of the virus population by examining rates of coalescence through time. Here, we estimated the exponential growth rate as 35.4 (95% CI 9.6-50.0) per year. This translates to a doubling time of 7.2 (95% CI 5.0-12.9) days. This coincides closely with doubling time reported by modeling groups looking at reported cases in China (Wu et al).

Here, we plot timescale of coalescence N_e \tau through time:

N_e \tau is what is directly measured by phylodynamic methods and is measured in years. Here, it can be seen that coalescence is slowing down, so that pairs of lineages on 1 Feb coalesce at a rate of 1 event 10+ years.


The quantity of N_e \tau can be translated into effective population size N_e by dividing out generation time \tau. We assume generation time \tau to be 7.5 days following Li et al. Additionally, effective population size N_e can be translated into prevalence with knowledge of the variance in offspring distribution. High variance in distribution of secondary cases reduces prevalence relative to N_e as described by Volz et al. This reduction is equal to

\sigma^2 = \frac{1}{E[R_0]} + \frac{1}{k} + 1,

where E[R_0] is the mean number of secondary cases and k is the dispersion parameter of secondary cases. We assume E[R_0] to be between 1.8 and 2.8 following Wu et al and others. We assume that variance of secondary cases is at most like SARS with superspreading dynamics with k=0.15, but allow for less variance with k=0.30 (Riou and Althaus). Thus, we convert BEAST estimates of N_e \tau to point prevalence I by following I = N_e \tau \times \sigma^2 / \tau.

We arrive at the following estimate of prevalence through time:

We estimate a median prevalence on 8 Feb of 28,500 currently infected with a 95% uncertainty interval of between 7500 and 104,300 currently infected.

Total incidence

We estimate incidence in each serial interval and then calculate a cumulative incidence total:

We estimate a median total incidence on 8 Feb of 55,800 total infections since start of epidemic with a 95% uncertainty interval of between 17,500 and 194,400 total infections. On Feb 8, there were 34,886 total cases reported (WHO Sit Rep 19). Importantly, this approach uses genome data from local and international cases and does not rely on case reporting within China.

Our phylodynamic approach estimates an infection-to-case reporting rate of between 18% and 100%. Although, there are obviously wide uncertainty intervals to these estimates, we believe it is safe to conclude that case reporting is largely in line with expectations given spectrum of disease severity. These numbers are consistent with WHO reports of mild cases constituting 82% of the epidemic.


The nCoV genomes used in this analysis were generously shared by scientists at the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, Shanghai, China, at the National Institute for Viral Disease Control and Prevention, China CDC, Beijing, China, at the Institute of Pathogen Biology, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China, at the Wuhan Institute of Virology, Chinese Academy of Sciences, Wuhan, China, at the Department of Microbiology, Zhejiang Provincial Center for Disease Control and Prevention, Hangzhou, China, at the Guangdong Provincial Center for Diseases Control and Prevention at the Department of Medical Sciences, at the Shenzhen Key Laboratory of Pathogen and Immunity, Shenzhen, China, at the Hangzhou Center for Disease and Control Microbiology Lab, Zhejiang, China, at the National Institute of Health, Nonthaburi, Thailand, at the National Institute of Infectious Diseases, Tokyo, Japan, at the Korea Centers for Disease Control & Prevention, Cheongju, Korea, at the National Public Health Laboratory, Singapore, at the US Centers for Disease Control and Prevention, Atlanta, USA, at the Institut Pasteur, Paris, France, at the Respiratory Virus Unit, Microbiology Services Colindale, Public Health England, and at the Department of Virology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland, and at the University of Melbourne, Peter Doherty Institute for Infection and Immunity, Melbourne, Australia, at the Victorian Infectious Disease Reference Laboratory, Melbourne, Australia, at the Public Health Virology Laboratory, Brisbane, Australia and at the Institute of Clinical Pathology and Medical Research, University of Sydney, Westmead, Australia via GISAID. We gratefully acknowledge the Authors, Originating and Submitting laboratories of the genetic sequence and metadata made available through GISAID on which this research is based.

Provenance of originating labs, submitting labs and authors is available here.

I’m not convinced that this is the correct transformation for going from Ne(t)-> I(t).
The formula used here is based on the paper by Koelle et al and was derived under epidemic equilibrium. A different formula applies under exponential growth. See here:

Using that transformation would give you much bigger I(t), and FWIW I would also find that more realistic.

And in case it’s useful, I have also derived how it depends on the variance in transmission rates, which you can use as an alternative to the offspring distribution.

Ne = \frac{I(t) m_1}{2 m_2}

where m_i is the i’th moment of a distribution of transmission rates.
When the transmission rate is constant \beta that reduces to

Ne = I(t) / (2 \beta)

which is the same as in Volz 2012. You could see what you get if you plug in a rate of about 108 transmissions/year which corresponds to R0=2.5 with a generation time of 8.4 days. It would be a lot bigger.

Another potential problem is related to the upper bound of Ne=10. I looked through the logs and it looks like the posterior bumps up against the boundary which would curtail the upper bound of case estimates.

Thanks so much for comments @erik.volz! This is super helpful. I’ll revise with new data and this more accurate math.