SARS-CoV-2 Samples from the Same Early COVID-19 Patients Were Sequenced Repeatedly, with Errors Distorting Phylogenetic Trees

When the bat coronavirus RaTG13 is used as the outgroup in a SARS-CoV-2 phylogenetic study, the resulting tree is rooted near a virus strain isolated from the first patient in the United States [1]. The patient had traveled to and from Wuhan, but the strain has not yet been found in the city. It was postulated that sequencing efforts were lacking in the early days of the outbreak, and that newer strains gained a competitive advantage later on. The hypothesis has aroused substantial public interest and debate. Hence it is worthwhile to understand the availability and quality of viral genomic sequences from early patients.

In this exercise, I examined SARS-CoV-2 genomic sequences from the China National Center for Bioinformation (CNCB) [2]. It incorporates data from many sources, including pointers to GISAID, and offers convenient data download and a user-friendly visualization of haplotype trees, as shown in Fig 1.

The CNCB site returned 24 complete SARS-CoV-2 genomic sequences when limited to samples collected on or before Jan 1, 2020 in Wuhan, with data released by March 8, 2020. Only 41 confirmed SARS-CoV-2 patients had been publicly reported by Jan 1, 2020 [3]. Hence age and gender are mostly sufficient for matching sequences to a patient, unless contradicted by independently published information. GISAID was used to retrieve the patient age, gender, sequencing platform and assembly method for each sequence. The combined information is tabulated in Fig 2. Excluded from the study are the grey row, for an incomplete genome, and the italic rows, for suspected duplicated submissions with identical end indel variants. A ‘U’ in the ‘AgeGender’ column means Unknown. 41M1 and 41M2 denote two separate 41-year-old male patients, differentiated by admission date and by whether they worked at the Huanan Seafood Wholesale Market, where the outbreak had started [4, 5].
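
The matching step can be illustrated with a minimal Python sketch. The record fields (`accession`, `age_gender`) and the placeholder identifiers are assumptions made for illustration only; they are not the actual CNCB or GISAID field names or accessions. Cases such as 41M1 and 41M2 are assumed to have been split by hand beforehand, since age and gender alone cannot separate them.

```python
from collections import defaultdict


def group_by_patient(records):
    """Group sequence accessions by an age-gender key such as '49F'.

    Records whose key is 'U' (unknown) cannot be matched and are skipped.
    """
    groups = defaultdict(list)
    for rec in records:
        key = rec["age_gender"]
        if key == "U":
            continue
        groups[key].append(rec["accession"])
    return dict(groups)


def repeatedly_sequenced(groups):
    """Keep only the patients that contributed more than one sequence."""
    return {key: accs for key, accs in groups.items() if len(accs) > 1}


# Hypothetical records; the accession labels are placeholders, not real IDs.
records = [
    {"accession": "SEQ-A", "age_gender": "49F"},
    {"accession": "SEQ-B", "age_gender": "49F"},
    {"accession": "SEQ-C", "age_gender": "41M1"},
    {"accession": "SEQ-D", "age_gender": "U"},
]
groups = group_by_patient(records)
repeats = repeatedly_sequenced(groups)
```

With these toy records, only the `49F` key ends up with more than one sequence.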

The SARS-CoV-2 sample isolated from a 49-year-old female (49F) patient was sequenced at least 5 times, and samples from three other individuals (52F, 61M and 32M) each yielded at least 2 sequences. The groupings for 49F, 52F and 61M are confirmed by independent news articles [6, 7]. While repeated sequencing facilitates the error assessment that follows, such duplicated effort further shrank the already small data set available for any meaningful study of early SARS-CoV-2 evolution or epidemiology. The impact on a haplotype tree is shown in Fig 3.

Sequencing and/or assembly errors were evaluated conservatively under the following rules, with end indels ignored:

  1. When there are only two genomic sequences of a sample, their divergence is the total error count of the two sequences.
  2. Because of the high quality and wide acceptance of the SARS-CoV-2 reference Wuhan-Hu-1, a genomic sequence is considered error-free if it is a perfect match to the reference.
  3. Sequences that match each other perfectly and are not duplicated submissions are considered error-free.
  4. Once a first error-free sequence of a sample is established, the other sequences of the same sample are measured by their divergence from that sequence.
  5. Sequences that cannot be evaluated by the above rules are left unevaluated.
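
The rules above can be sketched in Python as follows. This is an illustrative sketch, not the procedure actually used for Fig 4. It assumes the sequences of a sample are already aligned to equal length with end indels trimmed, that duplicated submissions have been removed, and that divergence is a simple count of mismatched positions. The precedence (rules 2 and 3 tried before rule 1) and the names `divergence` and `evaluate_sample` are also assumptions.

```python
from itertools import combinations


def divergence(a, b):
    """Number of mismatched positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def evaluate_sample(seqs, reference):
    """Apply the five rules to the sequences of one sample.

    seqs: {sequence_id: aligned_sequence}, duplicated submissions removed.
    Returns (outcome, errors_per_sequence, unlocated_pair_total).
    """
    ids = list(seqs)
    # Rule 2: a perfect match to the reference is error-free.
    baseline = next(
        (i for i in ids if divergence(seqs[i], reference) == 0), None
    )
    # Rule 3: two sequences that match each other perfectly are error-free.
    if baseline is None:
        baseline = next(
            (i for i, j in combinations(ids, 2)
             if divergence(seqs[i], seqs[j]) == 0),
            None,
        )
    if baseline is not None:
        # Rule 4: measure every sequence against the error-free one.
        errors = {i: divergence(seqs[i], seqs[baseline]) for i in ids}
        return "baseline", errors, None
    if len(ids) == 2:
        # Rule 1: the errors cannot be assigned to either sequence,
        # so only the total for the pair is reported.
        return "pair", {}, divergence(seqs[ids[0]], seqs[ids[1]])
    # Rule 5: nothing reliable to compare against, so abstain.
    return "abstain", {}, None
```

Under rule 1 the function returns only a pair total, which reflects why those errors cannot be located on a specific sequence.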

As shown in Fig 4, among the 16 genomic sequences evaluated, at least 17 errors were counted in 7 (44%) sequences. Except under rule 1, the errors can be located and compared with variants in other strains. All located errors are unique to the affected sequences. The sequence with the most errors (6) was flagged by CNCB’s QC pipeline as containing densely clustered mutations. The farthest tip in Fig 4 was also flagged, but was left unevaluated.
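
One way to check that a located error is unique to its sequence is to count how many sequences carry each variant. The sketch below assumes each sequence is summarized as a set of `(position, alternate_base)` variants relative to the reference; the function name and the toy positions are hypothetical and do not come from the data set.

```python
from collections import Counter


def private_variants(variants_by_seq):
    """Return, for each sequence, the variants found in no other sequence."""
    counts = Counter(
        v for variants in variants_by_seq.values() for v in set(variants)
    )
    return {
        seq_id: {v for v in variants if counts[v] == 1}
        for seq_id, variants in variants_by_seq.items()
    }


# Toy variant sets with made-up positions, for illustration only.
toy = {
    "s1": {(100, "T")},
    "s2": {(100, "T"), (250, "G")},
    "s3": set(),
}
result = private_variants(toy)
```

Here `(100, "T")` is shared by two sequences, while `(250, "G")` appears only in `s2` and is the kind of variant that would show up as a private tip on a tree.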

In summary, samples from at least 4 early COVID-19 patients in Wuhan were sequenced repeatedly, with 1-6 errors in an estimated 44% of the resulting genomic sequences. Such errors can create false branches in phylogenetic trees. The finding supports the need to sequence more early strains at higher quality in order to trace the evolution of SARS-CoV-2.


I gratefully acknowledge the Authors, the Originating and Submitting Laboratories for their sequences, metadata and tools shared through CNCB and GISAID, on which this research is based.


  1. Yu, Wen-Bin, Tang, Guang-Da, Zhang, Li, Corlett, Richard T… (2020). Decoding evolution and transmissions of novel pneumonia coronavirus using the whole genomic data. [ChinaXiv:202002.00033]
  2. Zhao WM, Song SH, Chen ML, et al. The 2019 novel coronavirus resource. Yi Chuan. 2020;42(2):212–221. doi:10.16288/j.yczz.20-030 [PMID: 32102777]
  3. Huang, C., Wang, Y., Li, X., et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020; 395: 497–506.
  4. Wu, F., Zhao, S., Yu, B. et al. A new coronavirus associated with human respiratory disease in China. Nature (2020).
  5. Li-Li Ren, Ye-Ming Wang, et al. Identification of a novel coronavirus causing severe pneumonia in human: a descriptive study. Chinese Medical Journal (English): February 11, 2020. doi: 10.1097/CM9.0000000000000722
  6. Caixin 2020-02-27. Tracking gene sequencing of the novel coronavirus: when did the alarm go off? (财新网: 新冠病毒基因测序溯源:警报是何时拉响的)
  7. Southern Weekly 2020-03-05. “Restructuring” Jinyintan: Secrets in the Eye of Storm (南方周末: “重组”金银潭:疫情暴风眼的秘密)

Please see the figures in ncovQC.pdf (1.6 MB)