Evidence Against the Veracity of SARS-CoV-2 Genomes Intermediate between Lineages A and B
Jonathan Pekar, Edyth Parker, Jennifer L. Havens, Marc A. Suchard, Kristian G. Andersen, Niema Moshiri, Michael Worobey, Andrew Rambaut, Joel O. Wertheim
Early SARS-CoV-2 genomic diversity can be separated into two primary lineages. Lineage B includes the reference genome Hu-1 and is defined by nucleotides C8782 and T28144, whereas lineage A is defined by substitutions C8782T and T28144C, relative to the reference genome. Intermediate sequences, containing either C8782T or T28144C—but not both—have been reported from early 2020. We refer to these genomes as C/C or T/T, because they have the same nucleotide at these two key sites. Here, we investigate the veracity of these sequences and conclude it is probable that neither C/C nor T/T genomes circulated at the start of the COVID-19 pandemic; they are likely the result of sequencing or bioinformatics issues.
We downloaded from GISAID all complete, high-coverage SARS-CoV-2 consensus genomes collected by 28 February 2020 and submitted by 31 December 2020—a table with acknowledgments is given below (1). We restricted our analysis to this time period because we were concerned with diversity at the start of the pandemic. We excluded all animal samples (i.e., bat and pangolin), along with any sequences that had an incomplete collection date, leaving 1716 sequences. These genomes were aligned with MAFFT v7.453 (2) (options --auto --keeplength --addfragments) to reference genome MN908947.3 (GISAID accession EPI_ISL_402125). Genomes with an ambiguous nucleotide at site 8782 or 28144 were excluded. We masked all problematic sites associated with common sequencing errors identified by De Maio et al. (2020) (3).
We then looked for pairs of genomes comprising an intermediate genome (C/C or T/T) and a major lineage (lineage A or B) that shared derived mutations. As demonstrated in Figure 3 of Worobey et al. (2020) (4), the repeated observation of pairs of genomes with apparent homoplasies can more parsimoniously be explained by sequencing error. In other words, we looked at whether putative A/B intermediates shared with “pure” A or “pure” B virus genomes one or more mutations outside of those that define the two pure lineages. If so, then either those mutations arose independently in both the putative intermediate and its pure counterpart, or (more likely) the putative intermediate is not an intermediate at all, and is actually a pure A or B lineage very closely related to its pure-lineage counterpart. This latter scenario implies that the C/C or T/T pattern in the putative intermediate was due to an error in inferring either site 8782 or site 28144.
There were 28 C/C genomes collected by 28 February 2020. Of the 28 C/C genomes, 6 have no additional mutations other than at 28144. We identified 16 C/C genomes that share nucleotide substitutions also found in lineage A (Fig. 1). For example, one C/C sequence from Anhui (EPI_ISL_1069206) shares the mutation A11430G with 7 lineage A genomes. Occasionally multiple such mutations are shared: for example, a C/C genome from Thailand (EPI_ISL_437614) sharing G20134T, A895G, and G24047A with 3 other lineage A genomes from Thailand.
We also identified 11 C/C genomes that share substitutions found within lineage B (Fig. 2), including a Spanish C/C sequence (EPI_ISL_539558) sharing C22444T with 4 lineage B genomes and C26088T with an additional lineage B genome (Fig. 2). This latter example showcases potential sequencing issues beyond the occurrence of putative homoplasies, where apparent mutations prevent straightforward pairing of sequences from separate lineages based on one or several mutations. There are multiple occurrences of several taxa from a particular lineage (e.g., C/C) containing different subsets of mutations from a (set of) taxa from another lineage (e.g., A), as indicated by the brackets on the line connecting sequences in Figures 1 and 2. Notably, 9 of the C/C genomes share substitutions with both A and B lineages, whereas 4 contain substitutions not seen in other lineages.
There were 10 T/T genomes collected by 28 February 2020. Two T/T genomes (EPI_ISL_418251 and EPI_ISL_418247) have additional mutations C3037T and A23403G common among later lineage B sequences. Another two T/T sequences (EPI_ISL_728154 and EPI_ISL_416615) possess mutations G11410A and G26211T. However, aside from the T/T lineage, these two mutations are only seen separately, with G11410A found frequently in lineage B sequences from Japan and the Diamond Princess cruise ship, and G26211T seen in a lineage B Chinese genome (EPI_ISL_411952). Additionally, one T/T genome sampled in Wuhan (EPI_ISL_493180) contains the mutation C13730T, which, though not seen in other sequences in our dataset, persists within a lineage B clade through early 2021 (e.g., EPI_ISL_1322330). One of the T/T sequences has a mutation not seen in other lineages, and 4 do not have any additional mutations.
As discussed in Worobey et al. (2020) (4), the repeated occurrence of numerous derived mutations on either side of a given mutation is difficult to reconcile through homoplasy events. Of the 77 mutations seen in C/C intermediate genomes, 32 (41.6%) would need to be homoplasies if these C/C intermediates actually existed. Similarly, 7 (58.3%) of the 12 mutations seen in T/T genomes would need to be homoplasies if the T/T intermediates truly existed. These apparent homoplasies can arise from issues regarding sample preparation, contamination, sequencing technology, and/or consensus calling approaches (3). In particular, it seems likely that the nucleotide of the Hu-1 lineage B reference is frequently being called at these two sites.
These findings cast substantial doubt on the veracity of C/C or T/T intermediate genomes in early 2020. We suggest that these early C/C and T/T genomes are erroneous and should be excluded from phylogenetic analyses.
Figure 1. Phylogeny of representative sequences from SARS-CoV-2 lineage A and intermediate C/C genomes. Mutations relative to the Hu-1 reference genome are shown above each branch, comparing lineage A and C/C. Lineage defining mutations are colored in red. Derived mutations that are not shared by both lineages are excluded. Branches are connected to taxon names with horizontal dashed lines. The taxon names are GISAID accession numbers, and in cases where more than one sequence is represented, the total number of additional matching homoplastic sequences is indicated after the “+” symbol. Sequences that share derived mutations are connected by the lines on the right, and brackets indicate that a group of sequences share the derived mutations that cannot be individually resolved. For clarity, lines are presented as alternating dashed and solid. Several lineage A genomes with ambiguous placement were also excluded for clarity.
Figure 2. Phylogeny of representative sequences from SARS-CoV-2 lineage B and intermediate C/C genomes. Mutations relative to the Hu-1 reference genome are shown above each branch, comparing lineage B and C/C. Refer to Figure 1 for detailed explanation of format.
supplementary_table.zip (73.8 KB)
Supplementary table 1. GISAID acknowledgements.
Y. Shu, J. McCauley, GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 22, 30494 (2017).
K. Katoh, D. M. Standley, MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 30, 772–780 (2013).
N. De Maio, C. Walker, R. Borges, L. Weilguny, G. Slodkowicz, N. Goldman, “Issues with SARS-CoV-2 sequencing data,” Virological (2020); Issues with SARS-CoV-2 sequencing data.
M. Worobey, J. Pekar, B. B. Larsen, M. I. Nelson, V. Hill, J. B. Joy, A. Rambaut, M. A. Suchard, J. O. Wertheim, P. Lemey, The emergence of SARS-CoV-2 in Europe and North America. Science. 370, 564–570 (2020).