Evidence Against the Veracity of SARS-CoV-2 Genomes Intermediate between Lineages A and B
Jonathan Pekar, Edyth Parker, Jennifer L. Havens, Marc A. Suchard, Kristian G. Andersen, Niema Moshiri, Michael Worobey, Andrew Rambaut, Joel O. Wertheim
Early SARS-CoV-2 genomic diversity can be separated into two primary lineages. Lineage B includes the reference genome Hu-1 and is defined by nucleotides C8782 and T28144, whereas lineage A is defined by substitutions C8782T and T28144C, relative to the reference genome. Intermediate sequences, containing either C8782T or T28144C—but not both—have been reported from early 2020. We refer to these genomes as C/C or T/T, because they have the same nucleotide at these two key sites. Here, we investigate the veracity of these sequences and conclude it is probable that neither C/C nor T/T genomes circulated at the start of the COVID-19 pandemic; they are likely the result of sequencing or bioinformatics issues.
Methods
We downloaded from GISAID all complete, high-coverage SARS-CoV-2 consensus genomes collected by 28 February 2020 and submitted by 31 December 2020—a table with acknowledgments is given below (1). We restricted our analysis to this time period because we were concerned with diversity at the start of the pandemic. We excluded all animal samples (i.e., bat and pangolin), along with any sequences that had an incomplete collection date, leaving 1716 sequences. These genomes were aligned with MAFFT v7.453 (2) (options --auto --keeplength --addfragments) to reference genome MN908947.3 (GISAID accession EPI_ISL_402125). Genomes with an ambiguous nucleotide at site 8782 or 28144 were excluded. We masked all problematic sites associated with common sequencing errors identified by De Maio et al. (2020) (3).
We then looked for pairs of genomes comprising an intermediate genome (C/C or T/T) and a major lineage (lineage A or B) that shared derived mutations. As demonstrated in Figure 3 of Worobey et al. (2020) (4), the repeated observation of pairs of genomes with apparent homoplasies can more parsimoniously be explained by sequencing error. In other words, we looked at whether putative A/B intermediates shared with “pure” A or “pure” B virus genomes one or more mutations outside of those that define the two pure lineages. If so, then either those mutations arose independently in both the putative intermediate and its pure counterpart, or (more likely) the putative intermediate is not an intermediate at all, and is actually a pure A or B lineage very closely related to its pure-lineage counterpart. This latter scenario implies that the C/C or T/T pattern in the putative intermediate was due to an error in inferring either site 8782 or site 28144.
Results
There were 28 C/C genomes collected by 28 February 2020. Of the 28 C/C genomes, 6 have no additional mutations other than at 28144. We identified 16 C/C genomes that share nucleotide substitutions also found in lineage A (Fig. 1). For example, one C/C sequence from Anhui (EPI_ISL_1069206) shares the mutation A11430G with 7 lineage A genomes. Occasionally multiple such mutations are shared: for example, a C/C genome from Thailand (EPI_ISL_437614) sharing G20134T, A895G, and G24047A with 3 other lineage A genomes from Thailand.
We also identified 11 C/C genomes that share substitutions found within lineage B (Fig. 2), including a Spanish C/C sequence (EPI_ISL_539558) sharing C22444T with 4 lineage B genomes and C26088T with an additional lineage B genome (Fig. 2). This latter example showcases potential sequencing issues beyond the occurrence of putative homoplasies, where apparent mutations prevent straightforward pairing of sequences from separate lineages based on one or several mutations. There are multiple occurrences of several taxa from a particular lineage (e.g., C/C) containing different subsets of mutations from a (set of) taxa from another lineage (e.g., A), as indicated by the brackets on the line connecting sequences in Figures 1 and 2. Notably, 9 of the C/C genomes share substitutions with both A and B lineages, whereas 4 contain substitutions not seen in other lineages.
There were 10 T/T genomes collected by 28 February 2020. Two T/T genomes (EPI_ISL_418251 and EPI_ISL_418247) have additional mutations C3037T and A23403G common among later lineage B sequences. Another two T/T sequences (EPI_ISL_728154 and EPI_ISL_416615) possess mutations G11410A and G26211T. However, aside from the T/T lineage, these two mutations are only seen separately, with G11410A found frequently in lineage B sequences from Japan and the Diamond Princess cruise ship, and G26211T seen in a lineage B Chinese genome (EPI_ISL_411952). Additionally, one T/T genome sampled in Wuhan (EPI_ISL_493180) contains the mutation C13730T, which, though not seen in other sequences in our dataset, persists within a lineage B clade through early 2021 (e.g., EPI_ISL_1322330). One of the T/T sequences has a mutation not seen in other lineages, and 4 do not have any additional mutations.
Conclusion
As discussed in Worobey et al. (2020) (4), the repeated occurrence of numerous derived mutations on either side of a given mutation is difficult to reconcile through homoplasy events. Of the 77 mutations seen in C/C intermediate genomes, 32 (41.6%) would need to be homoplasies if these C/C intermediates actually existed. Similarly, 7 (58.3%) of the 12 mutations seen in T/T genomes would need to be homoplasies if the T/T intermediates truly existed. These apparent homoplasies can arise from issues regarding sample preparation, contamination, sequencing technology, and/or consensus calling approaches (3). In particular, it seems likely that the nucleotide of the Hu-1 lineage B reference is frequently being called at these two sites.
These findings cast substantial doubt on the veracity of C/C or T/T intermediate genomes in early 2020. We suggest that these early C/C and T/T genomes are erroneous and should be excluded from phylogenetic analyses.
Figure 1. Phylogeny of representative sequences from SARS-CoV-2 lineage A and intermediate C/C genomes. Mutations relative to the Hu-1 reference genome are shown above each branch, comparing lineage A and C/C. Lineage defining mutations are colored in red. Derived mutations that are not shared by both lineages are excluded. Branches are connected to taxon names with horizontal dashed lines. The taxon names are GISAID accession numbers, and in cases where more than one sequence is represented, the total number of additional matching homoplastic sequences is indicated after the “+” symbol. Sequences that share derived mutations are connected by the lines on the right, and brackets indicate that a group of sequences share the derived mutations that cannot be individually resolved. For clarity, lines are presented as alternating dashed and solid. Several lineage A genomes with ambiguous placement were also excluded for clarity.
Figure 2. Phylogeny of representative sequences from SARS-CoV-2 lineage B and intermediate C/C genomes. Mutations relative to the Hu-1 reference genome are shown above each branch, comparing lineage B and C/C. Refer to Figure 1 for detailed explanation of format.
supplementary_table_1.zip (73.8 KB)
Supplementary table 1. GISAID acknowledgements.
References
-
Y. Shu, J. McCauley, GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill. 22, 30494 (2017).
-
K. Katoh, D. M. Standley, MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 30, 772–780 (2013).
-
N. De Maio, C. Walker, R. Borges, L. Weilguny, G. Slodkowicz, N. Goldman, “Issues with SARS-CoV-2 sequencing data,” Virological (2020); https://pando.tools/t/issues-with-sars-cov-2-sequencing-data/473.
-
M. Worobey, J. Pekar, B. B. Larsen, M. I. Nelson, V. Hill, J. B. Joy, A. Rambaut, M. A. Suchard, J. O. Wertheim, P. Lemey, The emergence of SARS-CoV-2 in Europe and North America. Science. 370, 564–570 (2020).
Addendum
Above we provide evidence that C/C and T/T genomes that are purportedly intermediate to SARS-CoV-2 lineages A and B at the start of the COVID-19 pandemic are likely the result of sequencing or bioinformatics issues. However, this conclusion does not preclude the occurrence of such genomes later in the pandemic. For example, there is good evidence for two T/T genomes from the Diamond Princess outbreak (EPI_ISL_416615 and EPI_ISL_728154). These identical genomes were sequenced by two separate groups (1, 2), with a documented 99.7% allele frequency of T at site 8782 for EPI_ISL_728154 (1). The outbreak on the Diamond Princess was likely a single origin event in February 2020, and all genomes associated with this outbreak are lineage B (i.e., C8782 and T28144) with an additional G11083T substitution. Though this substitution is frequently masked in SARS-CoV-2 analyses (3), it is present in all genomes associated with the Diamond Princess, with >99% allele frequency of T at this site showcased in samples from Murata et al. (1, 2). Therefore, this substitution is a clade-defining marker, allowing us to interrogate mutations in the Diamond Princess clade separate from earlier SARS-CoV-2 genomes. We built a maximum likelihood phylogeny of 83 genomes from the Diamond Princess outbreak and reference genome MN908947.3 (GISAID accession EPI_ISL_402125) using IQ-TREE 2 (4), with a GTR+F substitution model and a minimum branch length of 1e-10 substitutions/site to collapse short branches (Fig. 3, genome accessions provided in Supplementary Table 2).
In agreement with previous phylogenetic analysis (1, 2), the two T/T genomes of interest occupy a derived position in the phylogeny and are part of a clade defined by a G11401T mutation, shared by 3 other genomes from the Diamond Princess, both encoding C at position 8782 (Fig. 3). The two T/T genomes also encode C8139T and G26211T, two additional substitutions not seen elsewhere in the Diamond Princess outbreak. This example showcases that T/T intermediate genomes can arise later in the pandemic, but still do not provide evidence for transitional genomes between lineages A and B at the start of the COVID-19 pandemic.
Figure 3. Maximum likelihood phylogeny of Diamond Princess genomes. The tree is rooted on Hu-1. Substitutions found in T/T genomes relative to Hu-1 annotated on branches. The G11410T clade is colored blue, with the branch leading to the T/T genomes colored red.
supplementary_table_2.zip (2.1 KB)
Supplementary table 2. GISAID acknowledgements for Diamond Princess analysis.
References
-
T. Murata, A. Sakurai, M. Suzuki, S. Komoto, T. Ide, T. Ishihara, Y. Doi, Shedding of Viable Virus in Asymptomatic SARS-CoV-2 Carriers. mSphere. 6 (2021), doi:10.1128/mSphere.00019-21.
-
T. Sekizuka, K. Itokawa, T. Kageyama, S. Saito, I. Takayama, H. Asanuma, N. Nao, R. Tanaka, M. Hashino, T. Takahashi, H. Kamiya, T. Yamagishi, K. Kakimoto, M. Suzuki, H. Hasegawa, T. Wakita, M. Kuroda, Haplotype networks of SARS-CoV-2 infections in the Diamond Princess cruise ship outbreak. Proc. Natl. Acad. Sci. 117, 20198–20201 (2020).
-
N. De Maio, C. Walker, R. Borges, L. Weilguny, G. Slodkowicz, N. Goldman, “Issues with SARS-CoV-2 sequencing data,” Virological (2020); https://pando.tools/t/issues-with-sars-cov-2-sequencing-data/473.
-
B. Q. Minh, H. A. Schmidt, O. Chernomor, D. Schrempf, M. D. Woodhams, A. von Haeseler, R. Lanfear, IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 37, 1530–1534 (2020).