Issues with SARS-CoV-2 sequencing data

NicolaDeMaio · May 5, 2020, 12:27pm

Issues with SARS-CoV-2 sequencing data

Nicola De Maio^1*, Conor Walker¹, Rui Borges², Lukas Weilguny¹, Greg Slodkowicz³, Nick Goldman¹

¹European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom.
²Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, Wien 1210, Austria.
³MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge CB2 0QH, United Kingdom.
^*[email protected]

Summary

We investigate oddities in the SARS-CoV-2 genome sequences from GISAID. Many putative sequencing issues seem specific to genomic ends and to certain samples, and are easily filtered out. However, many mutations seem to arise many times along the phylogenetic tree (are highly homoplasic), and seem more likely the result of contamination, recurrent sequencing errors, or hypermutability, than selection or recombination. Some homoplasic substitutions seem laboratory-specific, suggesting that they might arise from specific combinations of sample preparation, sequencing technology, and consensus calling approaches.

To help other researchers in similar efforts or who rely on available SARS-CoV-2 genome sequences in downstream analyses, we summarize the steps that we have recognised so far useful for filtering and masking alignments of SARS-CoV-2 sequences. We also hope that this will spark a discussion regarding the best methods to identify and interpret such peculiar variants and samples, and we provide here a list of filters that so far we think reasonable.

First, we propose to mask alignment ends (as most of us already do), which are affected by low coverage and high rate of apparent sequencing/mapping errors. We mask positions 1–55 and 29804–29903 when aligned to reference MN908947.3, but other more or less stringent choices are possible.

Secondly, we propose masking sites that appear to be highly homoplasic and have no phylogenetic signal and/or low prevalence – these can be recurrent artefacts, or otherwise hypermutable low-fitness sites that might similarly cause phylogenetic noise. A current list of these is:
187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700.
We provide technical details of how these sites were identified below, however please note that all lists of sites outlined here are a work in progress, and might be affected by many choices made in the preliminary phylogenetic steps.

In addition, we suggest masking any homoplasic positions that are exclusive to a single sequencing lab or geographic location, regardless of phylogenetic signal. Here the phylogenetic signal might be caused by a common source of error (among other things). Our current list is:
4050, 13402.

We also recommend masking of positions that, despite having strong phylogenetic signal, are also strongly homoplasic. These may be caused by hypermutability at certain positions, although it is hard to rule out any possibility for now. Our current list is:
11083, 15324, 21575.

Finally, as other groups have already suggested (see e.g. [12]), we recommend filtering out sequences that: have too few resolved characters (our somewhat arbitrary threshold is about 29,400 reference bases), are too diverged (as can be tested using TreeTime), have unusual locally high divergence (as can be tested using ClonalFramML), have missing/incomplete sampling date information, or that are distant from any other sequence in the dataset (we use a custom script to remove all sequences that are at least three substitutions away from any other sequence). We don’t provide a current list as this is quite long and varies as the number of publicly shared SARS-CoV-2 genome sequences increases.

Detailed analyses

GISAID alignment preparation

We downloaded 5894 SARS-CoV-2 consensus genomes from GISAID [1] on 11th April 2020, and removed animal samples (bat and pangolin). We also removed sequences with less than 29,400 informative nucleotides. The remaining sequences were aligned with MAFFT v7.453 [2] (options --auto --keeplength --addfragments ) to reference genome MN908947.3, which is 29,903bp long. We then masked genome positions covered by less than 90% of the sequences in the alignment (genomic ends 1–55 and 29804–29903) as these low-coverage positions are seemingly prone to putative alignment and/or calling errors (see e.g. Figure 1). So far these filters are quite obvious and have been used, in some version, by other studies so far.

Figure 1: right end of a MAFFT alignment showing seemingly problematic alignments at the end of the genome.

We then ran IQTree v1.6.12 [3] (options -st DNA -m HKY) on the resulting alignment. A few samples have extremely long terminal branches (Figure 2), suggesting either evolutionary events leading to many substitutions (e.g. recombination events or large mutation events), or sequencing/calling artefacts in specific samples.

Figure 2: IQTree phylogeny showing some very long terminal branches.

Investigating recombinations and clusters of mutations/artefacts

To further investigate, and to look into possible recombination events, homoplasies (mutation events seemingly happening multiple times along the phylogeny) and mutational clusters, we ran ClonalFrameML v1.12 [4]. This software found 30 putative recombination events happening at terminal branches, consistent with the long terminal branches in Figure 2. These putative events appear as clusters of substitutions in individual samples and genomic positions (e.g. Figure 3). Given that these mutations are not observed in any other samples, they could represent artefacts in the corresponding sequence. We remove the samples presenting these mutational clusters from further analyses; this filtering strongly overlaps with the filtering of samples showing inconsistent temporal signal in TreeTime [5] (see below).

Figure 3: example of a mutation cluster in a terminal branch.

ClonalFrameML also suggests 3 putative recombination events (position 6971–6979, 13686–13693, and 28881–28883) happening on internal branches, however, these are all very short, and contain mutations not found on other branches of the tree (they are not homoplasies), and therefore these are more likely clusters of substitutions caused by individual multi-nucleotide mutations. We masked these mutational clusters from the alignment, as these events can bias classical phylogenetic and evolutionary analyses. The most striking of these putative multi-nucleotide mutations is 28881–28883 (Figure 4), replacing GGG with AAC. This is the only one of these mutations reaching high frequency in the population (776 sequences, or 16.5% of the samples, while 6971–6979 and 13686–13693 only appear in 2 samples each). The three substitutions at 28881–28883 seem to only appear in complete linkage, with the exception of G28881A that also appears in another two samples. (However, these samples were later filtered from the dataset in following steps.) Another interesting aspect is that the 28881–28883 mutation also seems to appear in other non-phylogenetically related samples as within-host polymorphism (Figure 4). This hints at a considerable proportion of cases being either mixed infections (patients infected with multiple strains of SARS-CoV-2) or maybe in some cases contaminated at low frequency.

Figure 4: alignment extract around positions 28881-28883 showing the 3-nucleotide substitution. The reference GGG haplotype or the alternative one AAC are present in most samples. In some samples, like the highlighted one, however, show “RRS” at these positions: these are the ambiguity codes for the reference and alternative alleles combined, suggesting that both alleles were present in the same sample.

In addition to these clusters of mutations, ClonalFrameML also reveals homoplasies, mutations apparently appearing multiple times along the phylogeny. When multiple homoplasies happen on the same branch, they are the hallmark of recombination. However, we only observe isolated homoplasies (Figure 5), suggesting either recurring mutations, or recurrent sequencing/calling artefacts (see e.g. [6]), rather than very short recombination events at given hotspots, as suggested by some previous studies (see e.g. [7-8]). We investigate these homoplasies further in the next section.

Figure 5: ClonalFrameML representation of putative recombinations (dark blue segments, later found to be clusters of mutations masked from downstream analyses), normal substitutions (white dots) and homoplasies (yellow to red dots, red being stronger homoplasies). Zoom in to see the individual dots. On the X axis are genome positions; the Y axis shows different tree branches (the phylogeny is not shown).

We repeated the ClonalFrameML analysis after the removing the clusters of substitutions found above, and specifying strong priors so to favour the detection of large recombination events at low levels of divergence. No further recombination events were inferred.

Spectrum of mutations

We find a total 2775 SNPs, of which 1085 are non-singletons (i.e. appearing in at least 2 samples), and of which 45 have at least 1% frequency. The mutational spectrum of new mutations seems enriched in C→T and G→T mutations (Figure 6), and while T is the most frequent nucleotide in the genome, its frequency seems to be further increasing, at least before population-level selection is accounted for. Also, the substitution process seems strongly non-reversible and not at equilibrium.
For a more in depth analysis of how read processing pipelines affect the called mutations, see Rayko and Komissarov [13].

Figure 6: numbers of variants of each type classified based on the ancestral allele (row) and the derived allele (column). Here the reference allele is assumed to be ancestral for simplicity, but alternative roots do not change the overall patterns. The last table on the right is from within-host read data (see below).

As additional filtering steps, we removed duplicate samples, samples with incomplete date information, and highly diverged samples as filtered by NextStrain [9] (https://github.com/nextstrain/ncov/blob/master/config/exclude.txt), but not those excluded due to uncertain geographic origin. We also used TreeTime v0.7.5 to test and remove sequences with a genetic distance from the root that is significantly higher than expected based on their sampling time.

Sequencing technology, sequencing lab, and homoplasies

As mentioned before, we noticed many homoplasies in this dataset, more than expected by chance (see Figure 7, e.g. 33 sites with 4 mutations compared to expected 0.24 of them), and more homoplasic than expected by chance (e.g. 5 sites hit by more than 9 mutations, while no such site should have been observed). Baseline null distributions were calculated using the Poisson null in TreeTime, which however ignores some complications such as the effects of selection and possible hypermutable positions. Some of these mutations, particularly G11083T, seem geographically ubiquitous, and appear in consensus sequences generated by any technology. Position 11083 seems to have mutated 28 times, both from G to T (21 times) and back from T to G (7 times).

Other mutations are Illumina or nanopore-specific, and some seem specific to both nanopore and to some countries (for example T13402G, A4050C, T13408C, T8022G, C3130T, T28785G).

Figure 7: TreeTime homoplasy results. On the right, only the most homoplasic mutations are shown. Red lines: only present in ONT; orange: only homoplasic in ONT; dark blue: only present in Illumina; light blue: only homoplasic in Illumina.

To investigate this pattern further, we split our dataset into 3 groups: (A) Illumina sequences (n=2253), (B) nanopore sequences not sequenced at KU Leuven (n=550), and (C) nanopore sequences sequenced at KU Leuven (n=186). We only included sequences from labs that submitted large numbers of sequences consistently sequenced with the same technology. C nanopore sequences were from USA, England, Belgium, Netherlands, Germany and Canada. Illumina sequences were from USA, China, Australia, Japan, Singapore, France, DRC, Luxembourg, Portugal, Wales, Canada and Iceland.

We re-analysed these 3 datasets individually using TreeTime. The most predominant homoplasies were common across technology and country, including G11083T, C16887T, C21575T and C15324T. Others were exclusive to Illumina (most remarkably C11074T, C6990T, C29353T, and C29774T) while others were exclusive to nanopore. Surprisingly dataset C contained more homoplasies than B (Figure 8), of which many are mutations only found in C (e.g. T13402G, A4050C, T13408C, T8022G, C3130T, T28785G). Each of these homoplasies could in principle also be caused by issues with phylogenetic inference, which itself can be affected by homoplasic substitutions.

Figure 8: homoplasies in datasets A, B and C.

Given that B also contains 87 samples from Belgium, this suggests that at least some of the mutation-homoplasies specific to C could be the result of recurrent artefacts. Looking further into some of the most homoplasic of these sites, we see that:
T13402G is a nonsense mutation, and seems to back-mutate 4 times;
T13408C is synonymous, mostly appearing in T13402G mutants; A and G alleles are also observed at the same site 13408 in dataset C.
These two observations further suggest the possibility that selection is not involved in the emergence of these mutations. However, A4050C and T8022G are nonsynonymous.

To test for the possibility that some of these homoplasies might be the result of phylogenetic errors, we masked the most common homoplasies from these datasets one part at a time, to see if other homoplasies would disappear. This had almost no effect.

Another possibility is that these homoplasies might be caused by some of the samples being particularly enriched in recurrent artefacts, rather than artefacts distributed uniformly across all samples. To test this, we compared the sequences of datasets A–C against the full dataset and recorded the minimum number of differences between each sample and any other sample in the dataset. In some sense, this is a measure of the minimum distance of a sample from the rest of the dataset. Most samples have other very small distances (0–1 differences 85% of times for A, 88% for B, 77% for C, median of 0 for A and B and 1 for C). Samples with a distance of at least 3 substitutions were 5% for A and B, and 13% for C. Removing samples with a distance of at least 3 substitutions from A, B or C did not noticeably reduce the number of homoplasies in the corresponding datasets, so it seems like homoplasies and high-distance samples are possibly two separate effects of the same underlying cause.

For future analyses, we filter out all samples with a distance of at least 3 from the rest of the dataset.

Individual noticeable homoplasies

We will now discuss some of the most common homoplasies. As mentioned above, G11083T is the most frequent, appearing 679 times and apparently mutating 21 times forward and 7 times reverting to the original T allele. The T allele is observed in different sequencing technologies and different countries. This is a new non-synonymous (L to F) mutation (ORF1a 3606 nsp6 37) and also is considered one of the best candidates for positive selection ( Notice: this page has been deprecated in favor of a fully redesigned and updated version 2 / Sergei Pond | Observable ), but this homoplasy has also been interpreted in literature as the result of frequent recombination [7]. It appears in all samples from the Diamond Princess cruise ship. Notably the mutation is next to the longest non-terminal homopolymer in the genome, further extending it from 8 consecutive T’s to 10 (Figure 9). The mutation also appears in samples as a within-host polymorphism, as can be seen from the presence of isolated N’s (17 times) and K’s (9 times) in the alignment (see e.g. Figure 9). We also observe this from read files from the Sequence Read Archive (see next section), where the mutation appears as within-patient polymorphism 16 times, and even more often with less stringent filtering and variant calling. When applying more stringent read filtering, the frequency of the T allele seems to consistently decrease. Considering all of these observations, we think that G11083T might be a particularly frequent mutation or artefact. It is unlikely to be the result of positive selective pressure at the amino acid level, as the mutation seems apparently to revert to the original allele many times, and as the same amino acid substitution L→F would also be obtained with the substitution G11083C, which however we never observed in our data.

Figure 9: alignment extract around position 11083.

As a consequence of the observations above, we suspect frequent homoplasies specific to dataset C could be artefacts (or possibly normal mutations that appear as homoplasic due to phylogenetic errors). These include T13402G, which is a nonsense mutation appearing in 51 sequences; its neighbour 13408, in which all three non-reference alleles are observed; and A4050C, a nonsynonymous mutation appearing in 18 sequences. Others are less frequent, and it is particularly unclear if their homoplasy might be caused by phylogenetic errors; these include T8022G (nonsynonymous, appearing in 5 samples), T28785G and C3130T (appearing only in 4 and 3 samples respectively). Recently the dataset C-specific mutations at positions 24389-24390 have been suggested to be strongly affected by local recombination [9], but this might be a phylogenetic artefact caused by other homoplasies.

A pattern observed in many homoplasies (but not those present in C samples specifically) is that they seem to extend pre-existing poly-T runs. The most remarkable example is G11083T above, but the same is observed with its neighbour C11074T (nonsynonymous, not observed within-host, Illumina-specific), with C21575T (nonsynonymous, also observed as within-host polymorphism in a few samples, ubiquitous), and with the Illumina-specific C6990T.

Other frequently mutating homoplasies are C16887T and C15324T (both synonymous not observed within-host), and C6255T (a synonymous variant particularly common in samples from the Netherlands).

Below we show the phylogeny estimated in IQTree after removing all samples with a distance of at least 3 substitutions from the rest of the dataset, and masking the most predominant and dubious homoplasic sites (3130, 3145, 4050, 6990, 8022, 10323, 11074, 11083, 13402, 13408, 14408, 15324, 16887, 21575, and 29353).

Figure 10: IQtree phylogenetic tree after filtering out mutational clusters/recombinations, diverged and isolated samples, and the most common homoplasies.

To test which of these homoplasies might (also) be actual inherited viral mutations, and which are more likely to be non-inherited sequencing artefacts, we measured the phylogenetic signal present in the most homoplasic and/or lab-specific variants using the methods of Borges et al 2018 [10]. First, we masked the most homoplasic sites from the alignment (apparently mutating four times or more) and the homoplasies specific to one lab and mutating at least three times:
187, 241, 335, 1059, 2094, 3037, 3130, 3145, 4050, 6255, 6990, 8022, 8782, 9223, 10323, 10741, 11074, 11083, 11704, 13402, 13408, 14408, 14724, 14786, 14805, 15324, 16887, 17247, 19684, 20148, 21137, 21575, 23403, 24034, 24378, 25563, 26144, 26461, 26681, 27384, 28077, 28826, 28854, 29353, 29700, 29736.
These sites were determined iteratively, first removing the most homoplasic sites, repeating the phylogenetic inference, and then again removing the most homoplasic sites.

We then inferred a phylogeny from the masked alignment using IQTree. Finally, we measured the phylogenetic signal of each of the homoplasies listed above by coding the observed allele as a categorical variable and using the delta statistic. The delta statistic uses Shannon entropy to measure the degree of phylogenetic signal between traits and phylogenies: the idea is that the better a phylogeny is associated with a given trait, the better it is able to retrace trait’s evolution (or in probabilistic terms, to infer the ancestral states with minimal uncertainty). Polytomies in the phylogeny were randomly split to obtain a bifurcating tree. Most of the homoplasies, in particular those appearing in only a few sequences, show close to no phylogenetic signal (Figure 11), supporting the hypothesis that they are artificial. On the other hand, many homoplasies, including site 11083, show strong phylogenetic signal, suggesting that they originated, at least once, as a true mutational event. Of course, this analysis has substantial limitations, it ignores uncertainty in the tree and noise from possible artefacts within the remaining sites. It also does not tell us if a variant is both a true mutational event as well as the result of recurrent sequencing biases, as possible for position 11083.

Figure 11: Phylogenetic signal of candidate homoplasic sites using the method of Borges et al [10].

Within-sample variation

A possible way to include more genetic variation data and to investigate possible artefacts is to consider within-host variation from sequencing read files. We downloaded the available 450 SARS CoV 2 Illumina sequencing FASTQ files from the Sequence Read Archive (SRA, Home - SRA - NCBI ) on 12th April 2020. The pipeline we used for processing the reads in the within-sample variation section is available at GitHub - conorwalker/covid19 . From preliminary analyses of the 8 sequencing datasets of Shen et al 2020 [11] we realized that in order to remove most variants that might emerge from mapping and sequencing artefacts, stricter filters than usual were needed. More specifically, we used fastp (GitHub - OpenGene/fastp: An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)) to:

trim adapters and polyX tails from reads
trim 15 bases from the start and end of each read
only retain reads with high quality sequence (≥Q35) at ≥90% of the read length
remove reads with low quality (<Q30) at the 3’ end
only retain reads of length ≥50 before alignment with bwa-mem (GitHub - lh3/bwa: Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)), possible PCR duplicates were removed using MarkDuplicates from Picard tools (GitHub - broadinstitute/picard: A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.), and any reads aligned through hard or soft clipping were removed from the bam file.

We only consider variants at a site if the depth of coverage is ≥5 and the base quality is ≥30.

The strong filters of removing 15bp from each read end and removing clipped reads were useful because some variants were observed at very high frequency in some samples but only at read ends or only in clipped reads. For example, at reference position 10779, an A/T polymorphism was observed in 7 out of 8 samples from Shen et al 2020, usually with A being the minor allele, and once being the majority allele. In the 8th sample, allele A appeared fixed. We observed allele A only in read ends (see Figure 12), and after trimming read ends this polymorphism was no longer detected. This suggests that recurrent artefacts might be present not only at the level of substitutions between consensus sequences, but also as frequent, apparent within-host variants.

Figure 12: extract of a bam file from a read alignment of a single sample from Shen et al [11]. Position 10779 appears polymorphic, but the minority variant is only observed near read ends.

Generally, most samples contain 0–5 variants (median 1, mean 8.2). However, some samples have many more, reaching 564 and 433 in two cases (SRR11494637 and SRR11494664 respectively). Coverage or mixed infection/contamination do not seem the issues. Instead, these samples were mostly made of clipped reads, i.e. reads that only partly map on the reference genome. Removing these reads eliminated these extreme cases (Figure 13), but other samples with extreme numbers of variants can still be found, for example SRR11494662 with 359 within-host variants. As samples with an extreme number of within-host variants persist, a more rigorous filtering procedure might still be required before we can confidently interpret within-host variant calls. We therefore offer a word of caution when attempting to interpret the results of such variant calling methods, and minimally recommend a stringent set of filters (as outlined above), as well as removing samples with more than 2% of “N”s within reads.

Figure 13: Histogram of the number of Illumina-sequenced samples (Y axis) with a given percentage of “N” within its reads (X axis). We filtered out samples with more than 2% (all those whose name is shown in the plot).

The mutational spectrum of within-host variants seems extremely shifted toward G→T variants, even more than when comparing consensus sequences (Figure 6) suggesting that most of these variants might be the result of Illumina sequencing errors and/or sequencing biases, or otherwise that most of these new G→T mutations do not reach fixation due to selective forces. We have not yet investigated possible nanopore sequencing biases from within-host variation data.

Within-host variation at homoplasic sites

Most homoplasies found only in the consensus sequences rarely appear polymorphic within-host, and they do not show evident signs of mapping or calling issues. G11083T is again exceptional in this regard. This variant is polymorphic in many samples, as mentioned above, and the frequency of the variant T allele seems to decrease with stricter read filtering. Furthermore, this position is also associated with strange deletion patterns (Figure 14), with a 1bp deletion being present in many reads at this position despite this being within a coding sequence. We also found the same pattern at the neighbouring position 11074 for sample SRR11494457 (Figure 15). Here coverage is low, the derived allele T is the higher frequency allele, and a similar deletion pattern to position 11083 is observed. It is not clear to us if these observations suggest that sequencing errors might be common at this position due to the poly-T. We plan to investigate nanopore read data to further look into this issue.

Figure 14: two read alignments near position 11083. Dark red squares represent deleted bases. Top: sample SRR11397719. Bottom: sample SRR11494651.

Figure 15: read alignment near position 11074, sample SRR11494457.

Acknowledgements

We would like to thank all the authors who have kindly deposited and shared genome data on GISAID (https://www.gisaid.org/) and in particular Piet Maes for helpful suggestions. A full table of acknowledgments for GISAID contributors can be found here: ProblematicSites_SARS-CoV2/data/gisaid_hcov-19_acknowledgements.tsv at master · W-L/ProblematicSites_SARS-CoV2 · GitHub . We also thank Andrew Rambaut, Ewan Birney, Umberto Perron and Graham Jones for helpful discussions.

References

[1] Shu, Yuelong, and John McCauley. “GISAID: Global initiative on sharing all influenza data–from vision to reality.” Eurosurveillance 22.13 (2017).
[2] Katoh, Kazutaka, and Daron M. Standley. “MAFFT multiple sequence alignment software version 7: improvements in performance and usability.” Molecular Biology and Evolution 30.4 (2013): 772-780.
[3] Minh, Bui Quang, et al. “IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era.” Molecular Biology and Evolution 37.5 (2020): 1530-1534.
[4] Didelot, Xavier, and Daniel J. Wilson. “ClonalFrameML: efficient inference of recombination in whole bacterial genomes.” PLoS Computational Biology 11.2 (2015).
[5] Sagulenko, Pavel, Vadim Puller, and Richard A. Neher. “TreeTime: maximum-likelihood phylodynamic analysis.” Virus Evolution 4.1 (2018): vex042.
[6] Meacham, Frazer, et al. “Identification and correction of systematic error in high-throughput sequence data.” BMC Bioinformatics 12.1 (2011): 451.
[7] Yi, Huiguang. “2019 novel coronavirus is undergoing active recombination.” Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America (2020).
[8] https://doi.org/10.1101/2020.04.29.069054
[9] Hadfield, James, et al. “Nextstrain: real-time tracking of pathogen evolution.” Bioinformatics 34.23 (2018): 4121-4123.
[10] Borges, Rui, et al. “Measuring phylogenetic signal between categorical traits and phylogenies.” Bioinformatics 35.11 (2019): 1862-1869.
[11] Shen, Zijie, et al. “Genomic diversity of SARS-CoV-2 in coronavirus disease 2019 patients.” Clinical Infectious Diseases (2020).
[12] Forster, Peter, et al. “Phylogenetic network analysis of SARS-CoV-2 genomes.” Proceedings of the National Academy of Sciences 117.17 (2020): 9241-9243.
[13] https://doi.org/10.1101/2020.04.26.062422

sergei.pond · May 5, 2020, 1:15pm

Technically (this is not the most likely explanation), reversion can be considered evidence of a particularly strong selective effect driven by discordant selective pressures in different hosts. G11083T could also be a hyper variable site, but then you might expect a greater repertoire of amino-acids at this position. It’s curious for sure, but these types of sites are exactly what comes up in standard dN/dS analyses – many substitutions (in our analyses we further require these to occur on internal branches of a tree built from unique haplotypes, mitigating sequencing error to a degree).

Interestingly, both the ‘L’ and the ‘F’ versions appear in CTL epitopes covering this site based on the analysis in https://www.biorxiv.org/content/10.1101/2020.03.30.016931v1.abstract (C01 for ‘F’ and B40, A23, A24 for ‘L’).

NicolaDeMaio · May 5, 2020, 2:09pm

A reason why I think that selection at the amino acid level would be strange is also that, while G11083T seems to occur many times (including as within-host variant), we did not find yet examples of G11083C, which codes for the same amino acid substitution. Of course one could also try to explain this somehow (maybe selection at the nucleotide level? Or maybe the site is both hypermutable and under selection?).
We don’t know for sure, G11083T is really baffling, personally I try to be cautious drawing conclusions either way!

sergei.pond · May 5, 2020, 2:30pm

It’s definitely a curious site!

I also looked into the related bat viruses, and there’s ONE sequence that has a G : C mutation at that position in the @david.l.robertson alignment (the others are G:T). It’s also bracketed by what appear to be strongly conserved codons on both sides…

oscar.maclean · May 5, 2020, 3:45pm

Nice one, really nice analysis!

We’d also spotted that the recombination signature in the Yi (2020) paper was driven by convergence at site 11083. I thought that it was likely the convergence was due to polymerase slippage in that run of Ts, and only the in-frame slips were being observed.

But I think your result here really strongly supports that:

Furthermore, this position is also associated with strange deletion patterns (Figure 14), with a 1bp deletion being present in many reads at this position despite this being within a coding sequence.

Yes Sergei, I’d also noticed that the site was conserved in the bat-infecting lineage, suggesting that it may not be hypermutable (although I believe the homopolymeric tract length differs across the phylogeny). However it could also be hypermutabale and the T is deleterious, although I don’t believe that it’s shown any reversion (though this might be hard to conceretely identify)?

We’d been running selection analyses on the outbreak data, and not expecting to see a great deal but noticed some other sites were being flagged as positively selected, due to apparent convergent evolution. Some of these homoplasies are almost definitely being driven by artificial lab recombination.

E.g.:

Both the ORF8 codon 84 and ORF1AB codon 1599 positive selection signals appear to be driven by a single South Korean sample (gisaid accession 413017). This sample possesses two derived mutations either side of a hypothesised breakpoint. Each of these pairs of derived mutations are possessed by different haplotypes (Figure 2). Therefore this 413017 sample appears to be a recombinant between the 413018 haplotype and either of 413513 or 412871. As both 413017 and 413018 were sequenced by the same lab and released at the same time, this recombination event is likely to be an artefactual product of lab cross-contamination.

Figure 2. Alignment of variable sites in the whole genome alignment, with base positions shown above. The recombinant South Korean sample 413017 shows mismatching topologies across its genome, clustering with different South Korean samples either side of the inferred breakpoint. The Wuhan sample 402124 was collected on 30/12/2019, and shows no unique mutations, and should thus represent the ancestral state of the four chosen ingroup samples.

Another likely lab recombinant pair is 413570 and 413574:
The N ORF 292 site was flagged but driven by a similar homoplasy. However, both samples exhibiting the derived mutation I->T mutation were sequenced by the same Dutch lab and released at the same time.

sergei.pond · May 5, 2020, 4:05pm

I think this is the first time in genomic era history where so many people have looked at so few mutations to extract some meaning and tell a story…

Based on the past performance of our attempts to predict functional significance from evolutionary patterns alone, the prior belief that any mutation if functionally significant should be very low indeed. The best we can do, I believe, is to do a “high”/“low” sort for candidate mutations.

ORF1a 3606 is interesting, so I call it a “predictive success”

matthew.parker · May 6, 2020, 6:38pm

Something we wrote as part of a correction to the manuscript including S943P as a true variation:

As part of another study using a method to detect systematic sequencing errors (Freeman et al. 2020)(https://paperpile.com/c/iw1wsP/afZK), we are interrogating the quality of publically available sequencing data and this position was highlighted as suspect. We interrogated this position in the raw data of the Sheffield nanopore sequencing dataset. Although these two variants are not present in the final consensus sequence from any of the Sheffield isolates, the raw, untrimmed mapped sequencing reads show their presence in only one of the Artic amplicons covering the site (Figure1 A&B) in all samples analysed. We noticed that in fact this position is to the left of the 5’ primer of ampicon 81 in what we believe to be the nanopore adapter sequence. Comparison of the MN908947.3 reference and the adapter sequence reveals similarity around this position:

Nanopore adapter sequence:

CAGCACCTT

MN908947.3 reference sequence:

CAGCAAGTT

If this region was not trimmed before calling variants it is conceivable that because this region matches the reference and is not soft clipped by minimap2, the mismatch between the reference and the adapter sequence could be called as a variant.

In the Sheffield dataset validation set (Figure 1B), we see a C present at around 50% of called bases at both these positions in raw data but this region is trimmed by the Artic pipeline and is therefore not used to call variants and contribute to the final consensus sequence. It is also evident that although amplicon overlaps amplicon 81 in this region, there is no evidence for these variants in the data from amplicon 80.

In summary we believe this is an artefact that has arisen due to a combination of improper trimming of adapter and primer regions from raw sequencing reads before downstream analysis and the coincidental homology between the nanopore adapter sequence and the WuHan reference genome in this region.

Figure 1 - Investigation of S943P

A. IGV plot showing bam files from nanopore sequencing data of amplicons produced by the Artic network protocol. Raw data from amplicon 81 contains adapter sequence which has high similarity to the reference genome, apart from the positions which lead to an erroneous S943P mutation call. This region is therefore included in variant calling if trimming is not carried out. Subsequent panels show that this region is soft clipped when trimming adapters and primers and is therefore not available for variant calling. B. Base frequencies at position 24389 in 23 samples from the Sheffield dataset show that C is present in half of the reads in the raw data, but is absent from trimmed and primer trimmed data.

Freeman, Timothy M., Genomics England Research Consortium, Dennis Wang, and Jason Harris. 2020. “Genomic Loci Susceptible to Systematic Sequencing Bias in Clinical Whole Genomes.” Genome Research 30 (3): 415–26.

liam.shaw · May 8, 2020, 3:45pm

Thanks for the detailed report Nicola et al. and the great discussion from others.

We recently finished a similar analysis of recurrent mutations - see [1] and in press at doi: 10.1016/j.meegid.2020.104351). Similar approach: we used MPBoot to get a MP tree and then HomoplasyFinder, before doing some filtering, and also looked at available SRA data.

The filtering approach we came up with attempts remove potentially suspect sites from GISAID assemblies using a combination of metadata and a RaxML tree. The basic principle is the sort of ‘high’ / ‘low’ sort @sergei.pond suggests, providing a list of all recurrent mutations after excluding pre-determined suspect sites, and then a smaller ‘filtered’ list (n=198) that satisfy a set of thresholds based on the following parameters:

Number of isolates with homoplasy
Proportion of isolates with homoplasy which have a nearest neighbour in the RaxML tree with the homoplasy: ranges between 0 (singleton isolates with homoplasy throughout tree) and 1 (clusters of isolates with homoplasy)
Proportion of isolates with the homoplasy which have at least one ‘N’ in the local region around the homoplasy
Number of submitting and originating labs

See the paper and https://github.com/liampshaw/CoV-homoplasy-filtering/ for more detail and the actual values we chose.

I note that the ‘filtered’ list we gave at the time of that publication included the nucleotide positions 24389-24390 i.e. S943P (that is an incredibly good adapter spot @matthew.parker). So such filtering thresholds definitely do not magically remove all sequencing errors!

Hopefully as a community effort we can keep adding to the list of these suspect positions.

References
[1] L. van Dorp et al. “Emergence of genomic diversity and recurrent mutations in SARS-CoV-2” Infection, Genetics and Evolution (2020) 104351 pdf.

NicolaDeMaio · May 28, 2020, 11:17am

Updated analysis with data from 17th May 2020

Nicola De Maio, Conor Walker, Nick Goldman
European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom.

Summary

After our initial post, we worked to improve our analyses. First, we included the new genomes and the metadata available from GISAID on 17th May 2020. Then, following the finding of recurrent sequencing artefacts at positions 24389-24390, we masked these positions, and we paid more attention to variants that are in proximity to each other and linked: such sequencing errors would have more potential to cause topological errors and thus avoid detection and cause misinterpretation of other variants as homoplasic. We also used information from frequent ambiguous bases present in multiple consensus sequences within the same alignment column. These ambiguities usually mark positions that show multiple alleles within the same sample, and therefore can be indicators of recurring within-host mutations or recurring sequencing artefacts.
The new masking recommendations are summarized in our VCF file: see our separate post Masking strategies for SARS-CoV-2 alignments , the VCF file itself https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/problematic_sites_sarsCov2.vcf , and the same file in human readable format https://github.com/W-L/ProblematicSites_SARS-CoV2 .

Highly homoplasic sites

We still advise masking positions 11074, 11083, 16887, 21575 when inferring a phylogeny or testing for positive selection, as it is not clear if variants at these positions emerge from true frequent mutations or from sequencing artefacts. However, because they are highly homoplasic (appearing to emerge many times independently) they have the potential to bias phylogenetic inference, while providing only limited phylogenetic signal.

Sites with ambiguous characters

We find sites that have high proportions of ambiguity characters while also showing patterns of variants and ambiguity that are specific to one country or sequencing lab. The most extreme of these positions are 19484 where 222 ambiguities were observed, and 8022, where 62 ambiguous characters were observed. We suggest masking these positions, because while some level of within-host polymorphism is expected due to within-host mutations, mixed infections and contamination, a high number of ambiguities (in particular at sites with low diversity) can be a sign or recurring mutations or sequencing artefacts.

The full list of sites we recommend masking because of ambiguous characters is:

635 (9 variant bases and 6 ambiguous bases, found particularly in UK sequences by COG-UK)
2091 (9 variant and 6 ambiguous, enriched from India, NCDC)
2094 (6 variant and 6 ambiguous, enriched from Belgium KU Leuven and India NCDC)
3564 (46 variant and 15 ambiguous, enriched from Australia VIDRL)
6869 (4 variant and 12 ambiguous, enriched from Italy Caporale)
8022 (9 variant and 62 ambiguous, enriched from India NCDC, Belgium KU Leuven)
8790 (11 variant and 6 ambiguous, enriched from Australia VIDRL)
10129 (2 variant and 10 ambiguous, enriched from USA California Chan-Zuckerberg Biohub )
13476 (5 variant and 5 ambiguous, enriched from Turkey Ministry of Health)
13571 (6 variant and 14 ambiguous, enriched from Australia VIDRL)
15922 (3 variant and 12 ambiguous, enriched from USA California Chan-Zuckerberg Biohub )
19484 (16 variant and 222 ambiguous, enriched from UK COG-UK)
22335 (3 variant and 11 ambiguous, enriched from UK COG-UK)
24933 (39 variant and 16 ambiguous, enriched from Australia VIDRL).

Homoplasic sites common within one sequencing lab

We also re-ran the analysis of homoplasic sites described in our original post but this time making use of the new data and focusing on variants that show high levels of homoplasy relative to their prevalence, and that are mostly found in samples sequenced from a single lab; these variants have a high probability of being sequencing artefacts rather than the result of recombination, contamination, hypermutability or positive selection, and so masking them should reduce the topological distortion caused by artefacts while sacrificing less true phylogenetic signal. We recommend masking the following sites:

3145 (10 variant, enriched from Netherlands Erasmus)
4050 (8 variant, enriched from Belgium KU Leuven)
5736 (6 variant, enriched from Turkey Ministry of Health)
11535 (22 variant, enriched from Australia VIDRL)
13402 (27 variant, enriched from Belgium KU Leuven)
13408 (3 variant, enriched from Belgium KU Leuven, linked with 13402)
14277 (9 variant, enriched from Australia VIDRL)
24389 (5 variant, enriched from Belgium KU Leuven)
24390 (5 variant, enriched from Belgium KU Leuven)
26549 (12 variant, enriched from Turkey Ministry of Health)
29037 (5 variant, enriched from Australia VIDRL) TCT->TTT
29553 (259 variant, enriched from USA Washington UW or “Seattle Flu Study”)

We observe an enrichment of GGT to GTT substitutions in homoplasic sites specific to the Australian VIDRL (2 out of 3, sites 11535 and 14277) and sites with ambiguities specific to the Australian VIDRL (4 out of 4, sites 3564, 8790, 13571, and 24933). This is very different from the overall mutational trend, where most variants are C to T, but it resembles the G to T skew we observed in within-host variants in our original post. It’s not clear to us if this pattern is a result of RNA degradation, RNA modification, some other biological process, or specific aspects of the chemistry and bioinformatics used by the lab (do you have suggestions @torsten ?), either way we suspect this observation might help to improve the calling of consensus sequences, or to understand and account for SARS-CoV-2 hypermutable sites.

Neighbour linked variants

Finally, we investigated linked, proximal variants (<10bp apart). These, if recurrent sequencing artefacts as for the case of 24389-24390, have the potential to cause substantial phylogenetic errors and bias in downstream analyses (including our own inference of homoplasy). We still mask linked sites we previously identified as problematic (24389-24390, 11074 and 11083, 13402 and 13408; see above), and then we investigated the most prevalent of these apparent mutational patterns, 28881-28883 (see our original post). We found no evidence of possible sequencing artefacts at these positions from read data, and we found that almost all samples with mutations at 28881-28883 cluster closely together (see e.g. Figure 1); this is true also after masking some of these positions. The few exceptions are probably just rare cases of contamination or recombination. We conclude that, apart from a few possible samples where the minor variant at 28881-28883 was acquired through recombination or contamination, 28881-28883 likely represents a true, single multi-nucleotide mutational event, and we do not recommend masking these positions anymore. The only exception might be in phylodynamic analysis, where it might make sense to mask only 2 out of these 3 sites to account for the fact that these likely result from a single mutational event.

Figure 1: extract of a phylogenetic network obtained from SARS-CoV-2 sequences in POPART [1]. This is an extract of a simplified network obtained by considering only high-frequency variants (those appearing in at least 50 sequences). The top right corner contains all the sequences with variants at positions 28881-28883 (notice the three edges with 3 substitutions separating these sequences from the rest of the network). Node names show the identifier of one of the sequences with the given haplotype, followed by the number of such sequences. Nodes representing fewer than 3 sequences have not been included.

Other linked, proximal variants were observed at a much lower prevalence, and so we decided to mask these (only in the samples which carried them) to prevent possible biases while only sacrificing limited phylogenetic signal; it would make sense to further investigate them individually in the future, as we did for 28881-28883.
Some of the most remarkable of these sites (due to involving at least 3 nearby perfectly or almost perfectly linked substitutions and sometimes from a single sequencing laboratory) were:

6971, 6975, 6977 (found in 6 samples from Australian VIDRL)
13512, 13513, 13514 (found 2 samples)
13686, 13687, 13693 (found 2 samples)
21302, 21304, 21305 (found in 3 samples)
27382-27384 (found in 4 samples)
28004-28006, -280089 (found in 3 samples from Luxemburg)

We also think it is worth further investigating positions 6309, 6310 and 6312 (and of these in particular 6310), as these form a cluster of nested variants showing some similarity with the 11074-11083 cluster (in this case these variants extend a poly-A rather than a poly-T).

Other changes

We made other changes to our analyses that do not affect our proposed masked sites, but that affect the alignment we work with. In particular, we now mask the first and last 30 non-ambiguous non-gap characters of each sequence, in order to mask most alignment errors without automatically masking large regions at the ends of the genome. We also mask 7bp surrounding missing parts of each sequence, again to avoid errors attributable to alignment and low coverage.

Recombination and contamination

We further investigated the possible presence of recombination using a phylogenetic network approach (see Figures 1 and 2). We found no clear evidence of recombination; however, there are many loops in the network. After removing highly homoplasic sites, we still observe many of these loops, which do not seem to be caused by highly homoplasic mutations. Because these loops seem to be present mostly in the terminal parts of the network and involve sequences with very few representatives, it seems most likely that these are the results of contamination, where the genomes of two different strains contribute to a single consensus sequence (but we cannot exclude recombination events near terminal nodes that would result in the same observed patterns). Contamination/recombination not only creates topological uncertainty, but is likely to cause biases in estimation of mutation rate, homoplasy, and selection.

Figure 2: Extract of phylogenetic network (see Fig. 1 for details). In this area of the network, several loops can be noticed, despite this representing a very simplified version of the full network.

Acknowledgements

We would like to thank all the authors who have kindly deposited and shared genome data on GISAID (https://www.gisaid.org/). A full table of acknowledgments for GISAID contributors can be found here: https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/data/acknowledgements_GISAID.zip

References

[1] Leigh, Jessica W., and David Bryant. “POPART: full‐feature software for haplotype network construction.” Methods in Ecology and Evolution 6.9 (2015): 1110-1116.

n_j_loman · May 28, 2020, 11:41am

Great updated analysis Nicola. I think it correlates well with what we have seen in COG-UK. I note that the two positions you mention relate to positions within amplicons 64 and 74 of the ARTIC primer scheme: these are the amplicons that most notoriously drop out of our amplification scheme, which makes perfect sense.

goldman · July 14, 2020, 3:14pm

Updated analysis with data from 12th June 2020

Nicola De Maio^1*, Landen Gozashti², Yatish Turakhia², Conor Walker¹, Robert Lanfear³, Russell Corbett-Detig^2*, and Nick Goldman¹

¹European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
²Department of Biomolecular Engineering and Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
³Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT, Australia
^*[email protected], [email protected]

Purpose: The purpose of this post is to update our masking recommendations by combining the methods from our original post above with those from [1], and to provide reliable and automated methodology for the identification of positions that for one reason or another may confound evolutionary interpretations in the SARS-CoV-2 genome. The updated masking recommendations, as usual, can be found here: https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/problematic_sites_sarsCov2.vcf, in the format discussed here: Masking strategies for SARS-CoV-2 alignments.

Background: We and others have identified several variants in SARS-CoV-2 sequencing datasets that have the potential to adversely affect phylogenetic and evolutionary inference. For details, see our previous posts in this thread and bioRxiv preprint [1]. In many cases, these apparent mutations are almost certainly sequencing or data processing artefacts. Other positions that we identify seem instead to reflect hypervariable positions in the SARS-CoV-2 genome or multi-nucleotide mutation events. Even such “real” phenomena might adversely impact biological inferences, and therefore need caution. Here we merge methodology and recommendations obtained from approaches for identifying these “problematic positions” described in previous posts in this thread with those of [1]. We describe our methods and rationale for identifying and categorizing problematic positions and we provide updated masking recommendations in our VCF file (see Masking strategies for SARS-CoV-2 alignments). To facilitate interpretation and cross-referencing against genomic features the updated recommendations are also available as a track on the UCSC SARS-CoV-2 Genome Browser [2] and the ENSEMBL COVID-19 Genome Browser [3].

Data and Phylogenetic Inference: To update our masking recommendations in the context of (almost) all available SARS-CoV-2 genome sequence variation, we used 39,839 sequences available in the GISAID database as of 12 June 2020. Full details of the inferred sequence alignment and phylogenetic tree are given at https://github.com/roblanf/sarscov2phylo/releases/tag/12-6-20. Our pipeline filters out sequences that are short (<28kbp) or highly ambiguous (>1000 ambiguous characters), those that are excessively divergent according to TreeShrink [4], and uses other filters including some mentioned in the earlier comments in this post. Tree topologies are inferred using FastTree2 [5]. This dataset is much larger than those used in our previous analyses, and allows us more power to identify problematic positions.

Approach for Identifying Problematic Positions: One of our approaches for identifying problematic positions was described in detail previously [1]. Briefly, this method looks for highly homoplasic positions and asks whether they are strongly associated with a single sequencing laboratory. We define “lab-association” as cases where 80% or more of the instances of a variant allele come from a single lab, but we also manually explored positions with high parsimony score: alternate allele frequency ratios that are not strongly associated with a single lab. Manual curtation identified several positions where two or more labs contribute similar sequencing artefacts and future versions of our automated pipeline will be updated to detect these as well.

Here we implemented three important modifications relative to this method. First, because the Nextstrain pipeline (on which the analysis of [1] was based) resolves all ambiguities, this method did not consider ambiguous basecalls. These can be a valuable signature of potentially errant alleles, and sometimes enable accurate identification of problematic positions with low minor allele frequencies (see previous updates to this thread). To leverage the signatures of ambiguous basecalls, we performed an analysis after resolving all ambiguous calls as the alternative allele (“alternative resolved”), and one analysis where we resolved all ambiguous calls by parsimony (selecting the reference allele where there are equally parsimonious configurations: “parsimony resolved”). Positions that we identify as highly homoplasic only in the “alternative resolved” dataset and not in the “parsimony resolved” dataset would be “highly_ambiguous” below, but not “highly_homoplasic” as the difference in detection reflects the elevated rate of ambiguous basecalls at a position. We note that because ambiguous positions are ignored by FastTree2 [5] they typically result in extremely high parsimony scores in our “alternative resolved” analysis even relative to other non-heritable variation.

We also added a “country-association” module to our pipeline, using the same methods we previously employed for detecting “lab-associated” variants [1]. In our previous work we attempted to correct the various errors and inconsistencies in “originating lab” and “submitting lab” metadata entries using a simple string comparison approach (https://github.com/lgozasht/COVID-19-Lab-Specific-Bias-Filter). However, in the vastly expanded dataset we used here, there are many cases where the lab description varies too much between metadata entries precluding our identifying all samples from a single sequencing lab. In contrast, we have found that “Country” metadata entries are more reliably uniform. By detecting “country-associated” variants, we sometimes recovered “lab-associated” variants.

We originally used a much smaller dataset to develop our methods and some of our heuristic cutoffs required tuning on this vastly expanded dataset. In particular, in more recent datasets some sequencing groups (e.g. COG-UK) have produced disproportionately more data. Our previous methods sometimes yielded results that we believe are false positives, when real but slightly homoplasious mutations recur in samples from a single group. We resolve this with a second modification to our approach. Briefly, we used the parsimony score, the minimum number of mutations required to generate the sample genotypes on a given phylogeny, to summarize how frequently a position is mutated during SARS-CoV-2 evolution. If the parsimony score is high relative to the minor allele frequency, this suggests that the mutation is not heritable. We required a parsimony score:minor allele frequency ratio of at least 0.5 to identify a potentially problematic position. The ratio we selected implies that each inferred clade with the alternative allele has at most two descendants on average and is therefore unlikely to be a heritable mutation.

The software for the methods above is available from https://github.com/yatisht/strain_phylogenetics (see in particular the C++ branch), and https://github.com/lgozasht/COVID-19-Lab-Specific-Bias-Filter. We developed our masking recommendations so that they would fit most users and most versions of SARS-CoV-2 alignments without requiring additional analyses and filtering; however, in some particular cases (for example if the alignment near genome ends is of interest, or if extremely conservative masking is required), the methods above and in https://github.com/roblanf/sarscov2phylo could be adapted for this purpose.

Types of Evidence Considered: There are four forms of evidence that we use to identify and categorize problematic positions in SARS-CoV-2 sequencing data. These have formed the basis of some of our previous masking recommendations in this thread and in [1], but we reiterate them here briefly to facilitate interpretation of our recommendations.

homoplasic/highly_homoplasic. These are positions with alternate alleles that reoccur on many distinct branches of the SARS-CoV-2 phylogeny.
single_src/narrow_src. Evidence here consists of strong associations between alternate alleles at a given position and specific sequencing labs.
ambiguous/highly_ambiguous. Many positions show a strong excess of ambiguous basecalls relative to the number of alternative alleles. Most highly ambiguous positions are strongly lab-associated and likely reflect systematic errors that are found in a portion of reads for a given sample.
neighbour_linked. Alternative alleles at nearby positions that show strong associations across samples (i.e. linkage-disequilibrium for nearby pairs of low-frequency alleles).

We combine the evidence from these four signals to provide the masking recommendations presented below. In some instances, we also accept additional evidence from the sequencing laboratories or from sequencing read files to identify problematic positions.

Recommendations: To facilitate accurate interpretation of SARS-CoV-2 genome variation, we provide two types of recommendation. First, we designate positions that we believe will adversely affect most analyses as mask, that is, we recommend the removal of all sample genotypes at these positions for essentially all analyses. Second, positions that might adversely affect some biological inferences, but which might be suitable for other analysis are labeled as caution. Below we describe some common categories of problematic positions and explain our recommendations. We note that patterns of variation at some positions are not completely cut-and-dry and we sometimes use our judgement to determine what label is best.

Lab-Associated Errors: Most of the positions that we recommend masking are those that are homoplasic/highly_homoplasic and either single_src or narrow_src. These positions are likely artifactual variants that result from lab-specific data producing or processing. For example, some of these variants likely result from incorrect trimming of adapter or primer sequences, and at least one results from alignment of human rRNA to the SARS-CoV-2 reference genome (position 22802: [1]).

We note that in some cases, lab-associated errors have been corrected, but we still believe the positions should be masked. For example, sequences containing artefactual alleles at positions 4050 and 22802 have been corrected as of mid-June 2020. However, analyses based on earlier downloads from the GISAID database might still contain these variants — the GISAID accession numbers do not change so it is challenging to be certain which version of the sequence a given dataset might include — and for this reason we recommend continuing to mask these positions, but we additionally mark these positions as amended. If a user is certain that their alignments are based on corrected sequences, masking may not be necessary for some of the positions that we have so identified.

Hypermutable Positions: We also recommend masking many positions that are highly_homoplasic but which are not associated with a single sequencing group. This includes positions 11074, 11083 and others which we termed “extremal” in previous work [1]. Most of these positions contain C→T mutations and many are adjacent to poly-T stretches in the genome. Both of these features appear to be associated with hypermutability in the SARS-CoV-2 genome [1]. Given the exceptional mutation rates of these positions combined with the limited phylogenetic signal in the entire SARS-CoV-2 alignment, including these positions has strong potential to affect phylogenetic inference and we recommend that they be masked as well for most analyses. However, these positions might not require masking if a given nucleotide substitution model can accommodate hypermutability, context-dependency, and the possibility of strand-specific mutation effects (i.e. C→T mutation rates differ from G→A).

Highly Ambiguous Positions Often Suggest Systematic Errors: In general, positions with many ambiguous basecalls are not necessarily problematic for phylogenetic and other biological analyses, but positions with many ambiguities can provide evidence to suggest that there is a systematic error that could affect sequences. While it is possible that polymorphisms identified from within-host samples are transmitted and therefore appear as ambiguities in related samples, we emphasize that the highly_ambiguous positions that we identified are always highly homoplasic, consistent instead with them being systematic errors. For positions where we find only ambiguous basecalls (ambiguous and single_src) and no alternative alleles, we typically indicate caution. However, for positions with extremely high rates of ambiguous genotypes (highly_ambiguous), we sometimes suggest masking. When there are also alternative alleles (non-ambiguous characters different from the reference) at a given highly_ambiguous position, and particularly when the sequences containing alternative alleles and ambiguous alleles are single_src, we recommend mask.

Locally-Linked Positions Suggest Multi-Nucleotide Mutations or Correlated Error: Highly correlated locally linked positions might reflect multi-nucleotide mutation (MNM) events (for which we suggest caution) or they may occur due to correlated errors (for which we often observe positions to be homoplasic and single_src and we suggest mask). For example, the correlated errors at positions 24389 and 24390 likely result from alignment of a nanopore adapter sequence that was not trimmed from the reads prior to assembly (see [1,6] and comment above by @matthew.parker). Such correlated errors are particularly problematic because the site pattern at each position is typically modelled as independent during phylogenetic reconstruction and therefore have more pronounced effects on phylogenetic inference [1]. Even real MNM events might strongly affect phylogenetic inferences for similar reasons, and might cause false signals of positive selection [7]. We therefore think these positions should also be subject to scrutiny, and we suggest caution with the linked variants that we identified.

Comprehensiveness of Masking Recommendations: As we described above, updates to GISAID sequences take place without modification of accession numbers or any overt record. Although removing errors is clearly important, this also makes it impossible to provide comprehensive masking recommendations for all possible downloads from the GISAID database. We therefore suggest that it may often be valuable to conduct analyses similar to ours on a given dataset (tools to do this are provided at https://github.com/yatisht/strain_phylogenetics and https://github.com/lgozasht/COVID-19-Lab-Specific-Bias-Filter). Nonetheless, we are confident that most of the highly problematic positions in currently available SARS-CoV-2 sequence data have been identified by our analyses. We will continue to update these recommendations going forward using these approaches and future updates of the global SARS-CoV-2 phylogeny from https://github.com/roblanf/sarscov2phylo. That phylogeny, in turn, is always based on the latest recommendations of positions that should be masked in all analyses contained at https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/subset_vcf/problematic_sites_sarsCov2.mask.vcf.

Acknowledgements: We remain grateful to all the authors who have kindly deposited and shared genome data on GISAID (https://www.gisaid.org/). A full table of acknowledgments for GISAID contributors can be found here: https://github.com/roblanf/sarscov2phylo/tree/master/acknowledgements.

References:

[1] Turakhia Y, Thornlow B, Gozashti L, Hinrichs AS, Fernandes JD, Haussler D, et al. 2020. Stability of SARS-CoV-2 phylogenies. https://doi.org/10.1101/2020.06.08.141127.

[2] Fernandes JD, Hinrichs AS, Clawson H, Gonzalez JN, Lee BT, Nassar LR, et al. 2020. The UCSC SARS-CoV-2 Genome Browser. https://doi.org/10.1101/2020.05.04.075945.

[3] ENSEMBL COVID-19 Genome Browser. https://covid-19.ensembl.org/index.html. Accessed 14 July 2020.

[4] Mai U, Mirarab S. 2018. TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics 19: 272.

[5] Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One. 5: e9490.

[6] Korber B, Fischer W, Gnanakaran SG, Yoon H, Theiler J, Abfalterer W, Foley B, Giorgi EE, Bhattacharya T, Parker MD, Partridge DG. 2020. Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. https://doi.org/10.1101/2020.04.29.069054.

[7] De Maio N, Holmes I, Schlötterer C, Kosiol C. 2013. Estimating empirical codon hidden Markov models. Molecular Biology and Evolution 30: 725–736.

NicolaDeMaio · July 29, 2020, 10:46pm

We have now updated the latest analyses using data from the 22nd of July 2020.
As usual, the current file with our masking recommendations is available here.

NicolaDeMaio · December 22, 2020, 12:32pm

Updated analysis with data from 13th November 2020

Landen Gozashti^1,2, Conor Walker^3,4, Nick Goldman³, Russell Corbett-Detig^2*, and Nicola De Maio^3*,

¹Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
²Department of Biomolecular Engineering and Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
³European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
⁴Department of Genetics, University of Cambridge, Cambridge, United Kingdom
^*[email protected], [email protected]

Purpose: The purpose of this post is to update our masking recommendations and provide some insight on modifications to our automated methodology for identification of erroneous sites in the SARS-CoV-2 genome. Our updated masking recommendations are still maintained here in the format discussed here. Our systematic pipeline for erroneous site detection can be found here.

Data and Phylogenetic Inference: We used 147,284 sequences available from the GISAID database as of 13 November 2020 to update our masking recommendations. This dataset constitutes nearly all available SARS-CoV-2 sequence variation and empowers more precise detection of erroneous sites than our previous efforts. The inferred sequence alignment and phylogenetic tree of these sequences are found here [2]. We used the same pipeline for filtering and phylogenetic inference as described in our previous post.

Updated Methodology: We initially employed an approach described in [1] to systematically detect possible erroneous variants in SARS-CoV-2 sequences with additional modifications on which we elaborate in our previous post. One of these modifications included a minimum Parsimony Score:Minor Allele Count (MAC) ratio of 0.5. This ratio requires that each inferred clade with the alternative allele at a given site possesses at most two descendants on average and is flagged as a suspicious variant. We manually reanalyzed the effect of this ratio on our systematic output and found that a minimum ratio of 0.5 is no longer adequate, as we miss some variants that appear problematic. Because sequencing centers typically include samples from nearby areas, and due to travel restrictions, we may expect that many samples from a given sequence center are closely related. Then, the addition of systematic errors groups the samples more closely on the phylogeny during tree-building and drives down the inferred parsimony score. This could result in the lower Parsimony Score:MAC ratios that we observe.

As an example, at site 23122 we observe 160 of 177 ambiguous alternate allele calls, nearly all of which stem from a single laboratory. However, this site yields a parsimony score/MAC ratio of ~0.44 and is thus ignored by our previous methods. In light of this, after careful manual curation across all sites regardless of the parsimony score:MAC, we have now reduced this minimum ratio to 0.3 which appears to capture most of the apparently problematic sites in the full dataset. Additional changes may be necessary as new SARS-CoV-2 data accumulates and we will be mindful of this in the future.

Systematic Identification of Locally Linked Variants: We also developed a pipeline for systematic identification of linked variants (available here), which we now apply in conjunction with our other methods. Our pipeline uses plink2 to perform pairwise R² calculations for positive linkage disequilibrium between each variant and other variants within a given region of the genome. We limit this search to pairs of sites within 10bp of one another for computational efficiency. Using this method, we find several new locally linked sites which we annotate in our masking recommendations. Such problematic sites are expected to be particularly challenging for phylogenetic inference algorithms which typically treat phylogenetic patterns at each site as independent.

Acknowledgements: We are very grateful to GISAID and all the groups who shared their sequencing data. A full list of acknowledgments is available here.

References

[1] Turakhia Y, De Maio N, Thornlow B, Gozashti L, Lanfear R, Walker C, Hinrichs AS, et al. 2020. Stability of SARS-CoV-2 phylogenies. https://doi.org/10.1371/journal.pgen.1009175

[2] Lanfear, Rob (2020). A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo DOI: 10.5281/zenodo.3958883

NicolaDeMaio · April 13, 2021, 9:58am

Updated analysis with data from 4 March 2021

Landen Gozashti¹, Conor R. Walker^2,3, Robert Lanfear⁴, Nick Goldman², Nicola De Maio² and Russell Corbett-Detig⁵

¹Department of Organismic & Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA

²European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom

³Department of Genetics, University of Cambridge, Cambridge, United Kingdom

⁴Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT, Australia

⁵Department of Biomolecular Engineering and Genomics Institute, University of California Santa Cruz. Santa Cruz, CA, USA

Purpose: The purpose of this post is to update our masking recommendations and provide some insight on modifications to our automated methodology and software performance for identification of error-prone sites in the SARS-CoV-2 genome in light of the exponentially increasing number of available SARS-CoV-2 sequences. Our updated masking recommendations are still maintained here in the format discussed here. Our systematic pipeline for erroneous site detection can be found here (commit 04ee2ef for these analyses).

Data and Phylogenetic Inference: We used 428,865 sequences available through the GISAID database as of 4 March 2021 to update our masking recommendations. This dataset comprises nearly double the number of sequences used in our most recent previous update. The inferred phylogenetic tree of these sequences and metadata for each respective sample is available through GISAID. We used the same filtering pipeline as described in the previous post.

Updated Performance: The number of available SARS-CoV-2 sequences is growing at an exponential rate. The already massive datasets available on GISAID have presented numerous challenges for bioinformatic tools and analyses due to towering runtime and memory requirements [1]. In light of this, we updated our systematic pipeline for problematic site detection to boost performance, primarily focusing on parallelizing our lab association analyses. We achieve approximately a fivefold increase in efficiency when parallelizing our associations across just 5 processes, ensuring that we can still update recommendations in a timely manner even when considering 500,000+ SARS-CoV-2 samples.

Updated Methodology: In our previous post, we reduced our minimum Parsimony Score:Minor Allele Count (MAC) ratio to 0.3 to ensure that we capture most problematic sites within GISAID’s full sequence dataset as of 13 November 2020. We reassessed the feasibility of this ratio with the 4 March 2021 dataset and do not find any new problematic sites exhibiting Parsimony Score:MAC ratios less than 0.3, suggesting that it is still adequate. However, in light of the aforementioned runtime requirements, we plan to impose a minimum parsimony score requirement of 10 in the future. We only observe one cautionable site with a parsimony score < 10 in the 4 March 2021 dataset, and thus performed unnecessary associations at 6,893 sites. Average parsimony score increases with sample size, and imposing a minimum parsimony drastically reduces runtime by limiting the number of performed associations at sites that are unlikely to be problematic.

Observations: Many newly added sites in our updated recommendations comprise a substantial number of ambiguous nucleotide calls associated with sequences from the Victorian Infectious Diseases Reference Laboratory (VIDRL). Notably, some of these sites are particularly challenging to detect due to inconsistencies in “submitting lab” and “originating lab” categories of GISAID’s metadata, which can cause a group of samples to appear as if they are associated with multiple labs when in truth they all stem from one [2]. We use “originating country” as a rough proxy for detecting these errors (see here). However, such errors are becoming more challenging to correct systematically as they rise in frequency and diversity with increased sample size, highlighting the need for careful attention to metadata standardization and the development and application of tools for metadata correction.

References

RussCorbett · November 10, 2021, 8:31pm

Systematic Errors Associated with Some Implementations of ARTIC V4 and a Fast Workflow to Prescreen Samples for New Problematic Sites

Theo Sanderson^1,2,*, Nicola De Maio³, Angie S. Hinrichs⁴, Adriano de Bernardi Schneider^4,5, Conor Walker³, Nick Goldman³, Yatish Turakhia ⁶, Robert Lanfear ⁷, Russell Corbett-Detig^4,5,*

¹Francis Crick Institute, London, UK
²Wellcome Sanger Institute, Hinxton, Cambridgeshire, UK
³European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
⁴Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
⁵Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
⁶Department of Electrical and Computer Engineering, University of California San
Diego, San Diego, CA, USA
⁷Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT, Australia

^* [email protected], [email protected]

Systematic Errors Associated with Some Implementations of ARTIC V4

The ARTIC V4 primers [1] are an important update to the primer scheme that has driven most sequencing during the SARS-CoV-2 pandemic. V4 resolves a number of amplification difficulties in ARTIC V3 introduced by mutations in SARS-CoV-2 variants of concern, including Delta. However, two genomic artifacts have been observed in substantial quantities from some implementations of the ARTIC V4 primers. Primer v4:93_LEFT, in addition to its intended binding site, also shows homology to a region within amplicon 51 where it can prime a secondary amplicon (in a pair with correctly-binding primer v4:51_RIGHT). Due to a difference between the primer and the region to which it binds, this creates the appearance of a T15521A mutation, which would encode orf1ab:F5086Y (a.k.a. ORF1b:F685Y). Similarly, primer v4:41_LEFT shows additional homology to a region within amplicon 29, where it can prime a secondary amplicon (in a pair with correctly-binding primer v4:29_RIGHT), amplifying a truncated version of amplicon 29 that incorporates an apparent T8835C SNP, from a mismatch within the primer, creating the appearance of an orf1ab:V2857A mutation. In each case, a 5’ mismatching region of the primer is softclipped during mapping. Since these are not standard primer sites, these primer-induced artefacts will survive basic primer trimming.

Figure 1. Mispriming with v4:93_LEFT at a region within amplicon 51 can create a systematic error. The apparent T15521A substitution created by the presence of this primer-derived sequence is circled. Odd and even numbered primers are used in distinct pools.

Solutions to avoid this issue will depend on the workflow used. For sequencing chemistries where one read – or one read-pair – corresponds to an entire amplicon, filtering based on insert-size, or on the presence of canonical primer sequences at both ends, will be helpful. Such an approach is already implemented in the ARTIC Network’s pipeline for nanopore sequencing, where guppyplex filters on amplicon size and align_trim filters on the presence of correct primers. The most challenging case to solve is where amplicons are fragmented prior to short-read sequencing, given that the misprimed amplicon that truncates amplicon 29 has a length >200bp. Removing soft-clipped reads may be of some assistance here. In the longer term, future primer schemes will be screened to avoid potential mispriming sites. These issues, and others observed in some implementations of the V3 primers [2–4], highlight the need for vigilance against technical artefacts that can distort phylogenetic trees.

Due to the frequency at which T15521A and T8835C errors are currently observed, we have added these positions to our masking recommendations for SARS-CoV-2 alignments: problematic_sites_sarsCov2.vcf.

A Workflow for Prescreening New Samples for Systematic Errors

We developed a workflow that can be used to prescreen samples for previously undescribed possible systematic errors. Briefly, the idea is to add newly sequenced samples to the public tree that we maintain [5], and to identify specific variants at individual sites whose parsimony score increased the most as a consequence of adding the new samples. This might be useful for users when evaluating samples produced on a new sequencing run, new wet lab or bioinformatic protocols, or a new batch of sequencing or library preparation reagents. This workflow is intended to facilitate identification of potential systematic errors or otherwise problematic positions in the genome, but it is not a definitive analysis. The results should be accompanied by detailed interpretation of the specific sites that the workflow identified.

To demonstrate the utility of this workflow, we recreated the conditions where it could be used to flag the likely systematic errors that appear to be associated with the new ARTIC V4 primer set. To do this, we pruned the approximately 10,000 samples sequenced at the Wellcome Sanger Institute using the V4 primer scheme that had already been included in the public UShER mutation annotated tree from 10/28 [5]. The idea is that we can recreate a use case for this workflow by analyzing this set of samples. We then analysed all the pruned sequences found in the current COG-UK dataset: aligning them to the reference (NC_045512.2), masking previously known problematic sites, and placing them onto this pruned tree. Exact commands to reproduce this analysis are available from this github repository. Running this workflow identified the likely problematic positions as specific variants at individual sites whose parsimony score increased the most as well as apparent back mutations at those same positions after adding the affected samples.

variant	parsimony_public	parsimony_total	parsimony_user	parsimony_public/sample	parsimony_user/sample
T15521A	1244	3246	2002	0.002629475199	0.4041922989
T8835C	880	1586	706	0.0005531057063	0.2280701754
A15521T	10	100	90	0.0003912644868	0.08042834359
G72A	22	97	75	4.45E-06	1.03E-02
G22032T	56	128	72	9.78E-06	8.54E-03
A22030G	1	72	71	2.49E-05	8.20E-03
A26766C	16	81	65	4.45E-07	8.09E-03
A28248G	44	88	44	7.11E-06	7.40E-03
C8835T	4	42	38	1.96E-05	5.01E-03
T27904G	65	88	23	1.78E-06	4.33E-03

Table 1: The first ten lines of this workflow’s output includes variants whose parsimony score increased the most due to the addition of the user’s samples. Forward and reverse mutations at the two apparently problematic positions are identified.

Interpretation Guidelines

Here we provide some basic guidelines for using this workflow to prescreen newly produced data. Although our workflow can help to identify problematic positions, it may not be obvious what the source of apparently high parsimony score variants at a specific site is. We caution that many factors may influence the resulting parsimony score increases. For example, the phylogenetic diversity within the dataset examined and within a local region will affect our ability to identify such apparent variants. Additionally, because other problematic sites can distort phylogenetic inference, true mutations may display apparently elevated parsimony scores when samples containing these alleles are placed incorrectly on the phylogeny.

Due to the range of factors that can affect these results, we recommend that users of this workflow treat it as only a first pass analysis. Users may wish to run the workflow, identify the variant that generates parsimony score increase and examine the short read alignment files underlying such positions. If this appears to be a systematic error, users should update their workflow to repair this variant call or simply mask the position and rerun the workflow on the updated sequences to determine if other sites are indeed potentially problematic.

Running the Workflow

The ideal use of this workflow would be to run it before uploading samples to sequence repositories so as to identify potential problems in the sequences before sharing them. To run this workflow on your own data, provide the sequence data as a multi-fasta file and run the following commands (dependencies conda, mamba, snakemake, git and a fasta file):

git clone https://github.com/yatisht/usher.git
cd usher/workflows/
snakemake --use-conda --cores [num_threads] --config RUNTYPE=”systematic” FASTA=”[full/path/to/fasta/file]”

The output table, parsimony_report.txt, will contain the variants whose parsimony score increased as a result of adding the user’s samples. We emphasize that this is intended only to flag potentially problematic positions and we anticipate that users may want to deeply investigate sites that have especially large impacts on the estimated parsimony score.

Exact commands to reproduce the analysis we showcased here are available via github. This is slightly different from above because we needed to extract the Sanger samples produced using the ARTIC V4 primers.

Acknowledgements

We are grateful to Moritz Gerstung, @babarlelephant, Alex Selby, Jeffrey Barrett, Rob Davies, Nick Loman and Joshua Quick for assistance with understanding ARTIC V4 issues and possible solutions. We thank the generators of sequences analysed here: the COG-UK Consortium, the Wellcome Sanger Institute COVID-19 Surveillance Team, and all submitters to the INSDC databases.

References

SARS-CoV-2 version 4 scheme release - Laboratory - ARTIC Real-time Genomic Surveillance
Sanderson T, Barrett JC. Variation at Spike position 142 in SARS-CoV-2 Delta genomes is a technical artifact caused by dropout of a sequencing amplicon. medRxiv. 2021; 2021.10.14.21264847.
lcerutti, j_quick, connortr, arambaut, n_j_loman, vborges, et al. Missing G21987A mutation in SARS-CoV-2 delta variants due to non-specific amplification by ARTIC v3 primers. 20 Oct 2021. Available: https://pando.tools/t/missing-g21987a-mutation-in-sars-cov-2-delta-variants-due-to-non-specific-amplification-by-artic-v3-primers/764
Davis JJ, Wesley Long S, Christensen PA, Olsen RJ, Olson R, Shukla M, et al. Analysis of the ARTIC version 3 and version 4 SARS-CoV-2 primers and their impact on the detection of the G142D amino acid substitution in the spike protein. bioRxiv. 2021; doi:10.1101/2021.09.27.461949
McBroome J, Thornlow B, Hinrichs AS, Kramer A, De Maio N, Goldman N, et al. A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees. Mol Biol Evol; 2021. doi:10.1093/molbev/msab264

Sam_W · April 5, 2022, 12:37pm

Hi, in regards to our post about erroneous mutations in ARTIC V4/4.1 we have now discovered a significantly higher number of affected genomes when ambiguous bases are considered (31,567 in COG-UK dataset). Assuming sequencing centres use the updated versions of the scheme BED file new sequences should not be affected but I think you should consider adding the following positions to the problematic sites mask:

19209 G/K
19210 G/R
19212 G/R
19214 G/R
19217 A/M