Issues with SARS-CoV-2 sequencing data

goldman · July 14, 2020, 3:14pm

Updated analysis with data from 12th June 2020

Nicola De Maio^1*, Landen Gozashti², Yatish Turakhia², Conor Walker¹, Robert Lanfear³, Russell Corbett-Detig^2*, and Nick Goldman¹

¹European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
²Department of Biomolecular Engineering and Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
³Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT, Australia
^*[email protected], [email protected]

Purpose: The purpose of this post is to update our masking recommendations by combining the methods from our original post above with those from [1], and to provide reliable and automated methodology for the identification of positions that for one reason or another may confound evolutionary interpretations in the SARS-CoV-2 genome. The updated masking recommendations, as usual, can be found here: https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/problematic_sites_sarsCov2.vcf, in the format discussed here: Masking strategies for SARS-CoV-2 alignments.

Background: We and others have identified several variants in SARS-CoV-2 sequencing datasets that have the potential to adversely affect phylogenetic and evolutionary inference. For details, see our previous posts in this thread and bioRxiv preprint [1]. In many cases, these apparent mutations are almost certainly sequencing or data processing artefacts. Other positions that we identify seem instead to reflect hypervariable positions in the SARS-CoV-2 genome or multi-nucleotide mutation events. Even such “real” phenomena might adversely impact biological inferences, and therefore need caution. Here we merge methodology and recommendations obtained from approaches for identifying these “problematic positions” described in previous posts in this thread with those of [1]. We describe our methods and rationale for identifying and categorizing problematic positions and we provide updated masking recommendations in our VCF file (see Masking strategies for SARS-CoV-2 alignments). To facilitate interpretation and cross-referencing against genomic features the updated recommendations are also available as a track on the UCSC SARS-CoV-2 Genome Browser [2] and the ENSEMBL COVID-19 Genome Browser [3].

Data and Phylogenetic Inference: To update our masking recommendations in the context of (almost) all available SARS-CoV-2 genome sequence variation, we used 39,839 sequences available in the GISAID database as of 12 June 2020. Full details of the inferred sequence alignment and phylogenetic tree are given at https://github.com/roblanf/sarscov2phylo/releases/tag/12-6-20. Our pipeline filters out sequences that are short (<28kbp) or highly ambiguous (>1000 ambiguous characters), those that are excessively divergent according to TreeShrink [4], and uses other filters including some mentioned in the earlier comments in this post. Tree topologies are inferred using FastTree2 [5]. This dataset is much larger than those used in our previous analyses, and allows us more power to identify problematic positions.

Approach for Identifying Problematic Positions: One of our approaches for identifying problematic positions was described in detail previously [1]. Briefly, this method looks for highly homoplasic positions and asks whether they are strongly associated with a single sequencing laboratory. We define “lab-association” as cases where 80% or more of the instances of a variant allele come from a single lab, but we also manually explored positions with high parsimony score: alternate allele frequency ratios that are not strongly associated with a single lab. Manual curtation identified several positions where two or more labs contribute similar sequencing artefacts and future versions of our automated pipeline will be updated to detect these as well.

Here we implemented three important modifications relative to this method. First, because the Nextstrain pipeline (on which the analysis of [1] was based) resolves all ambiguities, this method did not consider ambiguous basecalls. These can be a valuable signature of potentially errant alleles, and sometimes enable accurate identification of problematic positions with low minor allele frequencies (see previous updates to this thread). To leverage the signatures of ambiguous basecalls, we performed an analysis after resolving all ambiguous calls as the alternative allele (“alternative resolved”), and one analysis where we resolved all ambiguous calls by parsimony (selecting the reference allele where there are equally parsimonious configurations: “parsimony resolved”). Positions that we identify as highly homoplasic only in the “alternative resolved” dataset and not in the “parsimony resolved” dataset would be “highly_ambiguous” below, but not “highly_homoplasic” as the difference in detection reflects the elevated rate of ambiguous basecalls at a position. We note that because ambiguous positions are ignored by FastTree2 [5] they typically result in extremely high parsimony scores in our “alternative resolved” analysis even relative to other non-heritable variation.

We also added a “country-association” module to our pipeline, using the same methods we previously employed for detecting “lab-associated” variants [1]. In our previous work we attempted to correct the various errors and inconsistencies in “originating lab” and “submitting lab” metadata entries using a simple string comparison approach (https://github.com/lgozasht/COVID-19-Lab-Specific-Bias-Filter). However, in the vastly expanded dataset we used here, there are many cases where the lab description varies too much between metadata entries precluding our identifying all samples from a single sequencing lab. In contrast, we have found that “Country” metadata entries are more reliably uniform. By detecting “country-associated” variants, we sometimes recovered “lab-associated” variants.

We originally used a much smaller dataset to develop our methods and some of our heuristic cutoffs required tuning on this vastly expanded dataset. In particular, in more recent datasets some sequencing groups (e.g. COG-UK) have produced disproportionately more data. Our previous methods sometimes yielded results that we believe are false positives, when real but slightly homoplasious mutations recur in samples from a single group. We resolve this with a second modification to our approach. Briefly, we used the parsimony score, the minimum number of mutations required to generate the sample genotypes on a given phylogeny, to summarize how frequently a position is mutated during SARS-CoV-2 evolution. If the parsimony score is high relative to the minor allele frequency, this suggests that the mutation is not heritable. We required a parsimony score:minor allele frequency ratio of at least 0.5 to identify a potentially problematic position. The ratio we selected implies that each inferred clade with the alternative allele has at most two descendants on average and is therefore unlikely to be a heritable mutation.

The software for the methods above is available from https://github.com/yatisht/strain_phylogenetics (see in particular the C++ branch), and https://github.com/lgozasht/COVID-19-Lab-Specific-Bias-Filter. We developed our masking recommendations so that they would fit most users and most versions of SARS-CoV-2 alignments without requiring additional analyses and filtering; however, in some particular cases (for example if the alignment near genome ends is of interest, or if extremely conservative masking is required), the methods above and in https://github.com/roblanf/sarscov2phylo could be adapted for this purpose.

Types of Evidence Considered: There are four forms of evidence that we use to identify and categorize problematic positions in SARS-CoV-2 sequencing data. These have formed the basis of some of our previous masking recommendations in this thread and in [1], but we reiterate them here briefly to facilitate interpretation of our recommendations.

homoplasic/highly_homoplasic. These are positions with alternate alleles that reoccur on many distinct branches of the SARS-CoV-2 phylogeny.
single_src/narrow_src. Evidence here consists of strong associations between alternate alleles at a given position and specific sequencing labs.
ambiguous/highly_ambiguous. Many positions show a strong excess of ambiguous basecalls relative to the number of alternative alleles. Most highly ambiguous positions are strongly lab-associated and likely reflect systematic errors that are found in a portion of reads for a given sample.
neighbour_linked. Alternative alleles at nearby positions that show strong associations across samples (i.e. linkage-disequilibrium for nearby pairs of low-frequency alleles).

We combine the evidence from these four signals to provide the masking recommendations presented below. In some instances, we also accept additional evidence from the sequencing laboratories or from sequencing read files to identify problematic positions.

Recommendations: To facilitate accurate interpretation of SARS-CoV-2 genome variation, we provide two types of recommendation. First, we designate positions that we believe will adversely affect most analyses as mask, that is, we recommend the removal of all sample genotypes at these positions for essentially all analyses. Second, positions that might adversely affect some biological inferences, but which might be suitable for other analysis are labeled as caution. Below we describe some common categories of problematic positions and explain our recommendations. We note that patterns of variation at some positions are not completely cut-and-dry and we sometimes use our judgement to determine what label is best.

Lab-Associated Errors: Most of the positions that we recommend masking are those that are homoplasic/highly_homoplasic and either single_src or narrow_src. These positions are likely artifactual variants that result from lab-specific data producing or processing. For example, some of these variants likely result from incorrect trimming of adapter or primer sequences, and at least one results from alignment of human rRNA to the SARS-CoV-2 reference genome (position 22802: [1]).

We note that in some cases, lab-associated errors have been corrected, but we still believe the positions should be masked. For example, sequences containing artefactual alleles at positions 4050 and 22802 have been corrected as of mid-June 2020. However, analyses based on earlier downloads from the GISAID database might still contain these variants — the GISAID accession numbers do not change so it is challenging to be certain which version of the sequence a given dataset might include — and for this reason we recommend continuing to mask these positions, but we additionally mark these positions as amended. If a user is certain that their alignments are based on corrected sequences, masking may not be necessary for some of the positions that we have so identified.

Hypermutable Positions: We also recommend masking many positions that are highly_homoplasic but which are not associated with a single sequencing group. This includes positions 11074, 11083 and others which we termed “extremal” in previous work [1]. Most of these positions contain C→T mutations and many are adjacent to poly-T stretches in the genome. Both of these features appear to be associated with hypermutability in the SARS-CoV-2 genome [1]. Given the exceptional mutation rates of these positions combined with the limited phylogenetic signal in the entire SARS-CoV-2 alignment, including these positions has strong potential to affect phylogenetic inference and we recommend that they be masked as well for most analyses. However, these positions might not require masking if a given nucleotide substitution model can accommodate hypermutability, context-dependency, and the possibility of strand-specific mutation effects (i.e. C→T mutation rates differ from G→A).

Highly Ambiguous Positions Often Suggest Systematic Errors: In general, positions with many ambiguous basecalls are not necessarily problematic for phylogenetic and other biological analyses, but positions with many ambiguities can provide evidence to suggest that there is a systematic error that could affect sequences. While it is possible that polymorphisms identified from within-host samples are transmitted and therefore appear as ambiguities in related samples, we emphasize that the highly_ambiguous positions that we identified are always highly homoplasic, consistent instead with them being systematic errors. For positions where we find only ambiguous basecalls (ambiguous and single_src) and no alternative alleles, we typically indicate caution. However, for positions with extremely high rates of ambiguous genotypes (highly_ambiguous), we sometimes suggest masking. When there are also alternative alleles (non-ambiguous characters different from the reference) at a given highly_ambiguous position, and particularly when the sequences containing alternative alleles and ambiguous alleles are single_src, we recommend mask.

Locally-Linked Positions Suggest Multi-Nucleotide Mutations or Correlated Error: Highly correlated locally linked positions might reflect multi-nucleotide mutation (MNM) events (for which we suggest caution) or they may occur due to correlated errors (for which we often observe positions to be homoplasic and single_src and we suggest mask). For example, the correlated errors at positions 24389 and 24390 likely result from alignment of a nanopore adapter sequence that was not trimmed from the reads prior to assembly (see [1,6] and comment above by @matthew.parker). Such correlated errors are particularly problematic because the site pattern at each position is typically modelled as independent during phylogenetic reconstruction and therefore have more pronounced effects on phylogenetic inference [1]. Even real MNM events might strongly affect phylogenetic inferences for similar reasons, and might cause false signals of positive selection [7]. We therefore think these positions should also be subject to scrutiny, and we suggest caution with the linked variants that we identified.

Comprehensiveness of Masking Recommendations: As we described above, updates to GISAID sequences take place without modification of accession numbers or any overt record. Although removing errors is clearly important, this also makes it impossible to provide comprehensive masking recommendations for all possible downloads from the GISAID database. We therefore suggest that it may often be valuable to conduct analyses similar to ours on a given dataset (tools to do this are provided at https://github.com/yatisht/strain_phylogenetics and https://github.com/lgozasht/COVID-19-Lab-Specific-Bias-Filter). Nonetheless, we are confident that most of the highly problematic positions in currently available SARS-CoV-2 sequence data have been identified by our analyses. We will continue to update these recommendations going forward using these approaches and future updates of the global SARS-CoV-2 phylogeny from https://github.com/roblanf/sarscov2phylo. That phylogeny, in turn, is always based on the latest recommendations of positions that should be masked in all analyses contained at https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/subset_vcf/problematic_sites_sarsCov2.mask.vcf.

Acknowledgements: We remain grateful to all the authors who have kindly deposited and shared genome data on GISAID (https://www.gisaid.org/). A full table of acknowledgments for GISAID contributors can be found here: https://github.com/roblanf/sarscov2phylo/tree/master/acknowledgements.

References:

[1] Turakhia Y, Thornlow B, Gozashti L, Hinrichs AS, Fernandes JD, Haussler D, et al. 2020. Stability of SARS-CoV-2 phylogenies. https://doi.org/10.1101/2020.06.08.141127.

[2] Fernandes JD, Hinrichs AS, Clawson H, Gonzalez JN, Lee BT, Nassar LR, et al. 2020. The UCSC SARS-CoV-2 Genome Browser. https://doi.org/10.1101/2020.05.04.075945.

[3] ENSEMBL COVID-19 Genome Browser. https://covid-19.ensembl.org/index.html. Accessed 14 July 2020.

[4] Mai U, Mirarab S. 2018. TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees. BMC Genomics 19: 272.

[5] Price MN, Dehal PS, Arkin AP. 2010. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One. 5: e9490.

[6] Korber B, Fischer W, Gnanakaran SG, Yoon H, Theiler J, Abfalterer W, Foley B, Giorgi EE, Bhattacharya T, Parker MD, Partridge DG. 2020. Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. https://doi.org/10.1101/2020.04.29.069054.

[7] De Maio N, Holmes I, Schlötterer C, Kosiol C. 2013. Estimating empirical codon hidden Markov models. Molecular Biology and Evolution 30: 725–736.