Issues with SARS-CoV-2 sequencing data

NicolaDeMaio · December 22, 2020, 12:32pm

Updated analysis with data from 13th November 2020

Landen Gozashti¹,², Conor Walker³,⁴, Nick Goldman³, Russell Corbett-Detig²*, and Nicola De Maio³*

¹Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
²Department of Biomolecular Engineering and Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA
³European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom
⁴Department of Genetics, University of Cambridge, Cambridge, United Kingdom
*[email protected], [email protected]

Purpose: This post updates our masking recommendations and describes modifications to our automated method for identifying erroneous sites in the SARS-CoV-2 genome. Our updated masking recommendations are still maintained here in the format discussed here. Our systematic pipeline for erroneous site detection can be found here.

Data and Phylogenetic Inference: We used 147,284 sequences available from the GISAID database as of 13 November 2020 to update our masking recommendations. This dataset captures nearly all available SARS-CoV-2 sequence variation and enables more precise detection of erroneous sites than our previous efforts. The inferred sequence alignment and phylogenetic tree for these sequences can be found here [2]. We used the same pipeline for filtering and phylogenetic inference as described in our previous post.

Updated Methodology: We initially employed the approach described in [1] to systematically detect possibly erroneous variants in SARS-CoV-2 sequences, with additional modifications that we describe in our previous post. One of these modifications was a minimum Parsimony Score:Minor Allele Count (MAC) ratio of 0.5. Under this threshold, a site is flagged as suspicious only if the inferred clades carrying the alternative allele at that site have at most two descendants each on average. We manually reanalyzed the effect of this ratio on our systematic output and found that a minimum ratio of 0.5 is no longer adequate, because it misses some variants that appear problematic. Because sequencing centers typically include samples from nearby areas, and because of travel restrictions, we expect many samples from a given sequencing center to be closely related. Systematic errors shared by these samples then group them even more closely on the phylogeny during tree-building, which drives down the inferred parsimony score. This could explain the lower Parsimony Score:MAC ratios that we observe.

As an example, at site 23122 we observe 160 of 177 ambiguous alternate allele calls, nearly all of which stem from a single laboratory. However, this site yields a Parsimony Score:MAC ratio of ~0.44 and was therefore missed by our previous method. In light of this, after careful manual curation across all sites regardless of their Parsimony Score:MAC ratio, we have reduced the minimum ratio to 0.3, which appears to capture most of the apparently problematic sites in the full dataset. Additional changes may be necessary as new SARS-CoV-2 data accumulate, and we will be mindful of this in the future.
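
The effect of lowering the threshold can be illustrated with a minimal Python sketch of the ratio criterion alone. The `SiteStats` class, the function names, and the counts in the usage example are hypothetical: the counts were chosen only so that the ratio comes out at 0.44, and the sketch leaves out every other criterion the pipeline applies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SiteStats:
    """Per-site summary (hypothetical structure, for illustration only)."""
    position: int            # 1-based genome coordinate
    parsimony_score: int     # inferred number of mutation events on the tree
    minor_allele_count: int  # number of samples carrying the alternative allele


def parsimony_mac_ratio(site: SiteStats) -> float:
    """Parsimony Score:MAC ratio; its inverse is the mean clade size."""
    return site.parsimony_score / site.minor_allele_count


def flag_sites(sites: List[SiteStats], min_ratio: float = 0.3) -> List[int]:
    """Return positions whose ratio meets or exceeds the minimum ratio."""
    return [
        s.position
        for s in sites
        if s.minor_allele_count > 0 and parsimony_mac_ratio(s) >= min_ratio
    ]


# Illustrative counts only (not the real values for this site): ratio = 0.44.
example = SiteStats(position=23122, parsimony_score=44, minor_allele_count=100)

print(flag_sites([example], min_ratio=0.5))  # [] : missed at the old threshold
print(flag_sites([example], min_ratio=0.3))  # [23122] : caught at the new one
```

A ratio of 0.5 corresponds to a mean of two descendants per clade carrying the alternative allele, while 0.3 allows a mean of roughly 3.3, which is why more tightly clustered errors are now captured.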

Systematic Identification of Locally Linked Variants: We also developed a pipeline for the systematic identification of linked variants (available here), which we now apply in conjunction with our other methods. The pipeline uses plink2 to calculate pairwise R² values and detect positive linkage disequilibrium between each variant and other variants within a given region of the genome. For computational efficiency, we limit this search to pairs of sites within 10 bp of one another. Using this method, we find several new locally linked sites, which we annotate in our masking recommendations. Such sites are expected to be particularly challenging for phylogenetic inference algorithms, which typically treat phylogenetic patterns at each site as independent.
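
The windowed pairwise calculation can be sketched in plain Python as follows. This is not plink2's implementation and does not reproduce its command-line interface; it assumes haploid samples encoded as 0/1 vectors (1 = alternative allele), and the function names and the `min_r2` cutoff are hypothetical, since no R² threshold is stated above.

```python
from typing import Dict, List, Tuple


def r_squared(a: List[int], b: List[int]) -> Tuple[float, float]:
    """Return (r2, D) for two equal-length 0/1 allele vectors."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("allele vectors must be non-empty and of equal length")
    p_a = sum(a) / n                                # frequency of alt allele at site A
    p_b = sum(b) / n                                # frequency of alt allele at site B
    p_ab = sum(x & y for x, y in zip(a, b)) / n     # frequency of alt at both sites
    d = p_ab - p_a * p_b                            # linkage disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:                                  # monomorphic site: r2 undefined
        return 0.0, d
    return d * d / denom, d


def locally_linked_pairs(
    variants: Dict[int, List[int]],
    max_distance: int = 10,   # search window in bp, as described in the text
    min_r2: float = 0.9,      # hypothetical cutoff, not taken from the post
) -> List[Tuple[int, int, float]]:
    """Find pairs of sites within max_distance bp that are in positive LD."""
    positions = sorted(variants)
    hits = []
    for i, pos1 in enumerate(positions):
        for pos2 in positions[i + 1:]:
            if pos2 - pos1 > max_distance:
                break                               # later sites are even further away
            r2, d = r_squared(variants[pos1], variants[pos2])
            if d > 0 and r2 >= min_r2:              # keep positive association only
                hits.append((pos1, pos2, r2))
    return hits


# Toy data: sites 100 and 105 always co-occur; 108 is anti-correlated; 200 is too far.
toy = {
    100: [1, 1, 0, 0, 0, 0],
    105: [1, 1, 0, 0, 0, 0],
    108: [0, 0, 1, 1, 0, 0],
    200: [1, 1, 0, 0, 0, 0],
}
for pos1, pos2, r2 in locally_linked_pairs(toy):
    print(pos1, pos2, round(r2, 3))
```

Sorting positions and breaking out of the inner loop once the distance exceeds the window keeps the number of comparisons roughly proportional to the number of variants, which is the efficiency gain the 10 bp limit is meant to provide.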

Acknowledgements: We are very grateful to GISAID and all the groups who shared their sequencing data. A full list of acknowledgments is available here.

References

[1] Turakhia Y, De Maio N, Thornlow B, Gozashti L, Lanfear R, Walker C, Hinrichs AS, et al. 2020. Stability of SARS-CoV-2 phylogenies. PLOS Genetics. https://doi.org/10.1371/journal.pgen.1009175

[2] Lanfear R. 2020. A global phylogeny of SARS-CoV-2 sequences from GISAID. Zenodo. https://doi.org/10.5281/zenodo.3958883