Issues with SARS-CoV-2 sequencing data

NicolaDeMaio · April 13, 2021, 9:58am

Updated analysis with data from 4 March 2021

Landen Gozashti¹, Conor R. Walker²,³, Robert Lanfear⁴, Nick Goldman², Nicola De Maio² and Russell Corbett-Detig⁵

¹Department of Organismic & Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA

²European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridgeshire, United Kingdom

³Department of Genetics, University of Cambridge, Cambridge, United Kingdom

⁴Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT, Australia

⁵Department of Biomolecular Engineering and Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA

Purpose: This post updates our masking recommendations and describes changes to the methodology and performance of our automated software for identifying error-prone sites in the SARS-CoV-2 genome, in light of the exponentially increasing number of available SARS-CoV-2 sequences. Our updated masking recommendations are still maintained here in the format discussed here. Our systematic pipeline for erroneous site detection can be found here (commit 04ee2ef for these analyses).

Data and Phylogenetic Inference: We used 428,865 sequences available through the GISAID database as of 4 March 2021 to update our masking recommendations. This dataset comprises nearly double the number of sequences used in our most recent previous update. The inferred phylogenetic tree of these sequences and the metadata for each sample are available through GISAID. We used the same filtering pipeline as described in the previous post.

Updated Performance: The number of available SARS-CoV-2 sequences is growing at an exponential rate. The already massive datasets available on GISAID have presented numerous challenges for bioinformatic tools and analyses due to towering runtime and memory requirements [1]. In light of this, we updated our systematic pipeline for problematic site detection to boost performance, primarily focusing on parallelizing our lab association analyses. We achieve approximately a fivefold increase in efficiency when parallelizing our associations across just 5 processes, ensuring that we can still update recommendations in a timely manner even when considering 500,000+ SARS-CoV-2 samples.
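To illustrate the general idea of spreading per-site lab associations across a small pool of worker processes, here is a minimal Python sketch. It is not the code from our pipeline: the function names (`lab_association`, `run_associations`), the task layout, and the statistic itself (the share of minor-allele samples contributed by the most common lab) are hypothetical stand-ins chosen only to keep the example self-contained.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def lab_association(task):
    """Toy per-site statistic (a stand-in, not the pipeline's actual test).

    `task` is a (position, labs) pair, where `labs` lists the lab label of
    every sample carrying the minor allele at that position.
    Returns (position, most_common_lab, share_of_minor_allele_samples).
    """
    position, labs = task
    if not labs:
        return position, None, 0.0
    top_lab, count = Counter(labs).most_common(1)[0]
    return position, top_lab, count / len(labs)


def run_associations(tasks, processes=5):
    """Evaluate every site, optionally across several worker processes.

    Sites are independent of one another, so they can be mapped over a
    process pool; results come back in the same order as `tasks`.
    """
    tasks = list(tasks)
    if processes <= 1 or len(tasks) < 2:
        # Serial fallback: same results, no worker processes started.
        return [lab_association(t) for t in tasks]
    # Send sites to workers in batches to limit inter-process overhead.
    chunksize = max(1, len(tasks) // (processes * 4))
    with ProcessPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(lab_association, tasks, chunksize=chunksize))
```

Because each site is processed independently, the work divides cleanly across processes. On platforms that start worker processes by spawning a fresh interpreter, the call to `run_associations` should sit under an `if __name__ == "__main__":` guard.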

Updated Methodology: In our previous post, we reduced our minimum Parsimony Score:Minor Allele Count (MAC) ratio to 0.3 to ensure that we capture most problematic sites within GISAID’s full sequence dataset as of 13 November 2020. We reassessed the suitability of this ratio with the 4 March 2021 dataset and did not find any new problematic sites with Parsimony Score:MAC ratios below 0.3, suggesting that the threshold is still adequate. However, in light of the aforementioned runtime requirements, we plan to impose a minimum parsimony score of 10 in the future. In the 4 March 2021 dataset, only one site flagged for caution has a parsimony score below 10, which means that we performed unnecessary associations at 6,893 sites. Average parsimony score increases with sample size, and imposing a minimum parsimony score drastically reduces runtime by limiting the number of associations performed at sites that are unlikely to be problematic.
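The two thresholds described above can be combined into a simple pre-filter that decides which sites proceed to the association step. The Python sketch below is illustrative only: the `Site` class and its field names are hypothetical, and treating both thresholds as inclusive is an assumption rather than a statement about how our pipeline handles boundary values.

```python
from dataclasses import dataclass

MIN_RATIO = 0.3      # minimum Parsimony Score:MAC ratio (value from the post)
MIN_PARSIMONY = 10   # planned minimum parsimony score (value from the post)


@dataclass(frozen=True)
class Site:
    position: int         # genome coordinate (hypothetical field name)
    parsimony_score: int  # parsimony score of the site on the inferred tree
    mac: int              # minor allele count


def is_candidate(site, min_ratio=MIN_RATIO, min_parsimony=MIN_PARSIMONY):
    """Return True if the site should be passed on to the association step."""
    if site.mac <= 0:
        return False  # no minor allele, so the ratio is undefined
    if site.parsimony_score < min_parsimony:
        return False  # cheap check first: skips most low-signal sites
    # Inclusive comparison is an assumption made for this sketch.
    return site.parsimony_score / site.mac >= min_ratio


def candidate_sites(sites, min_ratio=MIN_RATIO, min_parsimony=MIN_PARSIMONY):
    """Keep only the sites that satisfy both thresholds."""
    return [s for s in sites if is_candidate(s, min_ratio, min_parsimony)]
```

Applying the parsimony score check before the ratio check means that low-scoring sites are discarded without any further computation, which is where the runtime saving comes from.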

Observations: Many newly added sites in our updated recommendations comprise a substantial number of ambiguous nucleotide calls associated with sequences from the Victorian Infectious Diseases Reference Laboratory (VIDRL). Notably, some of these sites are particularly challenging to detect due to inconsistencies in “submitting lab” and “originating lab” categories of GISAID’s metadata, which can cause a group of samples to appear as if they are associated with multiple labs when in truth they all stem from one [2]. We use “originating country” as a rough proxy for detecting these errors (see here). However, such errors are becoming more challenging to correct systematically as they rise in frequency and diversity with increased sample size, highlighting the need for careful attention to metadata standardization and the development and application of tools for metadata correction.
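As a rough illustration of why country can serve as a proxy when lab labels are inconsistent, the Python sketch below summarizes a group of samples by how many distinct lab labels they carry and how concentrated they are in a single country. The record keys (`originating_lab`, `country`), the function name, and the normalization by trimming and lowercasing are all assumptions made for this example. It is not the procedure from our pipeline.

```python
from collections import Counter


def summarize_origin(samples):
    """Summarize lab labels and countries for a group of samples.

    `samples` is a list of dicts with hypothetical keys 'originating_lab'
    and 'country'. Labels are only trimmed and lowercased here, so
    genuinely different spellings of one lab still count separately.
    """
    if not samples:
        return {"distinct_labs": 0, "top_country": None, "top_country_share": 0.0}
    labs = Counter(s["originating_lab"].strip().lower() for s in samples)
    countries = Counter(s["country"].strip().lower() for s in samples)
    top_country, count = countries.most_common(1)[0]
    return {
        "distinct_labs": len(labs),
        "top_country": top_country,
        "top_country_share": count / len(samples),
    }
```

A group that shows several distinct lab labels but a country share close to 1.0 is a candidate for manual inspection, since the labels may be variants of a single lab's name.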

References