Title: Wastewater samples CANNOT be used for genome assembly
Author: Alejandro Gener, PhD
AIDS Editorial Board, Los Angeles, California, USA
Dr. Alejandro Gener
Conflicts of interest: I received travel bursaries from Oxford Nanopore Technologies in 2019. I am the founder of Student Genomics.
Funding: No funding was provided for this work.
Wastewater sequencing is providing important insights, enabling early pathogen detection (e.g., emergent SARS-CoV-2 strains/variants) and allowing public health authorities to anticipate community levels of resistance and to predict treatment susceptibility or vaccine efficacy. However, wastewater sequencing should be viewed as distinct from efforts to assemble genomes from individuals, which are most often performed as part of baseline surveillance. It is helpful to include both assembly data and read data to enable troubleshooting and verification of SARS-CoV-2 study claims. These practices have broad implications for all pathogens currently being monitored with wastewater and baseline surveillance.
Regarding the recent paper by Fielding-Miller et al. (1), the authors seem to confuse:
1.) the general concept of SARS-CoV-2 (SC2) genome assembly with
2.) what is possible with wastewater amplicon sequencing and
3.) SC2 whole genome sequencing (WGS) from clinical isolate/sample amplicon sequencing.
This is not the first time I have seen researchers confuse these points, so it is worth drawing explicit lines between what wastewater sequencing surveillance can show and how it is implemented. Item 3 can be used to produce full-length SC2 genome assemblies. Item 2 can at best be used to estimate the proportion of sequence heterogeneity in pooled environmental specimens. Any assembly claimed from item 2 would require confirmation with item 3.
One CANNOT assemble genomes from a typical wastewater sample except under highly controlled conditions. A major reason is that genetic material from multiple strains and persons is mixed or pooled together. The reads generated are also much shorter than the length of an SC2 genome (about 30 kb). This leads to loss of physical linkage (phasing) information, especially in diluted samples and/or samples with low percent reference coverage. In baseline surveillance this is solved by sequencing samples from individuals, not pools, typically isolates with higher sequencing depth (often with lower Ct) and higher percent reference coverage. Samples can also degrade prior to collection and/or during transit, leading to noise or shorter fragments, which prevents long-read sequencing from being a de facto solution to phasing. Without viral isolation, subculturing, and isolate sequencing, even wastewater samples with measurable SC2 signal (Ct lower than 30) would at best look like a box of mixed puzzle pieces, with pieces from different puzzles. High enough pathogen load might look like coverage, but not in the traditional sense of one sample or isolate, one strain. The “coverage” the authors note in their “Methods/Sequencing” section would need to be scaled down for each strain from each person according to their number of cases (samples positive for SC2). Some people might shed more virus. Some might shed less, or undetectable levels. Some people might be co-infected/super-infected, as is seen in immunocompetent hosts during colliding strain (variant) waves, or in immunocompromised people who have difficulty clearing primary infections. All that variability makes it difficult to assemble one puzzle (genome) per strain, let alone multiple puzzles jumbled together. Thankfully, that level of assembly is not necessary to track important variants. It does, however, require different methods, specifically baseline WGS as done in local/state/national public health settings across the US.
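The scaling of pooled “coverage” described above can be sketched as simple arithmetic. This is only an illustration under stated assumptions: the function name `per_contributor_depth`, the pooled depth of 1000x, and the shedding weights are all hypothetical and are not taken from the study.

```python
def per_contributor_depth(pooled_depth, shedding_weights):
    """Split a pooled mean depth across contributors in proportion to shedding.

    pooled_depth: mean depth reported for the pooled sample (hypothetical).
    shedding_weights: relative amount of virus shed by each infected person
    (hypothetical; in practice these are unknown for a wastewater sample).
    """
    total = sum(shedding_weights)
    if total == 0:
        # Nobody shed detectable virus, so no contributor has any depth.
        return [0.0 for _ in shedding_weights]
    return [pooled_depth * weight / total for weight in shedding_weights]


# Hypothetical pooled sample: 1000x mean depth from four infected people,
# one heavy shedder, two light shedders, and one below detection.
depths = per_contributor_depth(1000.0, [8, 1, 1, 0])
print(depths)  # [800.0, 100.0, 100.0, 0.0]
```

In this toy case a pooled depth that looks ample supports only one contributor well, and one contributor not at all. In a real sample the weights are unobserved, which is part of why pooled depth cannot be read as per-strain coverage.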
However, if ONE person (or a limited number of people) were infected at a time, and the samples were not diluted in the environment to the point of obscuring that signal, then it MIGHT be possible to detect enough signal (sequence depth and percent reference coverage) to assemble parts of a genome (contigs) or a full genome. The authors did not describe such an approach, though they did describe wastewater sampling as “time-weighted composite samples and programmed to sample every 10–15 min over a 7 h interval (typically 6:30 am–4:00 pm).” The authors also mention surface sampling (“staff swabbed a one-foot square area in the center of each classroom floor at the end of each day prior to classroom cleaning”). The sampling methods used in the study make recovering virus from a single person difficult or impossible. I would not expect the surface sampling to recover sufficient signal for whole genome assemblies without collapsing virus from multiple people. The environmental nature of the sampling is permissive for admixture (mixed sequences from multiple people in the same sample). Admixture CANNOT be directly assessed at the assembly level. Thankfully, detecting any SC2, or enough mutant alleles to make a lineage call, requires less sensitive methods than full-length genome assembly.
If one had a limited set of highly divergent strains in a sample, one could filter reads by mapping them to the corresponding reference genomes (plural) and then assemble genomes from those subsets (2). Based on their published text, this does not seem to be what the authors did. The study was conducted from November 2020 to March 2021 in California, USA. SC2 is known to surge sub-seasonally, and extensive data are available on which strains were circulating in California during that time. Those data would allow strain-specific reference-guided filtering to support the analyses that the authors claimed to have done (e.g., “recovering near complete genomes”), yet the authors failed to provide methodological information and results to support their claims. This should be added to any correction of this manuscript. The authors should be able to reuse their C-VIEW approach (GitHub: ucsd-ccbb/C-VIEW, a high-throughput data processing pipeline to identify and characterize SARS-CoV-2 variant sequences in specimens from COVID-19 positive hosts or environments) but will need to replace their main SC2 reference genome with reference genomes relevant to their study period. Some SC2 strain- or variant-specific reference genomes are available from the ViralZone “SARS-CoV-2 variants” page.
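A toy sketch of strain-specific reference-guided filtering may make the idea concrete. It is not the C-VIEW method or the method of reference (2). It assumes reads arrive already positioned on a shared coordinate system with no indels, and all names and sequences (`bin_reads`, `lineage_A`, `lineage_B`) are hypothetical. A real workflow would rely on a read aligner and full-length reference genomes.

```python
def mismatches(read, reference, start):
    """Count mismatches between a read and the reference window it overlaps."""
    window = reference[start:start + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)


def bin_reads(reads, references):
    """Assign each (start, sequence) read to the single best-matching reference.

    Reads that match two or more references equally well go to 'ambiguous',
    because they carry no information about which strain they came from.
    """
    bins = {name: [] for name in references}
    bins["ambiguous"] = []
    for start, seq in reads:
        scores = {name: mismatches(seq, ref, start)
                  for name, ref in references.items()}
        best = min(scores.values())
        winners = [name for name, score in scores.items() if score == best]
        target = winners[0] if len(winners) == 1 else "ambiguous"
        bins[target].append((start, seq))
    return bins


# Two hypothetical 20-base references differing at positions 2 and 14.
references = {
    "lineage_A": "ACGTACGTACGTACGTACGT",
    "lineage_B": "ACTTACGTACGTACATACGT",
}
reads = [(0, "ACGTAC"), (0, "ACTTAC"), (12, "ACATAC"), (4, "ACGTAC")]
bins = bin_reads(reads, references)
print(bins)
```

The read starting at position 4 covers only bases shared by both references, so it lands in the ambiguous bin. The more similar the strains, the larger that bin becomes, which is why this approach is limited to a small set of highly divergent strains.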
Comparing A.) assemblies (.fasta) with B.) read sequences (.fastq) would help to illustrate the inappropriateness of collapsing nearly identical SC2 sequences onto a single linear reference genome and calling a consensus on that dogpile. This process is also known as reference-guided assembly. However, the authors did not share read-level data in the manuscript. As for their assemblies, I searched GISAID on 27 February 2023 for the “Accession_ID” “hCoV-19/env/USA/CA-SEARCH-16259”, which was first on the list of assemblies the authors included in their supplemental data file. This string is not a GISAID accession (GISAID accession IDs take the form EPI_ISL_ followed by a number), although it resembles the format of a GISAID virus name. My search returned nothing. At present, the assembly data from this study do not seem to be available in GISAID. This is something that reviewers at Lancet Regional Health Americas should have noticed. The authors should also deposit read data in an open International Nucleotide Sequence Database Collaboration (INSDC) database such as the Sequence Read Archive (SRA), to enable better interpretation by SC2 community members and to maintain the high standard that is needed before studies achieve peer-reviewed status. This is now a requirement of the NIH for future studies and is easily implementable now for the sake of rigor and reproducibility.
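The problem with calling a consensus on a mixed pile of reads can be shown with a toy example. This is a minimal sketch, not the authors' pipeline: the haplotypes, read layout, and the function name `majority_consensus` are hypothetical, and the reads are assumed to be already aligned without indels.

```python
from collections import Counter


def majority_consensus(aligned_reads, length):
    """Return the most common base per position, or 'N' where nothing aligns."""
    columns = [Counter() for _ in range(length)]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Two hypothetical haplotypes differing at positions 2 and 9.
hap_a = "ACGTACGTACGT"
hap_b = "ACTTACGTAAGT"

# Short reads (6 bases) never span both variant sites, so linkage is lost.
# Haplotype A happens to dominate the left half, haplotype B the right half.
reads = (
    [(0, hap_a[0:6])] * 3 + [(0, hap_b[0:6])] * 2
    + [(6, hap_a[6:12])] * 2 + [(6, hap_b[6:12])] * 3
)

consensus = majority_consensus(reads, len(hap_a))
print(consensus)                    # ACGTACGTAAGT
print(consensus in (hap_a, hap_b))  # False
```

The consensus here takes position 2 from one haplotype and position 9 from the other, so it matches neither genome that was actually present. Only the read-level data reveal that the pile was mixed, which is one reason sharing reads alongside assemblies matters.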
Wastewater sequencing for pathogen genome surveillance is great at detecting and quantifying small puzzle pieces (sequence variability), which can be pieced back together to approximate information otherwise only available in actual assemblies from WGS. For most intents and purposes, that kind of approximation should NOT be considered the same as viral assembly. Wastewater surveillance is an extremely useful hack/solution, but it has its limitations. It should be viewed as one half of a comprehensive SC2 genomic surveillance program, the other half being WGS of clinical isolates from single individuals.
1. Fielding-Miller R, Karthikeyan S, Gaines T, Garfein RS, Salido RA, Cantu VJ, et al. Safer at school early alert: an observational study of wastewater and surface monitoring to detect COVID-19 in elementary schools. Lancet Reg Health Am [Internet]. 2023 Mar 1;19. Available from: https://doi.org/10.1016/j.lana.2023.100449
2. Gener A, Burleson T, Pettersson J, Hemarajata P, Green N. Assembly deconvolution resolves SARS-CoV-2 haplotypes in Delta-Omicron co-infections. In: London Calling 2022 [Internet]. London, UK; 2022.