Wastewater samples CANNOT be used for genome assembly

Title: Wastewater samples CANNOT be used for genome assembly

Author: Alejandro Gener, PhD

AIDS Editorial Board, Los Angeles, California, USA

Correspondence to:

Dr. Alejandro Gener


Conflicts of interest: I received travel bursaries from Oxford Nanopore Technologies in 2019. I am the founder of Student Genomics.

Funding: No funding was provided for this work.


Wastewater sequencing is providing important insights that enable early pathogen detection (e.g., emergent SARS-CoV-2 strains/variants) and allow public health authorities to anticipate community levels of resistance and to predict treatment susceptibility or vaccine efficacy. However, wastewater sequencing should be viewed as distinct from efforts to assemble genomes from individuals, which are most often performed as part of baseline surveillance. It is helpful to include both assembly data and read data to enable troubleshooting and verification of SARS-CoV-2 study claims. These practices have broad implications for all pathogens currently being monitored with wastewater and baseline surveillance.


Regarding the recent paper by Fielding-Miller et al. (1), the authors seem to confuse:

1.) the general concept of SARS-CoV-2 (SC2) genome assembly with

2.) what is possible with wastewater amplicon sequencing and

3.) SC2 whole genome sequencing (WGS) from clinical isolate/sample amplicon sequencing.

This is not the first time I have seen researchers confuse these points, so it is worth drawing explicit lines to define the scope of wastewater sequencing surveillance analyses as they are actually implemented. Number 3 can be used to produce full-length SC2 genome assemblies. Number 2 can at best be used to estimate the proportion of sequence heterogeneity in pooled environmental specimens. Assemblies would require confirmation of 2 with 3.

One CANNOT assemble genomes from a typical wastewater sample except under highly controlled conditions. A major reason is that genetic material from multiple strains and persons is mixed or pooled together. Reads generated are also much shorter than the lengths of SC2 genomes. This leads to loss of physical linkage (phasing) information in diluted samples and/or samples with low percent reference coverage. This is solved by using isolates with higher sequencing depth (often with lower Ct) and percent reference coverage, and samples from individuals, not pools. Samples can degrade prior to collection and/or during transit, leading to noise or shorter fragments, preventing long-read sequencing from being a de facto solution to phasing. Wastewater samples in the absence of viral isolation, subculturing, and isolate sequencing with measurable SC2 signal (Ct lower than 30) would at best look like a box of mixed puzzle pieces, but pieces from different puzzles. High enough pathogen load might look like coverage, but not in a traditional one sample/isolate-one strain sense. The “coverage” the authors note in their “Methods/Sequencing” section for each strain from each person would need to be decreased based on their number of cases (samples positive for SC2). Some people might shed more virus. Some might shed less or undetectable levels. Some people might be co-infected/super-infected, as is seen with immunocompetent hosts during colliding strain (variant) waves, or in immunocompromised people who have difficulty clearing primary infections. All that variability makes it difficult to assemble one puzzle (genome) per strain, let alone multiple puzzles jumbled together. Thankfully, that level of assembly is not necessary to track important variants. But that level of assembly also requires different methods, specifically baseline WGS as done in local/state/national public health settings across the US.
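To make the coverage point above concrete, here is a minimal arithmetic sketch. It assumes pooled depth is divided among contributors in proportion to how much virus each sheds; the function name, the 1000x depth, and the shedding weights are all hypothetical illustration values, not numbers from the study.

```python
def per_person_depth(total_depth, shedding_weights):
    """Split pooled sequencing depth across contributors in proportion to shedding.

    Assumption: depth scales linearly with each person's share of viral material.
    """
    total = sum(shedding_weights)
    if total <= 0:
        raise ValueError("at least one contributor must shed detectable virus")
    return [total_depth * w / total for w in shedding_weights]


# Hypothetical pool: 1000x reported depth and four infected people,
# one shedding heavily, two moderately, one at undetectable levels.
depths = per_person_depth(1000, [6, 2, 2, 0])
print(depths)  # [600.0, 200.0, 200.0, 0.0]
```

Under these assumptions, a single reported depth overstates what is available for any one person's virus, and one contributor is not represented at all.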

However, if ONE person (or a limited number of people) were infected at a time, and the samples were not diluted in the environment to obscure that signal, then MAYBE it might be possible to detect enough signal (sequence depth and percent reference coverage) to assemble parts of a genome (contigs) or a full genome. This approach was not disclosed by the authors, though the authors did mention wastewater sampling as “time-weighted composite samples and programmed to sample every 10–15 min over a 7 h interval (typically 6:30 am–4:00 pm).” The authors also mention surface sampling (“staff swabbed a one-foot square area in the center of each classroom floor at the end of each day prior to classroom cleaning”). The sampling methods used in the study make recovering single virus samples difficult/impossible. I would not expect the surface sampling to recover sufficient signal for whole genome assemblies without collapsing virus from multiple people. The environmental nature of the sampling is permissive for admixture (mixed sequences from multiple people in the same sample). Admixture CANNOT be directly assessed at the assembly level. Thankfully, detection of any SC2, or of enough mutant alleles to make a lineage call, requires less sensitive methods than full-length genome assembly.

If one had a limited set of highly divergent strains in a sample, one could map filtered reads based on those reference genomes (plural) and then assemble the genomes from those subsets (2). This does not seem to be what the authors did based on their published text. The study was conducted from November 2020 to March 2021 in California, USA. SC2 is known to surge sub-seasonally, and we have extensive data available for which strains were around in California during that time to be able to use strain-specific reference-guided filtering to perform the analyses that the authors claimed to have done (e.g., “recovering near complete genomes”) while failing to provide methodological information and results to support their claims. This should be added to any correction of this manuscript. The authors should be able to reuse their C-VIEW approach (ucsd-ccbb/C-VIEW on GitHub, a high-throughput data processing pipeline to identify and characterize SARS-CoV-2 variant sequences in specimens from COVID-19 positive hosts or environments) but will need to replace their main SC2 reference genome with reference genomes relevant to their study period. Some SC2 strain/variant-specific reference genomes are available from ViralZone (SARS-CoV-2 variants).
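The filtering idea above can be sketched as a toy. This is not C-VIEW or any published pipeline: it assumes reads already carry a start coordinate on a shared reference frame, and the two short "lineage" sequences, the read set, and the function names are invented for illustration. A real workflow would use an aligner and full-length references.

```python
def mismatches(read, start, reference):
    """Count mismatches between a read and the reference segment it overlaps."""
    segment = reference[start:start + len(read)]
    return sum(1 for a, b in zip(read, segment) if a != b)


def bin_reads(reads, references):
    """Assign each (start, sequence) read to its closest reference; ties are ambiguous."""
    bins = {name: [] for name in references}
    bins["ambiguous"] = []
    for start, seq in reads:
        scores = {name: mismatches(seq, start, ref) for name, ref in references.items()}
        best = min(scores.values())
        winners = [name for name, score in scores.items() if score == best]
        target = winners[0] if len(winners) == 1 else "ambiguous"
        bins[target].append((start, seq))
    return bins


# Two hypothetical references differing at positions 2, 9 and 16.
references = {
    "lineage_A": "ACGTACGTACGTACGTACGT",
    "lineage_B": "ACATACGTATGTACGTGCGT",
}
reads = [(0, "ACGTAC"), (7, "TATGTA"), (3, "TACGT"), (14, "GTGCGT")]
bins = bin_reads(reads, references)
for name, members in bins.items():
    print(name, members)
```

Note the read starting at position 3: it covers no distinguishing site, so it cannot be assigned to either lineage. Short reads from closely related strains behave this way much of the time, which is why the approach depends on the strains being highly divergent.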

Comparing A.) assemblies (.fasta) with B.) read sequences (.fastq) would help illustrate the inappropriateness of collapsing nearly identical SC2 sequences onto a single linear reference genome and calling a consensus on that dogpile. This process is also known as reference-guided assembly. However, read-level data were not shared by the authors in the manuscript. As for their assemblies, I searched GISAID on 27 February 2023 for the “Accession_ID” “hCoV-19/env/USA/CA-SEARCH-16259”, which was first on the list of assemblies the authors included in their supplemental data file. This identifier is not a GISAID accession, although it vaguely resembles the virus name format. My search returned nothing. At present, the assembly data from these studies do not seem to be available in GISAID. This is something that reviewers at Lancet Regional Health Americas should have noticed. The authors should also deposit read data in an open International Nucleotide Sequence Database Collaboration (INSDC) database like the Sequence Read Archive (SRA) to enable better interpretation by SC2 community members and to maintain the high standard that is needed before studies achieve peer-review status. This is now a requirement of the NIH for future studies and is easily implementable now for the sake of rigor and reproducibility.
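A toy illustration of the dogpile problem, under stated assumptions: two hypothetical 12-base haplotypes that differ at two sites, reads pooled from both, uneven representation across two regions (as might happen with uneven amplicon yield), and a simple per-position majority vote standing in for consensus calling. None of the sequences or names come from the study.

```python
from collections import Counter


def consensus(reads, length):
    """Majority-vote consensus from (start, sequence) reads; 'N' where nothing covers."""
    columns = [Counter() for _ in range(length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Hypothetical haplotypes differing at positions 2 and 9.
hap_x = "ACGTACGTACGT"
hap_y = "ACATACGTATGT"

# hap_x dominates the first region, hap_y dominates the second.
reads = (
    [(0, hap_x[0:6])] * 3 + [(0, hap_y[0:6])]
    + [(6, hap_x[6:12])] + [(6, hap_y[6:12])] * 3
)

result = consensus(reads, 12)
print(result)           # ACGTACGTATGT
print(result == hap_x)  # False
print(result == hap_y)  # False
```

The consensus takes position 2 from one haplotype and position 9 from the other, so it matches neither genome actually present. Only the read-level data reveal this, which is the reason for asking that reads be shared alongside assemblies.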

Wastewater sequencing for pathogen genome surveillance is great at detecting and quantifying small puzzle pieces (sequence variability) which can be pieced back together to approximate information otherwise only available in actual assemblies from WGS. For most intents and purposes, that kind of approximation should NOT be considered the same as viral assembly. Wastewater surveillance is an extremely useful hack/solution, but it has its limitations. It should be viewed as one half of comprehensive SC2 genomic surveillance programs with single person/individual/clinical isolate WGS.


  1. Fielding-Miller R, Karthikeyan S, Gaines T, Garfein RS, Salido RA, Cantu VJ, et al. Safer at school early alert: an observational study of wastewater and surface monitoring to detect COVID-19 in elementary schools. Lancet Reg Heal – Am [Internet]. 2023 Mar 1;19. Available from: https://doi.org/10.1016/j.lana.2023.100449

  2. Gener A, Burleson T, Pettersson J, Hemarajata P, Green N. Assembly deconvolution resolves SARS-CoV-2 haplotypes in Delta-Omicron co-infections. In: London Calling 2022 [Internet]. London, UK; 2022.

The general point of this comment is correct: when sequencing a genetically diverse / mixed population, a consensus assembly may very likely not represent any individual molecule in the population. The degree of genetic diversity and the proportional abundances of strains will determine whether this is indeed true. When a wastewater sample is obtained from a population infected by a sufficient diversity of SARS-CoV-2 variants, then indeed the author will be correct that a consensus assembly may not reflect the genotype of any individual infection in the population. While it is a somewhat pedantic point, this is also true when sampling at the level of a single individual patient, with the only difference being of course that a within-person infection has very limited genetic diversity.

With wastewater, this could be particularly problematic if the sampled population is infected by divergent variants, and a resulting consensus assembly ends up looking like a chimera of the two. This could confound phylogenetic analyses that draw on all variants in GISAID. However, it would be prudent for any person performing such analyses to actually examine where the sequences they are analyzing are coming from, and exclude wastewater consensus assemblies from such a particular analysis. In the end, the onus for quality control lies not just with data generators, but with computational analyzers as well, and there is never a replacement for close examination of the data one is using in an analysis.

One aspect of this comment that I struggle to understand is that the work it is criticizing (Fielding-Miller et al. 2023) does not base its primary conclusions on consensus assembly. Rather, they use the tool Freyja for strain deconvolution for their primary results: see Figure 1 and the Figure 1 legend.

So it is a bit unclear to me how the comment’s point relates to the work being cited therein. I think it is because the authors also deposited consensus sequences on GISAID, but I do not necessarily think that was unwarranted, as long as the sequences are clearly delineated as being environmental in origin. Rather, the onus is on users of that data to properly understand the context in which it was generated when performing their analyses.

Personally, I think it is prudent to avoid the assumption that every environmental sample is diverse enough that the consensus assembly is meaningless or biologically irrelevant. It is simply not true, as many environmental samples will often have limited genetic diversity. For any given case, this can be confirmed by examining the intra-sample genetic diversity by calling SNVs: if the consensus is well-supported with few high frequency variants, then it all but certainly reflects individual genotypes from the population. Hard and fast rules in biology are rare, and there is no replacement for careful examination of data in its proper context.
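The check described above can be sketched roughly as follows. This is not Freyja or any specific variant caller: it assumes per-position base counts are already available from a pileup, and the function name, the depth cutoff of 10, and the 0.2 minor-frequency cutoff are hypothetical choices made only for illustration.

```python
def mixed_sites(pileup, min_depth=10, min_minor_freq=0.2):
    """Flag positions whose combined non-major allele frequency suggests a mixture.

    pileup: list of dicts mapping base -> read count, one dict per position.
    Returns (position, minor_allele_frequency) pairs; low-depth sites are skipped.
    """
    flagged = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        major = max(counts.values())
        minor_freq = 1 - major / depth
        if minor_freq >= min_minor_freq:
            flagged.append((pos, round(minor_freq, 3)))
    return flagged


# Hypothetical pileup for five positions.
pileup = [
    {"A": 100},           # clean site
    {"C": 98, "T": 2},    # likely sequencing error
    {"G": 60, "A": 40},   # high-frequency minor allele
    {"T": 5},             # too shallow to judge
    {"A": 55, "G": 45},   # high-frequency minor allele
]
flagged = mixed_sites(pileup)
print(flagged)  # [(2, 0.4), (4, 0.45)]
```

A sample with no flagged sites is a candidate for a well-supported consensus. One with several flagged sites deserves a closer look before its consensus is treated as a single genotype.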

The general point that we can’t use wastewater samples to perform genome assembly is incorrect - assembly and consensus calling can indeed be meaningfully done from such samples. However, the critical question is how does such a consensus genome relate to the true genomes being present in a mixed sample and how do we correctly interpret what a “consensus genome” means from such a sample (whether wastewater or not). That’s complex, and it’s unlikely that in most cases (although, see our Karthikeyan et al, Nature 2022 for important exceptions) the consensus genome represents a single true majority genome present in the sample (i.e., in most cases the consensus genome ends up being a chimera of true genomes being present). But that doesn’t mean we can’t assemble the genome in the first place, we just have to be careful when we interpret what it means.

As for the complex task of assembling several consensus genomes representing true genomes in a mixed sample, well, I’m not convinced that’s an unsolvable problem - albeit a very complex one. More to come on that as part of our developments with Freyja.

@Kristian_Andersen, see “dogpile.”

But also, one doesn’t need an accurate assembly per se for mutation surveillance. A motivating factor for my comment is to make sure decision makers looking to cut costs do not try to substitute pooled wastewater sampling for the sampling we do for baseline surveillance. Each has its place, ideally right next to each other keeping watch for evolving genomic landscapes.

@alexcritschristoph, my original/recent trigger was assemblies being reported from samples without clear means to justify that signal was from one/few persons. Because of that, signal might not be high enough to do anything close to true assembly deconvolution, where one can use variant filtering to recover read sets based on intrinsic heterogeneity of divergent strains (cheating with local epi info, but that’s why it’s there), and calling accurate reference-guided consensus assemblies from those. Comparing what I suspect(ed) to be dogpile assemblies (inappropriately collapsed variants from mixed reads) with the reads used to make each one would do a lot for convincing community members that necessary conditions were satisfied in the case of the assemblies in question to make accurate assemblies. Read data was not available for this paper.

A worse/more hacky way to evaluate the integrity of SC2 assemblies is to plop them into Nextclade. The flags that pop up are useful for picking up admixture, but not perfect. The assemblies for this paper were not actually available when I looked. Assuming they become available, that’s on my list of things to check.

The GISAID accessions from Fielding-Miller et al. are all marked ‘env’. I’m not sure if any are publicly available yet (some do not appear to be), but past ‘env’ consensus assemblies deposited by CA-SEARCH are also marked as ‘wastewater surveillance’. Are there particular accessions you are concerned are not appropriately labeled?

Completely agree with this separate point, and I hope the authors can deposit them into the SRA so that they are available to the community for reproducibility or re-analysis!

@alexcritschristoph yeah, “env” might be short for environmental?

Those accession_IDs the authors provided are not GISAID accessions. I picked a couple random ones and couldn’t find them in GISAID with accession or virus name. Since I’m not being paid to do external labs’ QC, I’ll leave it up to them to sort it out/correct their supplemental data file once they figure out what happened.

Disappointed in the journal though for not checking. (But their reviewers were also probably not paid to check :sweat_smile:) That’s why we give accession numbers though. It should be ok to ping folks publicly to help make sure corrections actually get addressed in a timely manner. And friendly reminder to community members to check their work and not be worried if a correction needs to be made. Stakes are too high to pretend to be perfect.

“Those accession_IDs the authors provided are not GISAID accessions. I picked a couple random ones and couldn’t find them in GISAID with accession or virus name.”

If you search for the CA-SEARCH-xxxxx part of the name you will find these sequences. It appears that they are not returned because they lack the “/202x” at the end, and for sequences that start with “hCoV-19” the interface does not attempt a partial search.


I did a partial string search in the GISAID ID, “text” and “virus name” fields before posting here originally; these failed.

Just tried again and they are still not coming up.

Tried by adding /2023 thru /2020 (just in case); no dice.