Serratus: The ultra-deep search to discover novel coronaviruses

Despite intense efforts to sequence and analyze SARS-CoV-2 isolates, our understanding of the virus’s provenance is limited by incomplete genomic characterization of the Coronaviridae (CoV) family.

Serratus (https://github.com/ababaian/serratus) is an Open Science project for discovery of new virus sequences on an unprecedented scale. Serratus can search well over a million sequencing libraries per week for known and novel viruses, including RNA-seq, meta-genomic, meta-transcriptomic and environmental NGS datasets.

Here we report results from a preliminary survey of 1.14 million sequencing libraries (26.78 petabases) from the NCBI Sequence Read Archive (SRA). We have uncovered previously unreported CoV species and identified thousands of CoV-positive libraries.

To facilitate rapid analysis of this data we are developing an R package, Tantalus (https://github.com/serratus-bio/tantalus), to interface with Serratus data. This project is under active development and we are seeking to establish immediate collaborations for analysis over the next several weeks.

Uncovering novel CoV species

For just one example, in a Peruvian vampire bat (Desmodus rotundus) we identified a putative Alphacoronavirus via a partial sequence match at 93.4% identity to the RdRP of Bat coronavirus Trinidad/1CO7BA/2007 (GenBank accession: EU769558).

Assembly of this sequencing library with coronaSPAdes yielded a complete 29,264 nt viral genome. This is a new species of coronavirus based on RdRP, nucleoprotein, membrane protein and replicase 1a, classified as a putative Alphacoronavirus falling outside all named sub-genera available in public nucleotide databases.

Alphacoronavirus identification case study

The author of the dataset had previously reported this virus as “DesRot/Peru/AMA_L_F” (Bergner et al. 2019, and see below), but its sequence is unpublished and was not in our query.

This is a clear proof of concept that novel CoVs exist within the Serratus data; there are likely scores more such cases, which we are actively working to uncover. Novel viruses from other families can also be identified with this workflow.

Join the Serratus Collaboration

Our primary objective is to accelerate global coronavirus research and assist in diagnostic and vaccine development with rich evolutionary CoV sequence data. All raw and processed Serratus data is freely and immediately available including a per-accession virus report of all vertebrate virus matches.

We are actively looking for collaborators for the intensive analysis of this data over the next four weeks. Expertise sought includes but is not limited to:

  • Computational virology
  • Phylogenetics and tree building
  • Viral ecology and zoonosis modeling
  • Database and web-interface development
  • R package development
  • AWS cloud computing

Researchers with access to large amounts of sequencing data from bats, wild rodents, or any sample taken from an animal with respiratory/GI disease are also sought. We are offering to generate a no-cost virus report within 24 hours on up to 250,000 libraries. We ask only that adequate metadata be provided and that, if samples are CoV+, the reports and CoV assemblies be shared immediately and without restriction.

Computational architecture

In February 2020, AWS mirrored the NCBI Sequence Read Archive (SRA) onto its S3 servers as an Open Dataset, which allows for an unprecedented rate of access to raw sequencing data.

To perform the ultra-high throughput CoV search, we employed AWS cloud HPC with a 22,500 vCPU cluster (1460 x R5.xlarge, 4120 x C5.xlarge, and 90 x C5.large EC2 instances). Using this hyper-parallelized architecture we could bypass conventional networking and disk IO limitations to achieve a processing rate exceeding 500,000 sequencing libraries per day at a cost of ~$0.01 per library.
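As a rough sanity check on the figures above, the following Python sketch recomputes the cluster size and derives some back-of-the-envelope totals. The per-instance vCPU counts come from AWS's published EC2 specifications rather than from this post, and the daily cost, survey cost and survey duration are our own extrapolations from the quoted "exceeding 500,000 libraries per day" and "~$0.01 per library", so treat them as approximate.

```python
# vCPUs per instance size (AWS's published EC2 specs, not stated in the post).
VCPUS_PER_INSTANCE = {"r5.xlarge": 4, "c5.xlarge": 4, "c5.large": 2}

# Instance counts quoted in the post.
FLEET = {"r5.xlarge": 1460, "c5.xlarge": 4120, "c5.large": 90}

# Throughput and cost figures quoted in the post (both approximate).
LIBRARIES_PER_DAY = 500_000
COST_PER_LIBRARY_USD = 0.01
SURVEY_LIBRARIES = 1_140_000


def total_vcpus(fleet, vcpus_per_instance):
    """Sum vCPUs over every instance type in the fleet."""
    return sum(count * vcpus_per_instance[name] for name, count in fleet.items())


cluster_vcpus = total_vcpus(FLEET, VCPUS_PER_INSTANCE)
daily_cost_usd = LIBRARIES_PER_DAY * COST_PER_LIBRARY_USD
survey_cost_usd = SURVEY_LIBRARIES * COST_PER_LIBRARY_USD
survey_days = SURVEY_LIBRARIES / LIBRARIES_PER_DAY

print(f"cluster size:      {cluster_vcpus:,} vCPUs")
print(f"cost at full rate: ~${daily_cost_usd:,.0f} per day")
print(f"survey cost:       ~${survey_cost_usd:,.0f}")
print(f"survey duration:   ~{survey_days:.2f} days at full rate")
```

The instance counts do sum to the stated 22,500 vCPUs. The cost and duration lines assume the cluster runs at the quoted rate for the whole survey, which the post does not claim.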

Our viral search query is a CoV pan-genome composed of all CoV GenBank sequences clustered at 99% identity, plus all non-retroviral “representative virus sequences” in RefSeq. We use bowtie2 as the aligner, which can detect short-read sequences at up to 20% nucleotide divergence from the query. The resulting alignment files are then summarized into a report file for import into R and downstream processing.
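To make the identity and divergence figures concrete, here is a minimal Python sketch of how percent identity between a read and a reference can be computed from an existing pairwise alignment. This is an illustration only: it is not how bowtie2 scores alignments, the function names are hypothetical, and ignoring gap columns is one convention among several (tools differ on how gaps are counted).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of non-gap alignment columns in which the two bases match.

    Both inputs must already be aligned (equal length, '-' for gaps).
    Columns with a gap in either sequence are skipped (an assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no non-gap columns to compare")
    return 100.0 * matches / compared


# Divergence limit quoted in the post for the aligner.
MAX_DIVERGENCE_PERCENT = 20.0


def within_detection_range(aligned_read: str, aligned_ref: str,
                           max_divergence: float = MAX_DIVERGENCE_PERCENT) -> bool:
    """True if the read diverges from the reference by no more than the limit."""
    divergence = 100.0 - percent_identity(aligned_read, aligned_ref)
    return divergence <= max_divergence


# Toy example: one mismatch in ten aligned bases.
read = "ACGTACGTAC"
ref = "ACGTTCGTAC"
print(percent_identity(read, ref))        # 90.0
print(within_detection_range(read, ref))  # True
```

Under this definition, the 99% clustering threshold and the 93.4% RdRP match reported above are both statements about the fraction of matching aligned positions, and a read is in range for the aligner when that fraction is at least 80%.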

Conclusion

Serratus is an open science project; we are actively seeking to establish collaborations starting immediately to translate these data into a meaningful community resource in the fight against COVID-19. By expanding the known repertoire of coronaviruses together, we can not only help determine the origins of this pandemic, but help prevent another one.

Through mining NextSeq500 data made publicly available by our laboratory on the European Nucleotide Archive (PRJEB28138), the authors of Serratus describe a putatively novel coronavirus species in common vampire bats, which they designate “Fr4NK.” We wish to point out that the presence of novel alphacoronaviruses in this species (and indeed in this exact dataset) was already known and reported twice (see Bergner et al. 2019, Bergner et al. 2020). In particular, Bergner et al. 2020 states: “We also detected full genomes of novel viruses in genera capable of infecting humans such as Alphacoronavirus and Rotavirus…”. A formal description of this putatively novel species is in a thesis chapter (now publicly available here), and a manuscript which describes multiple novel viral genomes from different parts of Peru, along with additional ecological, evolutionary, and geographic analyses, is in the final stages for submission. Our earlier studies referred to “Fr4NK” as “DesRot/Peru/AMA_L_F”. Our phylogenetic analysis of the RdRp gene is below, showing that the vampire bat-associated alphacoronaviruses are nested within a larger clade of neotropical bat viruses from the bat family Phyllostomidae (Figure).


Figure. Coronaviridae RdRp phylogeny. Maximum likelihood tree based on a 272 amino acid alignment of 52 RdRp sequences including vampire bat CoV sequences (blue), Neotropical bat RdRp sequences (pink), non-Neotropical bat RdRp sequences (green) and RdRp sequences from CoVs infecting other species (black). The scale bar represents the mean expected rate of substitutions per site.

Serratus is an exciting tool that may facilitate large scale data mining for viral sequences. However, the case study selected illustrates not just the power of the software but also the challenges of using unfamiliar data without contacting the teams who generated them. We appreciate that the result was posted on virological.org so that this discussion could occur at very early stages. Nonetheless, we encourage users of this software to contact those who generated the data prior to making public announcements. This practice would be advantageous for both those who produce and those who re-analyze data in at least 4 ways: (1) it would reduce time spent duplicating findings, (2) it would reduce errors or overstatements associated with details and limitations of sequences which may only be known to data generators, (3) it would reduce the risk of undercutting the novelty of more comprehensive studies, including unpublished PhD chapters, and (4) it would avoid disincentivizing public releases of short read data for fear of lost novelty.

Of course, there is a rich tradition of using public viral sequence data in evolutionary virology, so there must be rational limits. In hypothesis-driven comparative analyses involving hundreds or thousands of viruses, for example, it may not be practical to contact all authors of putatively novel viruses. However, if the sole focus is description of a novel virus sequence, we encourage analysts to contact those who produced the data. This in no way reduces the utility of Serratus, but merely serves as a note of caution about data use and hopefully will encourage fruitful discussion among those in the virology community who generate and re-analyze data.

Daniel Streicker, Laura Bergner, Richard Orton

Thanks for promoting a thoughtful discussion. It’s unfortunate that we missed these papers, but that does stem from these sequences not being deposited on GenBank. There are thousands of sequencing libraries that are CoV+ and hundreds of thousands with other viruses; it is not practical to inquire about each of these before we release these data. DesRot/Peru/AMA_L_F is a proof of concept for Serratus: sequences unlike anything in public databases can be discovered.

The primary purpose of this post is to start exactly this discussion and find out where we are wrong and how to improve by reaching out to expert virologists. Please earnestly consider joining us as a collaborator to help interpret the trove of CoV data in the public archive.

This project is rapidly developing and we do so with complete transparency. Please do consider depositing all the CoV sequences in your lab to GenBank; this in no way detracts from the value of your work. The goal of this project is to ensure all coronavirus sequences are available immediately and without restriction. As we are in the midst of a coronavirus pandemic, I think this is a time when we as a scientific community should come together and achieve something greater than the sum of each individual.

Thanks for your response and for updating the original post. The genome sequences (coronaviruses and otherwise) from Peruvian bats will be uploaded to GenBank as soon as their associated descriptions are available. We opted not to release the assembled genome sequences earlier since we thought that descriptions of the genomes merited more discussion than was feasible in papers about ecological drivers of viral richness.

For avoidance of confusion, we did not suggest that users of Serratus contact the teams who generated the data in all cases. We suggested only that, where a virus is highlighted as a key example or of special importance, as in your case study, contacting those who generated the data would be mutually beneficial. It would avoid claims of novelty which may not entirely hold up (whether a virus which has been described in the literature, but is not yet on GenBank, is ‘novel’ is probably a philosophical question for another day!) and would help promote a culture of trust and proper attribution in data sharing.

I look forward to seeing what else turns up in the public, under-analyzed data. It is really an exciting initiative. Nonetheless, I thought it was worthwhile to raise awareness that there may be some sensitivity around use of unpublished data which could be easily managed with a small amount of communication.