Serratus: The ultra-deep search to discover novel coronaviruses

ababaian · June 16, 2020, 3:02am

Despite intense efforts to sequence and analyze SARS-CoV-2 isolates, our understanding of the virus’s provenance is limited by incomplete genomic characterization of the Coronaviridae (CoV) family.

Serratus (https://github.com/ababaian/serratus) is an Open Science project for discovery of new virus sequences on an unprecedented scale. Serratus can search well over a million sequencing libraries per week for known and novel viruses, including RNA-seq, meta-genomic, meta-transcriptomic and environmental NGS datasets.

Here we report results from a preliminary survey of 1.14 million sequence libraries (26.78 petabases) from the NCBI Sequence Read Archive (SRA). We have uncovered previously unreported CoV species and identified thousands of CoV-positive libraries.

To facilitate rapid analysis of this data we are developing an R package, Tantalus (https://github.com/serratus-bio/tantalus), to interface with Serratus data. This project is under active development and we are seeking to establish immediate collaborations for analysis over the next several weeks.

Uncovering novel CoV species

As one example, in a Peruvian vampire bat (Desmodus rotundus) we identified a putative Alphacoronavirus via a partial sequence match at 93.4% identity to the RdRP of Bat coronavirus Trinidad/1CO7BA/2007 (GenBank accession: EU769558).

Assembly of this sequencing library with coronaSPAdes yielded a complete 29,264 nt viral genome. This is a new species of coronavirus based on RdRP, nucleoprotein, membrane protein and replicase 1a, classified as a putative Alphacoronavirus falling outside all named sub-genera available in public nucleotide databases.

Alphacoronavirus identification case study

The author of the dataset had previously reported this virus as “DesRot/Peru/AMA_L_F” (Bergner et al. 2019; see below), but its sequence is unpublished and was not in our query.

This is a clear proof of concept that novel CoVs exist within the Serratus data. There are likely scores more of these cases, which we are actively working to uncover. Novel viruses from other families can also be identified with this workflow.

Join the Serratus Collaboration

Our primary objective is to accelerate global coronavirus research and assist in diagnostic and vaccine development with rich evolutionary CoV sequence data. All raw and processed Serratus data is freely and immediately available including a per-accession virus report of all vertebrate virus matches.

We are actively looking for collaborators for the intensive analysis of this data over the next four weeks. Expertise sought includes but is not limited to:

Computational virology
Phylogenetics and tree building
Viral ecology and zoonosis modeling
Database and web-interface development
R package development
AWS cloud computing

Researchers with access to large amounts of sequencing data from bats, wild rodents, or any sample taken from an animal with respiratory/GI disease are also sought. We are offering to generate a no-cost virus-report within 24 hours on up to 250,000 libraries. We only ask that adequate meta-data is provided and that, if samples are CoV+, the reports and CoV viral assemblies be shared immediately and without restriction.

Computational architecture

In February 2020, AWS mirrored the NCBI Sequence Read Archive (SRA) onto their S3 servers as an Open Data-set, which allows for an unprecedented rate of access to raw sequencing data.

To perform the ultra-high throughput CoV search, we employed AWS cloud HPC with a 22,500 vCPU cluster (1460 x R5.xlarge, 4120 x C5.xlarge, and 90 x C5.large EC2 instances). Using this hyper-parallelized architecture we could bypass conventional networking and disk IO limitations to achieve a processing rate exceeding 500,000 sequencing libraries per day at a cost of ~$0.01 per library.
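As a quick sanity check of the figures above, here is a minimal Python sketch of the cluster arithmetic. The vCPU counts per instance type (4 for R5.xlarge and C5.xlarge, 2 for C5.large) are taken from AWS's published instance specifications rather than from this post, and the function names are illustrative only.

```python
# vCPUs per EC2 instance type (assumption: AWS's published specifications).
VCPUS_PER_INSTANCE = {"r5.xlarge": 4, "c5.xlarge": 4, "c5.large": 2}

# Instance counts as stated in the post.
FLEET = {"r5.xlarge": 1460, "c5.xlarge": 4120, "c5.large": 90}


def total_vcpus(fleet, vcpus_per_instance):
    """Sum vCPUs over every instance in the fleet."""
    return sum(count * vcpus_per_instance[name] for name, count in fleet.items())


def daily_cost(libraries_per_day, cost_per_library):
    """Approximate daily spend at a given throughput and per-library cost."""
    return libraries_per_day * cost_per_library


if __name__ == "__main__":
    # 1460*4 + 4120*4 + 90*2 = 5840 + 16480 + 180
    print(total_vcpus(FLEET, VCPUS_PER_INSTANCE))
    # Lower-bound throughput and approximate cost quoted above.
    print(daily_cost(500_000, 0.01))
```

Under these assumptions the fleet sums to the stated 22,500 vCPUs, and 500,000 libraries per day at ~$0.01 each works out to roughly $5,000 per day of compute.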

Our viral search query is a CoV pan-genome composed of all CoV GenBank sequences clustered at 99% identity and all non-retroviral “representative virus sequences” in RefSeq. We employ bowtie2 as the aligner, which can detect short-read sequences at up to 20% nucleotide divergence. These alignment files are then summarized into a report file for import into R and downstream processing.

Conclusion

Serratus is an open science project; we are actively seeking to establish collaborations starting immediately to translate these data into a meaningful community resource in the fight against COVID-19. By expanding the known repertoire of coronaviruses together, we can not only help determine the origins of this pandemic, but help prevent another one.

daniel.streicker · June 16, 2020, 11:18am

Through mining NextSeq500 data made publicly available from our laboratory on European Nucleotide Archive (PRJEB28138), the authors of Serratus describe a putatively novel coronavirus species in common vampire bats which they designate “Fr4NK.” We wish to point out that the presence of novel alphacoronaviruses in this species (and indeed in this exact dataset) was already known and reported twice (see Bergner et al. 2019, Bergner et al. 2020). In particular, Bergner et al. 2020 states: “We also detected full genomes of novel viruses in genera capable of infecting humans such as Alphacoronavirus and Rotavirus…”. A formal description of this putatively novel species is in a thesis chapter (now publicly available here) and a manuscript which describes multiple novel viral genomes from different parts of Peru along with additional ecological, evolutionary, and geographic analyses is in final stages for submission. Our earlier studies referred to “Fr4NK” as “DesRot/Peru/AMA_L_F”. Our phylogenetic analysis of the RdRp gene is below, showing that the vampire bat-associated alphacoronaviruses are nested within a larger clade of neotropical bat viruses from the bat family Phyllostomidae (Figure).

Figure. Coronaviridae RdRp phylogeny. Maximum likelihood tree based on a 272 amino acid alignment of 52 RdRp sequences including vampire bat CoV sequences (blue), Neotropical bat RdRp sequences (pink), non-Neotropical bat RdRp sequences (green) and RdRp sequences from CoVs infecting other species (black). The scale bar represents the mean expected rate of substitutions per site.

Serratus is an exciting tool that may facilitate large scale data mining for viral sequences. However, the case study selected illustrates not just the power of the software but also the challenges of using unfamiliar data without contacting the teams who generated them. We appreciate that the result was posted on virological.org so that this discussion could occur at very early stages. Nonetheless, we encourage users of this software to contact those who generated the data prior to making public announcements. This practice would be advantageous for both those who produce and those who re-analyze data in at least 4 ways: (1) it would reduce time spent duplicating findings, (2) it would reduce errors or overstatements associated with details and limitations of sequences which may only be known to data generators, (3) it would reduce the risk of undercutting the novelty of more comprehensive studies, including unpublished PhD chapters, and (4) it would avoid disincentivizing public releases of short read data for fear of lost novelty.

Of course, there is a rich tradition of using public viral sequence data in evolutionary virology so there must be rational limits. For example, in hypothesis-driven comparative analyses involving hundreds or thousands of viruses, it may not be practical to contact all authors of putatively novel viruses. However, if the sole focus is description of a novel virus sequence, we encourage analysts to contact those who produced the data. This in no way reduces the utility of Serratus, but merely serves as a note of caution about data use and hopefully will encourage fruitful discussion among those in the virology community who generate and re-analyze data.

Daniel Streicker, Laura Bergner, Richard Orton

ababaian · June 16, 2020, 3:24pm

Thanks for promoting a thoughtful discussion. It’s unfortunate that we missed these papers, but this does stem from these sequences not being deposited on GenBank. There are thousands of sequencing libraries that are CoV+ and hundreds of thousands with other viruses, so it is not practical to inquire about each of these before we release these data. DesRot/Peru/AMA_L_F is a proof of concept for Serratus: sequences unlike anything in public databases can be discovered.

The primary purpose of this post is to start exactly this discussion and find out where we are wrong and how to improve by reaching out to expert virologists. Please earnestly consider joining us as a collaborator to help interpret the trove of CoV data in the public archive.

This project is rapidly developing and we do so with complete transparency. Please do consider depositing all the CoV sequences in your lab to GenBank; this in no way detracts from the value of your work. The goal of this project is to ensure all coronavirus sequences are available immediately and without restriction. As we are in the midst of a coronavirus pandemic, I think this is a time where we as a scientific community should come together and achieve something greater than the sum of each individual.

daniel.streicker · June 17, 2020, 10:55am

Thanks for your response and for updating the original post. The genome sequences (coronaviruses and otherwise) from Peruvian bats will be uploaded to GenBank as soon as their associated descriptions are available. We opted not to release the assembled genome sequences earlier since we thought that descriptions of the genomes merited more discussion than was feasible in papers about ecological drivers of viral richness.

For avoidance of confusion, we did not suggest users of Serratus contact the teams who generated the data in all cases. We only suggested that in instances where a virus is highlighted as a key example or of special importance, such as your case study, contacting those who generated the data would be mutually beneficial: it would avoid claims of novelty which may not entirely hold up (whether a virus which has been described in the literature, but is not yet on GenBank, is ‘novel’ is probably a philosophical question for another day!) and would help promote a culture of trust and proper attribution in data sharing.

I look forward to seeing what else turns up in the public, under-analyzed data. It is really an exciting initiative. Nonetheless, I thought it was worthwhile to raise awareness that there may be some sensitivity around use of unpublished data which could be easily managed with a small amount of communication.

ababaian · August 11, 2020, 11:56am

Our preprint, “Petabase-scale sequence alignment catalyses viral discovery” is now online.

We’ve aligned 3.84 million libraries to all reference vertebrate viruses (except retroviruses) and created an explorable website (https://serratus.io) of the data.

The site is under active development and we’re especially looking for feedback to improve the web interface. How can we make this data more ‘explorable’ and help you find the virus/datasets you’re looking for? For programmatic access the R/postgreSQL package, “Tantalus” is the current best option.

Once again all our data is public/cc0 so if you think any of this work can help your ongoing CoV research and you can’t find the data you’d like on your own, don’t be shy to reach out.

david_h_oconnor · August 15, 2020, 7:25pm

This is really cool! Since you asked for suggestions on making the data more explorable via the web interface, can I suggest:

When viewing all the SRA runs with matching reads, add a ‘download summary’ button that allows you to download a list of all matching SRA accessions.
The first page of search results doesn’t indicate the size of the result set. It simply says ‘Page 1 out of …’
It might not be feasible, but if the number of bowtie2 mapped reads is calculated for each dataset, having a way to download those reads specifically through sratools for additional analysis would be incredibly useful.

Thanks!

dave

ababaian · August 17, 2020, 4:59pm

Thanks for the feedback Dave!

I’m happy to let you know that you can already download the bam-alignment files for each one of the 3.84 million SRA files we’ve analyzed using the “Download .bam” buttons on the viral report page.

One note though: the bam files are not sorted and not indexed, so you need to run samtools sort <in.bam> and then samtools index <in.sort.bam> before they can be loaded into IGV. If you simply want the reads you can access them via samtools view <in.bam>. To view the alignments in IGV you’ll also need the “Pangenome” file which is cov3ma for these files and can be downloaded here. You can find every virus included in the reference search as Supplementary Table 2 in the preprint. These are the nucleotide-level alignments for now; we’re working on expanding the front-end for the protein data too, which should be up in a few weeks.
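For anyone scripting this step, the sort-then-index sequence could be wrapped roughly as follows. This is only a sketch: it assumes samtools is installed and on your PATH, the helper names and the example file name are hypothetical, and the .sort.bam output naming simply mirrors the <in.sort.bam> placeholder above.

```python
import subprocess
from pathlib import Path


def igv_prep_commands(bam_path):
    """Build the samtools commands that sort and then index a BAM file.

    The output name (<stem>.sort.bam) is an illustrative convention only.
    """
    bam = Path(bam_path)
    sorted_bam = bam.with_name(bam.stem + ".sort.bam")
    return [
        # Coordinate-sort the alignments into a new file.
        ["samtools", "sort", "-o", str(sorted_bam), str(bam)],
        # Index the sorted file so IGV can seek within it.
        ["samtools", "index", str(sorted_bam)],
    ]


def prepare_for_igv(bam_path):
    """Run the commands in order; raises CalledProcessError on failure."""
    for cmd in igv_prep_commands(bam_path):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file name; replace with a downloaded Serratus .bam file.
    for cmd in igv_prep_commands("example.bam"):
        print(" ".join(cmd))
```

Building the command lists separately from running them makes it easy to print and inspect what will be executed before calling prepare_for_igv on a real file.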

I’m working on a Tutorial to help explain a lot of these little details, and hopefully answer some common questions. For now, so as not to fill up Virological posts with debugging, may I suggest that further suggestions be posted to the Serratus.io Github Issues Page so we can have more of a back-and-forth and help get you set up.

The list of summary files and page debug is a great idea, Victor and Dan are on it!

ababaian · October 11, 2021, 9:10am

Serratus Project Updates

Ribovirus classification by a polymerase barcode sequence

Palmprint preprint

RNA viruses encoding a polymerase gene (riboviruses) dominate the known eukaryotic virome. Next-generation sequencing is revealing a wealth of new riboviruses with uncharacterised phenotypes, precluding classification by traditional taxonomic methods. These are often classified on the basis of polymerase sequence identity, but standardised methods to support this approach are currently lacking. To address this need, we describe the polymerase palmprint, a well-defined segment of the palm sub-domain delineated by well-conserved catalytic motifs. We present a novel algorithm, palmscan, which identifies palmprints in nucleotide and amino acid sequences. We describe PALMdb, a reference database of palmprints derived from public sequence databases.

Using palmprints, we identify and classify GenBank (v241) RdRP sequences into 15,016 distinct species-like Operational Taxonomic Units (sOTU). An sOTU is defined by clustering GenBank palmprints (the most conserved core of RdRP) at 90% amino acid identity. Subsequently, novel RNA virus species are defined by <90% palmprint identity to any known sOTU.
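To make the 90% rule concrete, here is a toy Python sketch of the novelty test. It is not palmscan or the actual Serratus pipeline: it assumes the palmprints have already been aligned to equal length (gaps written as "-"), computes identity over ungapped columns only, and uses made-up ten-residue sequences and hypothetical function names.

```python
def percent_identity(a, b):
    """Percent amino acid identity between two pre-aligned sequences.

    Assumption: both strings come from the same alignment, so they have
    equal length; columns with a gap ("-") in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def is_novel(query, known_sotus, threshold=90.0):
    """True if the query is below the threshold against every known sOTU."""
    return all(percent_identity(query, ref) < threshold for ref in known_sotus)


if __name__ == "__main__":
    known = ["ACDEFGHIKL"]            # toy reference "palmprint"
    one_mismatch = "ACDEFGHIKM"       # 9 of 10 columns identical
    two_mismatches = "ACDEFGHIMM"     # 8 of 10 columns identical
    print(is_novel(one_mismatch, known))
    print(is_novel(two_mismatches, known))
```

In this toy case the one-mismatch query sits at exactly 90% identity, so it is not below the threshold and would be assigned to the existing sOTU, while the two-mismatch query falls below 90% and would count as novel.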

Figure 1: The RdRP palmprint is the protein sequence spanning three well-conserved sequence motifs (A, B, and C), including intervening variable regions, exemplified within full-length poliovirus RdRP structure with essential aspartic acid residues(*) (pdb: 1RA6). Conservation was calculated from RdRP alignment in [Wolf et al., 2018], trimmed to the poliovirus sequence; motif sequence logos are shown below.

Massive Scale RdRP Search

Updated Serratus preprint

Earlier this year we had a major update of the Serratus databases. We performed a search of viral RNA-dependent RNA polymerases (RdRP) against 5.7 million sequencing libraries, or 10.2 petabases. Using the Serratus cloud-computing architecture, this search was completed in 11 days at a cost of ~$25,000 in credits.
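For a rough sense of scale, the figures in this paragraph can be turned into per-day and per-library rates. The sketch below is back-of-the-envelope arithmetic on the rounded numbers quoted here, not a measured benchmark, and the function name is illustrative.

```python
def search_rates(libraries, petabases, days, cost_usd):
    """Derive average rates from the totals quoted for the RdRP search."""
    return {
        "libraries_per_day": libraries / days,
        "cost_per_library_usd": cost_usd / libraries,
        "bases_per_library": petabases * 1e15 / libraries,
    }


if __name__ == "__main__":
    rates = search_rates(
        libraries=5_700_000, petabases=10.2, days=11, cost_usd=25_000
    )
    for name, value in rates.items():
        print(f"{name}: {value:,.4f}")
```

With these inputs the averages come out to a little over half a million libraries per day and a bit under half a cent per library.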

We report 883,502 distinct RdRP sequences, which includes over 130,000 novel sOTU, an increase of total known RNA viruses by a factor of ~9.8. Amongst these sequences, we validate 9 Coronavirus sOTU, hundreds of novel delta-like viruses and hundreds of huge bacteriophages, details of which can be found in the preprint.

All of this data is fully-explorable at Serratus.io, and raw data (including aligned reads) can be accessed here.

Figure 2: Per-phylum histogram of amino acid identity of novel species-like operational taxonomic units (sOTUs) aligned to the NCBI non-redundant protein database. Inset: Preston plot of palmprint abundances indicates that singleton palmprints (i.e., observed in exactly one run) occur within 95% confidence intervals of the value predicted by extrapolation from high-abundance palmprints (linear regression applied to log-transformed data).

palmID, A sequence-based Serratus Interface

palmID Web Interface
palmID Container + Source

That was a lengthy pre-amble. Most recently we have developed a sequence-oriented interface for the Serratus data / SRA tentatively called palmID.

In brief, the user provides an RdRP sequence, from which the ‘palmprint’ catalytic-core is extracted and cross-referenced to the Serratus sOTU database. palmID then retrieves proximal virus matches at species-, genus-, family-, and phylum-like levels and their taxonomic information, if available. The set of input-matching viruses are then cross-referenced against the 5.7 million sequencing libraries we processed to extract additional meta-data from the sequencing libraries in which the virus was found. This allows for a user to instantly cross-reference taxonomic, geospatial, temporal, and associated organisms for known and novel viruses in the SRA. This tool is currently in beta, and is limited to non-permuted and monomeric RdRPs.

The goal of palmID is to provide a “data-driven” annotation for uncharacterized viral sequences. At the same time it should present data anomalies in known viruses, such as the presence of the virus in an unexpected organism. palmID is still in the early beta-stages, but I would be interested in soliciting feedback from the community on features, data, or documentation which could improve this tool.

As always, Serratus is a 100% open-science project. All data is released into cc0 immediately and free to use. If you’re interested in helping out or would like some specific help, do not hesitate to reach out.