Global framework for SARS-CoV-2 data analysis: Application to intrahost variation | Part 1

Global framework for SARS-CoV-2 data analysis: Application to intrahost variation | Part 1

Wolfgang Maier, Marius Van Den Beek, Björn Grüning, Sergei Kosakovsky Pond, Anton Nekrutenko, and the Galaxy Team in US, EU, and AU.

Correspondence should be addressed to SKP and AN.

In an age of digital connectedness, open, accessible, globally shared data and analysis platforms have the potential to transform the way biomedical research is done, opening the way to ‘global research markets’, where competition arises from deriving understanding rather than access to samples and data. The fields of astronomy and high energy physics have demonstrated the benefits of global data generation and sharing. We have the opportunity to mirror their successes in infrastructure funding by showing how biological research can embrace the same global perspective on common infrastructure investment and data sharing. (A. Lonie from Baker et al. 2020).


Already, there are over 250,000 SARS-CoV-2 short read sequencing datasets in NCBI Short Read Archive, and the rate of deposition is not slowing down. “Big data” settings require “big data” solutions – something that human genomics had to deal with from the outset, and pathogen genomics is facing at the moment. We describe a fully open end-to-end analytic framework for standardized reproducible high-throughput analysis of these data on public computing infrastructure. Using high quality datasets from two studies, we describe patterns of variation detectable in SARS-COV-2 intrahost data and
analyze them in the context of N501Y lineages and sites under selection. In particular, we identify a subset of variants present in the N501Y lineages that were detectable at low frequencies in individual hosts prior to the emergence of these lineages. Our results suggest that intrahost dynamics, which did not receive significant attention during this pandemic, should be an integral part of any serious pathogen surveillance effort.

This document is a part of an ongoing effort. We will be posting updates as it progresses.


Science has been put on a war footing in response to the Covid-19 pandemic. There are now over 100,000 publications and preprints (>500/day) on SARS-CoV-2 and COVID in PubMed. While the sheer scope of this effort is remarkable, it also exposes two categories of challenges in how the biomedical research community is responding to a life-changing global health crisis.

Technical challenges

  • Data deluge - the COVID-19 pandemic is the first global health crisis coinciding with the wide accessibility of next generation sequencing (NGS). As a result, the volume of data is enormous with over 250,000 sequencing datasets and over 575,000 complete genomes (as of February 2021). There are dramatic differences in data quality and, importantly, metadata—essential information that distinguishes useful data from noise.
  • Heterogeneity of analysis practices - the majority of SARS-CoV-2-related publications use ad hoc approaches that vary dramatically from study to study. COG-UK is a notable exception: a unified effort on developing common experimental and analytical protocols for amplicon-based sequencing. No comparable endeavor exists for other types of data.
  • Deepening technological divide - while open source tools dominate the analysis of SARS-CoV-2, researchers from the developing world have difficulty leveraging them due to lack of robust computational infrastructure—the best software is useless if one has no resources to run it on.

Scientific challenges

  • Lack of primary data - the number of complete SARS-CoV-2 genomes is much higher than the number of publicly available raw read datasets. As a result it is often impossible to verify accuracy of assemblies containing unexpected features such as substitutions, indels, and rearrangements.
  • Paucity of intrahost data - the vast majority of current studies examine SARS-CoV-2 evolutionary dynamics at the level of complete genomes. Much remains to be learned about variation within individuals and its temporal dynamics.
  • The need for data integration - there is an increasing body of work on variation, selection, transcriptome architecture and other aspects of SARS-CoV-2 biology. For maximum impact, these disparate types of data must be integrated so, for example, one can check that intrahost variants may correspond to sites under selection identified from complete genomes or whether particular synonymous substitutions may affect the production of specific sgRNAs, etc.

In this report we detail how we are attempting to address these issues by building a free globally distributed SARS-CoV-2 analysis network based on the Galaxy and Datamonkey data analysis platforms. We address the technical challenges by demonstrating how publicly accessible analysis workflows can be applied to large collections of SARS-CoV-2 data without the need to procure or maintain computational infrastructure.

Next we address the biological challenges by analyzing a subset of large raw read SARS-CoV-2 datasets derived from RNAseq and Ampliconic experiments. We demonstrate that a number of recently emerged substitutions observed in the “N501Y” lineages from the UK, Brazil, and South Africa data, have been detectable at low frequencies within individual hosts for some time.

Our work establishes a global genomic data analysis portal for monitoring global health emergencies such as the current COVID-19 crises and any future events of similar magnitude.

Results and Discussion

Goals of this study

  1. Provide initial insight into intrahost variability of the SARS-CoV-2 genome using publicly available datasets.
  2. Compare intrahost variants against known variants of concern (VOC) as well as sites under selection.
  3. Explore co-variation patterns among intrahost variants.
  4. Perform all analyses on public, free computational infrastructure and make all analysis artifacts such as workflows, notebooks, as well as pre- and post-processed datasets available in an immediately accessible form. We do not merely describe results. Instead we provide “live code” enabling anyone to repeat our analyses.

Global infrastructure for SARS-CoV-2 analysis

We are the developers of two widely used data analysis frameworks: Galaxy and HyPhy/Datamonkey. Together we provide free powerful infrastructure for analysis of data to anyone in a world with an Internet connection. In 2020 we joined forces in tuning our tools to the needs of SARS-CoV-2 research.


Galaxy is a general web-based data analysis framework for analysis of large biological (as well as climate, ecology, natural language processing and others) datasets. This framework is used to deploy global data analysis instances collectively known as Usegalaxy.* Consortium. The “big three” Galaxy instances in this coinsortim are in the US (, Europe (, and Australia ( The tool sets, workflows, and reference data are synchronized across major instances.

Anyone in the world can freely use Galaxy. Every user receives 250 Gb allocation at signup. Every instance has mechanisms for accommodating users who require bigger allocations. Every major Galaxy instance is backed by robust national high performance computing (HPC) infrastructure. In the US is based at the Texas Advanced Computer Center (TACC). It utilizes a variety of XSEDE resources such as JetStream cloud and Bridges shared memory system. European and Australian instances are supported by massive computational resources available through deNBI, Elixir and Nectar Cloud.

Galaxy is a major part of biological tool ecosystem. Almost 8,000 analysis tools have been incorporated into Galaxy. This is achieved through close partnership with BioConda and Biocontainers communities. Workflows assembled from these tools are curated and supported through a variety of mechanisms including DockStore and Elixir workflow hub.


HyPhy (Hypothesis Testing using Phylogenies) is an open-source software package for the analysis of genetic sequences (in particular the inference of natural selection) using techniques in phylogenetics, molecular evolution, and machine learning. It features a rich scripting language for limitless customization of analyses. Additionally, HyPhy features support for parallel computing environments (via message passing interface). HyPhy has over 10,000 registered users and has been cited in over 4,500 peer-reviewed publications (Google Scholar).

Complete tool stack

Everything discussed in this piece is fully open and reproducible. The analysis was performed in several stages:

  1. Galaxy – we analyzed primary data (raw sequencing reads) from several studies using Galaxy platform. The final result of these analyses was a list of variants. The workflows used for this analysis are available as described here. They can be readily (now) used to repeat our analyses, or be applied to new data. We support both Illumina and ONT data derived from Ampliconic and RNAseq library preps.
  2. Jupyter – data tables from the previous step were analyzed in either Jupyter (AN) or ObservableHQ (SKP). Jupyter is directly integrated with Galaxy. The resulting notebooks are distributed via GibHub and can be instantiated on Galaxy or Google Colaboratory. Graphs generated in Jupyter notebooks are available for interactive data exploration via DataPane. Clicking the :bar_chart: symbol adjacent to a graph will open a new browser pane with an intercative version of the graph. Clicking the :spiral_note_pad: symbol will open a data table.
  3. ObservableHQ – data tables from step 1 were also analyzed using ObservableHQ–a serverless JavaScript-based notebook environment. A collection of these fully interactive notebooks is available here.

Finding the data

Notebook for this section is available here :notebook:. Gzipped SRA RunInfo dump for January 20, 2021 can be downloaded here. This file was used to generate the tables and heatmap below.

As of late January 2021 the sequence read archive (SRA) at the US National Center for Biotechnology Information (NCBI) contained 190,288 raw read datasets summarized in Fig. 1:

Figure 1. :bar_chart: Counts of SRA datasets stratified by platform (Y-axis) and library strategy (X-axis).

While SRA metadata are not perfect (e.g., term WGA, whole genome amplification, is likely equivalent to Amplicon) the table shows three primary types of data: (1) Illumina-based Ampliconic, (2) Oxford nanopore (ONT)-based Ampliconic, and (3) Illumina-based RNASeq. Of these three types of studies, Illumina-based RNASeq is the most suitable for accurate assessment of the intra-host variability. This approach avoids amplification biases characteristic of PCR-based enrichment approaches such as PrimalSeq—a primary methodology used to generate Illumina- and ONT-based ampliconic datasets. However, because one of the key objectives of this study is to provide freely accessible workflows for the analysis of SARS-CoV-2 variation data we developed procedures for leveraging ampliconic data as well. Thus we describe two distinct analytical strategies: for Illumina-based RNASeq data and Amplicon data. We have also developed workflows for the analysis of ONT data. However, these will be described in a separate report.

Next, we identified the largest studies—containing the largest count of individual read datasets—for each of the two (Illumina RNA-seq and Illumina ARTIC) experimental approaches:

Table 1. Top 10 SRA BioProjects for three experimental strategies. BioProject IDs with a hyperlink correspond to a study chosen for subsequent analysis (Illumina RNAseq = Illumina Paired RNAseq data; Illumina Amp = Illumina Paired Artic data; N = number of SRA datasets in a given BioProject)

Illumina RNAseq N Illumina Amp N
PRJNA622837 1,564 PRJEB37886 104,984
PRJNA612578 964 PRJNA613958 14,860
PRJNA650245 617 PRJNA614995 3,967
PRJNA610428 42 PRJNA645906 2,286
PRJEB38546 26 PRJNA639066 1,931
PRJNA634356 25 PRJNA625551 1,163
PRJNA650134 22 PRJNA656534 567
PRJNA661544 15 PRJNA686984 543
PRJNA638211 10 PRJEB38723 542
PRJNA605983 9 PRJEB42024 539

From each set we selected a single study that has been published. This is because metadata for SRA datasets is generally of low quality and having a publication allows for much better understanding of the corresponding data (e.g., which ARTIC primer set version was used; how RNA was isolated and so on).

The entire PRJNA622837 has been conducted prior to emergence of N501Y lineages. The PRJEB37886 contains samples isolated before and after N501Y emergence. Thus we separated this dataset into two parts. In the end we analyzed the following datasets:

Dataset nickname Origin
"Boston" Entire PRJNA622837
"COG-Pre" a pre N501Y portion of PRJEB37886 (ERR4603708 - ERR4604210)
"COG-Post" a post N501Y portion of PRJEB37886 (ERR4859723 - ERR4861540)

These nicknames are used in the remainder of this document to refer to these datasets.

Analysis workflows

We developed four analysis workflows to support identification of variants from all relevant input data sources (Table 2). They are available on all major worldwide Galaxy instances as well as from DockStore and WorkflowHub.

Table 2. Workflow description. Clicking EU, EU, or AU buttons will take you to workflow at US (, European (, or Australian ( instance.

Workflow Input data Read aligner Variant caller
Illumina RNAseq SE
Single end data derived from RNAseq experiments bowtie2 lofreq
Illumina RNAseq PE
Paired end data derived from RNAseq experiments bwa-mem lofreq
Illumina ARTIC
Paired-end data generated with ARTIC protocols bwa-mem lofreq
ONT fastq files generated with ARTIC protocols minimap2 medaka
Output of any of the above workflows - -

The two Illumina RNASeq workflows (for paired-end and single-end sequenced data) perform read mapping with bwa-mem and bowtie2, respectively, followed by sensitive variant calling across a wide range of AFs with lofreq (which we previously identified as the most appropriate tool for this task).

The workflow for Illumina-sequenced ARTIC data builds on the RNASeq workflow for paired-end data using the same steps for mapping and variant calling, but adds extra logic for trimming ARTIC primer sequences off reads with the ivar package. In addition, this workflow uses ivar also to identify amplicons affected by ARTIC primer-binding site mutations and excludes reads derived from such “tainted” amplicons when calculating allele-frequencies of other variants.

The workflow for ONT-sequenced ARTIC data is modeled after the alignment/variant-calling steps of the ARTIC pipeline. It performs, essentially, the same steps as that pipeline’s minion command, i.e. read mapping with minimap2 and variant calling with medaka. Like the Illumina ARTIC workflow it uses ivar for primer trimming. Since ONT-sequenced reads have a much higher error rate than Illumina-sequenced reads and are therefor plagued more by false-positive variant calls, this workflow does make no attempt to handle amplicons affected by potential primer-binding site mutations.

All four workflows use SnpEff, specifically its 4.5covid19 version, for variant annotation.

The fifth, Reporting, workflow takes table of variants produced by any of the other four workflows and generates a list of variants by Samples and by Variant. For an example see here.

The five workflows are designed to work in the following order:

Figure 2. Relationship between analysis workflows and datatypes.

Availability of variant calls

The result of an intra-host variation analysis is a list of called differences from the reference SARS-CoV-2 genome, their estimated allele frequencies, technical metrics reported by variant callers, and functional annotation. The list of variants is a large and complex data set in its own right.

The datasets listed in the table below are produced directly by our Galaxy workflows using only publically available reference data and raw Illumina reads downloaded from NCBI SRA as the input. Specifically, the Reporting Workflow generates two types of reports for each dataset:

  • Report by Samples - a list of variants for each sampl (SRA accession);
  • Report by Variants - a derivative of the above data grouped by variant position and substitution type.

Table 3. Links to variant calls. By Sample = variants are grouped by SRA accession. By Variant = variants are grouped by genomic coordinate and substitution type. See this link for a description of the structure of each type of report.

Dataset Link
“Boston” data By Sample By Variant
COG-UK Pre N501Y By Sample By Variant
COG-UK Post N501Y By Sample By Variant

On top of these tables, we provide raw data and exploration visualization tools for the interested reader, qualitative descriptions of the type of variation found, and in-depth discussion of several key variants and sets of variants.