Early analysis of 1,093 SARS-CoV2 intra-host variation datasets and comparison with sites under selection

Marius Van Den Beek, Wolfgang Maier, Sergei Kosakowsky Pond, Anton Nekrutenko + DataMonkey and GalaxyProject Teams | galaxyproject.org


This is a part of the join effort between Datamonkey and GalaxyProject teams to provide free, reproducible, and accessible set of workflows for the analysis of SARS-CoV2 data.


This is a preliminary analysis of SARS-CoV2 intra-host variation. It includes SNPs only (no indels at this point) and Illumina data only. This post describes analysis of data publicly available from SRA and ENA between the beginning of the outbreak and May 17, 2020.

Outline

  • Dataset overview
  • Types of changes
  • What’s up with stops?
  • Filtering strategy
  • Analysis of RNA editing
    • APOBEC
    • ADAR
  • A closer look at variants
  • Comparison with sites under selection

Dataset overview


How we call variants

In this study we use lofreq for calling variants. This choice is based on banchmarking described here

Sites, samples, and studies

There are still very few raw read darasets for SARS-CoV2

Despite the fact that the number of complete genomes in places like GISAID has already surpassed 30,000, the number of raw datasets remains relatively modest by comparison:
[‘RNA-Seq’, ‘AMPLICON’, ‘Targeted-Capture’, ‘WGS’, ‘OTHER’]

Feature Count Description
Total sites 11,059 Total number of SNPs across all samples. Since many sites are presented in multiple samples they are counted multiple times here
Distinct sites 4,899 Distinct variable positions across SARS-CoV2 genome
Distinct samples 1,093 Distinct SRA or ENA accessions in this dataset
Distinct studies 28 SRA bundles samples into studies
Distinct geographic locations 24 ‘Germany’, ‘China:Shanghai’, None, ‘USA:Virginia’,‘USA: Hamilton, MT’, ‘USA’, ‘USA:Minnesota’, ‘USA: Washington’, ‘Peru: Lima’, ‘USA: New York’, ‘Germany: Dusseldorf’, ‘Bangladesh’, ‘USA:Los Angeles’, ‘China: Hubei’, ‘China: Wuhan’, ‘Malaysia’, ‘India: Rajkot’, ‘Israel’, ‘Germany: Heinsberg’, ‘USA: WA’,‘USA: Seattle, WA’, ‘Cambodia:Sihanoukville’,‘Malaysia: Kuala Lumpur’, ‘Egypt’, ‘Nepal: Kathmandu’

Types of changes

At first sight most variable sites involve non-synonymous changes and have wide range of alternative allele frequencies

We define three types of sites:

  1. FIXED = alternative allele frequency (AAF) across all samples at a given site >= 95%
  2. POLYMORPHIC = AAF across all samples at a given site > 5% and < 95%
  3. MIXED = AAF across all samples at a given site > 5% (these are different from polymorphic in that some samples may have AAF > 95%).

In our data we have (here ‘.’ = non coding regions such as UTRs):

FUNCLASS N type
0 . 5 fixed
1 MISSENSE 203 fixed
2 NONSENSE 1 fixed
3 SILENT 132 fixed
4 . 74 polymorphic
5 MISSENSE 3207 polymorphic
6 NONSENSE 192 polymorphic
7 SILENT 1243 polymorphic
8 . 5 mixed
9 MISSENSE 90 mixed
10 SILENT 58 mixed

or graphically:

These changes are distributed across SARS-CoV2 genes in the following way (here rlen, Y-axis,
is the number of changes with a gene divided by its length):

Where’s the data?

What’s up with stops?


There is a number of substitution leading to stops. The most prominent site, 29,039, is dominated by a single study.

The figure above shown a substantial number of nucleotide changes creating stop codons (red bars in the above plot). Our previous analysis showed that stops need to be treated with caution (the study SRP253798 was excluded from this analysis). Let’s see which studies have stops:

study_accession Library Strategy Stops Samples
0 SRP258105 AMPLICON 73 46
1 SRP253926 AMPLICON 70 38
2 SRP251618 RNA-Seq 61 27
3 ERP121228 AMPLICON 33 22
4 SRP261417 Targeted-Capture 31 19
5 SRP256957 AMPLICON 8 4
6 SRP248092 RNA-Seq 6 3
7 SRP255151 AMPLICON 3 3
8 SRP261949 OTHER 2 1
9 SRP258006 AMPLICON 1 1
10 SRP254688 RNA-Seq 1 1
11 SRP254488 AMPLICON 1 1
12 SRP242226 RNA-Seq 1 1

and which of the stops are shared across multiple studies:

POS change Samples Studies Strategies Gene Function
0 29039 AT 29 5 3 N nucleocapsid phosphoprotein
1 10718 CT 2 2 1 orf1ab 3C-like proteinase
2 11357 AT 14 2 2 orf1ab nsp6
3 16672 AT 2 2 2 orf1ab helicase
4 17865 TA 6 2 2 orf1ab helicase
5 23511 TG 3 2 1 S surface glycoprotein
6 28209 GT 2 2 1 ORF8 ORF8 protein
7 29120 CT 2 2 1 N nucleocapsid phosphoprotein
8 29642 CT 2 2 2 ORF10 ORF10 protein
9 269 GT 1 1 1 orf1ab leader protein
10 329 CT 1 1 1 orf1ab leader protein
11 374 GT 1 1 1 orf1ab leader protein
12 404 AT 1 1 1 orf1ab leader protein
13 536 GT 1 1 1 orf1ab leader protein
14 728 GT 1 1 1 orf1ab leader protein

here is a genome view:

let’s look at position 29,039 (a subset of rows shown; see notebook for full dataset) – largest cross on the image above (below all columns are self explanatory with the exception of DP4, which contains strand-specific counts: (1) reference observations on forward, (2) reference observations on reverse, (3) alternative observations on forward, and (4) alternative observations on reverse) :

Sample DP AF SB DP4 GENE CODON study_accession library_strategy library_source library_selection
0 SRR11597150 52 0.057692 2 31,18,3,0 N Aag/Tag SRP258105 AMPLICON VIRAL RNA PCR
1 SRR11597171 62 0.096774 2 43,12,6,0 N Aag/Tag SRP258105 AMPLICON VIRAL RNA PCR
2 SRR11597195 53 0.056604 5 32,16,1,2 N Aag/Tag SRP258105 AMPLICON VIRAL RNA PCR
3 SRR11783573 2015 0.074442 146 834,1003,28,148 N Aag/Tag SRP261417 Targeted-Capture VIRAL RNA RT-PCR
4 SRR11783615 1095 0.204566 291 530,317,53,189 N Aag/Tag SRP261417 Targeted-Capture VIRAL RNA RT-PCR
5 SRR11783623 5373 0.051182 223 2603,2420,272,75 N Aag/Tag SRP261417 Targeted-Capture VIRAL RNA RT-PCR
6 SRR11771993 105 0.066667 0 50,48,4,3 N Aag/Tag SRP251618 RNA-Seq VIRAL RNA unspecified
7 SRR11772009 140 0.064286 0 64,67,4,5 N Aag/Tag SRP251618 RNA-Seq VIRAL RNA unspecified

it is interesting to note that most rows in the above table come from SRP251618 and this one study has a relatively tight distribution of alternative allele frequencies at this site:

in general, a tight distribution of allele frequencies within a single study may potentially suggest contamination (e.g., see Dickins 2014). Here is the distribution of alternative allele frequencies for all SNPs within SRP251618 where a site is shared across at least 5 samples (as you can see it is a little tight):

To gain another perspective let’s plot frequency of nucleotide changes leading to a stop versus that number of Samples, Studies, and Library preparation strategies they appear in:



as you can see the absolute majority of changes leading to stops are restricted to just one sample, study, or strategy.

Filtering strategy


We restrict downstream analyses to only those sites that are shared across at least 2 studies (e.g., SRP identifiers).

At this point we can afford being on a conservative side of things. For all subsequent analyses we will only use variants that are shared across at least two Studies. This reduces the number of 5,210 distinct sites to 458.


This charts highlight another interesting observations. The total number of C-to-T changes (together with complementary G-to-A) is 215, which is ~47% of the total (458). These changes can potentially be results of APOBEC3-mediated RNA editing of single stranded RNA. In addition, there are 110 A-to-G changes (along with complementary T-to-C), which is ~24% of the total. These can be results of ADAR-mediated A-to-I RNA editing. Below we try to make sense of these numbers:

Analysis of potential RNA editing


There is evidence of RNA editing by APOBEC3 and ADAR systems.

Here we look for signatures of APOBEC- and ADAR-mediated RNA editing.

APOBEC signatures

APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like) converts C in single stranded stretches of DNA or RNA to U resulting C -> T transition. Members of APOBEC family target specific sequence motifs: TC , CCC , TTC where the last base mutated to T , or TCW ( W = A / T ) where the middle base mutates to T (see Chen and MacCarthy 2014).

The majority of mutations observed on this data are T to C :

Plus strand

Let’s plot upstream context (two upstream nucleotides plus the SNP) of C-to-T substitutions versus number of distinct sites with a C-to-T change:

most sites have context AAC which is not an APOBEC3 motif. At the same time the next most frequent motif is TTC which is an APOBEC3 motif.

Which motifs correspond to sites that are most frequently changed:

here, N is the number of Samples sharing a particular change. For example, a green triangle above ACC is a non-synonymous change shared by over 500 samples in our dataset. Here is a look at sites wihere N is above 100:

POS N Samples Studies change GENE FUNCLASS type func up down mid
0 241 413 413 12 5 . . 5’UTR . TTC CGT TCG
1 1059 265 265 5 3 orf1ab MISSENSE mat_peptide nsp2 CAC CCT ACC
2 3037 431 431 13 5 orf1ab SILENT mat_peptide nsp3 TTC CTA TCT
3 8782 129 129 11 4 orf1ab SILENT mat_peptide nsp4 AGC CCA GCC
4 14408 537 537 11 5 orf1ab MISSENSE mat_peptide RNA-dependent RNA polymerase ACC CTA CCT
5 18060 119 119 7 4 orf1ab SILENT mat_peptide 3’-to-5’ exonuclease CTC CTT TCT

and the allele frequency distribution for these suggests that in most cases they are fixed:

Minus strand

For the opposite strand edits we will plot the downstream context (SNP base plus two downstream nucleotides):

here 3’-GAA-5’ (or 5’-TTC-3’ in the proper 5’-to-3’ orientation) is the most prominent motif. It is an APOBEC3 motif.

The motif corresponding to the sites that are most frequently changed is:

and the most prominent sites look like this:

POS N change GENE FUNCLASS type func up down mid
3 1730 23 GA orf1ab MISSENSE mat_peptide nsp2 TGG GAA GGA
26 28881 17 GA N MISSENSE CDS nucleocapsid phosphoprotein TAG GGG AGG
27 28882 17 GA N SILENT CDS nucleocapsid phosphoprotein AGG GGG GGG

with allele frequencies distributed in the following way:

There are several interesting things here:

  1. Site 1,730 is always polymorphic despite the fact that it is present in 23 samples. Sites 28,881 and 28,882 are adjacent to each other and both have upstream context CCC (complement of GGG).

The sites appear to be linked (we need to look at alignments) based on the difference between alternative allele frequencies for site 28,881 and 28,882 for all samples that have that change as there the alternative allele frequencies are essentially equal between site 28,881 and 28,882:

ADAR editing

ADAR changes A to I within double stranded stretches, which reads as an A to G change. Bazak et al. 2014 report the following context for ADAR-edited sites:

Plus strand

For plus strand A-to-G changes have the following surrounding context:

Here the GA means G is -1 position and A in +1 position. In terms of A-to-G substitution context present in the highest number of samples, it is GT and TT:

at these sites:

POS N change GENE FUNCLASS type func mid left right left-right
23 17858 93 AG orf1ab MISSENSE mat_peptide helicase TAT T T TT
30 23403 482 AG S MISSENSE CDS surface glycoprotein GAT G T GT

with allele frequencies distributed like this:

Minus strand

On the minus strand we would expect T-toC substitutions with the following contexts:

The distribution of most widely observed contexts:

where the most prominent sites are:

POS N change GENE FUNCLASS type func mid left right left-right
1 833 24 TC orf1ab MISSENSE mat_peptide nsp2 CTT C T CT
41 17247 28 TC orf1ab SILENT mat_peptide helicase GTG G G GG
61 28144 187 TC ORF8 MISSENSE CDS ORF8 protein TTA T A TA

with the following distribution of allele frequencies:

A closer look at variants


There is a group of mixed sites with a wide allele frequency range shared across multiple studies. Such sites are fixed in some sample and polymorphic in others. Some of these sites are supported by multiple experimental strategies (highlighted by different colors; strategies are: RNA-Seq, AMPLICON, Targeted-Capture, WGS, OTHER.)

The following image shows nonsynonymous sites supported by at least two studies versus alternative allele frequency for individual samples. You can see a handful of sites where frequency is stretched across the entire 0 to 1 range:

Here we take a look at the three categories of sites – (1) fixed, (2) polymorphic, and (3) mixed. In the following graphs we shown missense and nonsense sites only. The size of the marker is proportional to the number of distinct studies supporting each site. The Y axis the range of alternative allele frequences across all samples at a given site.

Fixed sites

Here is a list of these sites (in all tables of this kind below Samples = # of SRR datasets supporting this site, Studies = # of SRP accessions supporting this site, Strategies = # of library preparation strategies such as RNA-Seq, AMPLICON, Targeted-Capture, WGS, OTHER, AFRange is the range of alternative allele frequency values across all samples at this site with this change):

POS change FUNCLASS func Samples Studies Strategies AFrange
0 1917 CT MISSENSE nsp2 22 2 2 0.021277
1 2018 AG MISSENSE nsp2 5 2 2 0.027568
2 8031 AG MISSENSE nsp3 2 2 2 3.5e-05
3 11916 CT MISSENSE nsp7 7 2 2 0.021583
4 12478 GA MISSENSE nsp8 5 2 2 0.035075
5 14786 CT MISSENSE RNA-dependent RNA polymerase 3 3 2 0.031151
6 25429 GT MISSENSE ORF3a protein 3 2 1 0.016583
7 28300 GT MISSENSE nucleocapsid phosphoprotein 2 2 2 0.027397

Polymorphic sites

There are many polymorphic sites, so here is a partial list containing only sites supported by >= 3 Studies (the NONSENSE call here is the same that we discussed above):

POS change FUNCLASS func Samples Studies Strategies AFrange
0 72 GA . . 5 3 1 0.11878
1 203 CT . . 3 3 2 0.072665
2 593 CT MISSENSE leader protein 3 3 2 0.211161
3 635 CT MISSENSE leader protein 3 3 2 0.166995
4 906 CT MISSENSE nsp2 7 3 2 0.125
5 1730 GA MISSENSE nsp2 23 3 2 0.35
6 3259 GT MISSENSE nsp3 5 3 1 0.162266
7 7869 CT MISSENSE nsp3 3 3 2 0.089
8 11083 GT MISSENSE nsp6 24 7 3 0.895846
9 11367 AT MISSENSE nsp6 5 3 1 0.022661
10 12578 GT MISSENSE nsp8 4 3 1 0.24249
11 13640 AT MISSENSE RNA-dependent RNA polymerase 4 3 3 0.087245
12 13693 AT MISSENSE RNA-dependent RNA polymerase 4 3 2 0.042662
13 15682 TA MISSENSE RNA-dependent RNA polymerase 14 3 3 0.167971
14 15685 TA MISSENSE RNA-dependent RNA polymerase 15 3 3 0.082486
15 16085 AT MISSENSE RNA-dependent RNA polymerase 3 3 2 0.296992
16 17288 CT MISSENSE helicase 4 3 1 0.425314
17 17523 GA MISSENSE helicase 4 3 2 0.072694
18 17549 TG MISSENSE helicase 9 3 2 0.028718
19 18508 CT MISSENSE 3’-to-5’ exonuclease 5 3 2 0.139676
20 18881 CT MISSENSE 3’-to-5’ exonuclease 3 3 1 0.078217
21 22062 CT MISSENSE surface glycoprotein 7 4 3 0.129441
22 24156 TC MISSENSE surface glycoprotein 3 3 3 0.024088
23 24966 AT MISSENSE surface glycoprotein 6 3 3 0.226931
24 25433 CT MISSENSE ORF3a protein 3 3 1 0.120255
25 26542 CA MISSENSE membrane glycoprotein 9 4 3 0.053911
26 26545 TG MISSENSE membrane glycoprotein 9 3 3 0.09785
27 29039 AT NONSENSE nucleocapsid phosphoprotein 29 5 3 0.153384
28 29187 CT MISSENSE nucleocapsid phosphoprotein 33 4 2 0.191402
29 29774 CT . . 5 4 2 0.865321

Mixed sites

This is perhaps the most interesting category:

Here is a partial list of sites supported by at least four Studies:

POS change FUNCLASS GENE func Samples Studies Strategies AFrange
0 241 CT . . . 413 12 5 0.92
1 490 TA MISSENSE orf1ab leader protein 11 4 2 0.247869
2 1059 CT MISSENSE orf1ab nsp2 265 5 3 0.946809
9 6696 CT MISSENSE orf1ab nsp3 7 4 2 0.948255
10 10779 TA MISSENSE orf1ab 3C-like proteinase 10 3 1 0.90319
11 14408 CT MISSENSE orf1ab RNA-dependent RNA polymerase 537 11 5 0.821792
12 15760 GA MISSENSE orf1ab RNA-dependent RNA polymerase 3 3 2 0.924901
13 17285 CT MISSENSE orf1ab helicase 6 4 2 0.938289
14 17747 CT MISSENSE orf1ab helicase 93 4 3 0.880492
15 17858 AG MISSENSE orf1ab helicase 93 4 3 0.844828
16 18736 TC MISSENSE orf1ab 3’-to-5’ exonuclease 9 4 2 0.898305
18 23403 AG MISSENSE S surface glycoprotein 482 13 5 0.941878
20 25563 GT MISSENSE ORF3a ORF3a protein 340 7 3 0.936221
21 26144 GT MISSENSE ORF3a ORF3a protein 30 5 2 0.134199
22 28077 GC MISSENSE ORF8 ORF8 protein 11 4 2 0.109966
23 28144 TC MISSENSE ORF8 ORF8 protein 187 12 4 0.907761
24 28881 GA MISSENSE N nucleocapsid phosphoprotein 17 8 4 0.803922
25 28883 GC MISSENSE N nucleocapsid phosphoprotein 17 8 4 0.80198

Here is another view showing distribution of alternative allele frequencies for top mixed sites across sample collection locations:

.

Comparison with sites under selection


There is a good concordance between intrahost variable sites and sites under selection

Finally we wanted to see how our results compare with the selection analysis descrtibed here. The selection analysis ranks sites based on belonging to one of more of the following categories (see this for detailed description:

  • evidence of selection (pervasive, episodic, or negative)
  • multiple selective events
  • increase in the allele frequency
  • high frequency variants
  • evolutionary unexpected variants
  • non-neutral evolution at that site in other coronoviruses
  • belonging to CTL-epitopes

Based on these criteria each site is given a score between 1 and 7 (a TSV representation of this dataset is available here). Among 142 sites with score > 0 it is distributed in the following way:
image
We first intersected all selected sites with score > 0 with variants on our dateset. Among 286 nonsynonymous sites 27 (~10%) were under selection. We then restricted our analysis to 17 most reliable sites under selection with score > 4. Seven of these sites were shared with our dataset:

POS change func Samples Studies Strategies score multiple_clades related_selection
0 6312 CA nsp3 4 2 2 5 5 1
1 10377 CT 3Cpro 5 2 1 5 2 1
2 14408 CT RdRp 537 11 5 7 15 1
3 14786 CT RdRp 3 3 2 5 4 0
4 17126 TC helicase 2 2 2 6 2 1
5 23403 AG S 482 13 5 5 31 0
6 25563 GT orf3a 340 7 3 6 7 1

and the variation in allele frequencies across these sites:

More to come …

1 Like

I am quite surprised that you see 28,881 and 28,882 being often polymorphic, but not 28,883. These three sites appear in consensus sequences to be mutated in almost perfect linkage, so it would make sense that if polymorphisms are present at 28,881 and 28,882 , say, because of mixed infections, then 28,883 would also be found as polymorphic in the same samples and actually in perfect linkage with the others in the same reads.

@NicolaDeMaio - I do actually:

POS change minAF maxAF rangeAF Samples Studies Strategies
0 28881 GA 0.196078 1 0.803922 17 8 4
1 28882 GA 0.186275 1 0.813725 17 8 4
2 28883 GC 0.19802 1 0.80198 17 8 4

It looks like selection data is indexed by codon, while variants are indexed by nucleotide, so 28,883 is definitely there. I will fix this in the next run that incorporated indels