Early analysis of 1,093 SARS-CoV2 intra-host variation datasets and comparison with sites under selection

anekrut · May 29, 2020, 2:34pm

Marius Van Den Beek, Wolfgang Maier, Sergei Kosakowsky Pond, Anton Nekrutenko + DataMonkey and GalaxyProject Teams | galaxyproject.org

This is a part of the join effort between Datamonkey and GalaxyProject teams to provide free, reproducible, and accessible set of workflows for the analysis of SARS-CoV2 data.

This is a preliminary analysis of SARS-CoV2 intra-host variation. It includes SNPs only (no indels at this point) and Illumina data only. This post describes analysis of data publicly available from SRA and ENA between the beginning of the outbreak and May 17, 2020.

Outline

Dataset overview
Types of changes
What’s up with stops?
Filtering strategy
Analysis of RNA editing
- APOBEC
- ADAR
A closer look at variants
Comparison with sites under selection

Dataset overview

How we call variants

In this study we use lofreq for calling variants. This choice is based on banchmarking described here

Sites, samples, and studies

There are still very few raw read darasets for SARS-CoV2

Despite the fact that the number of complete genomes in places like GISAID has already surpassed 30,000, the number of raw datasets remains relatively modest by comparison:
[‘RNA-Seq’, ‘AMPLICON’, ‘Targeted-Capture’, ‘WGS’, ‘OTHER’]

Feature	Count	Description
Total sites	11,059	Total number of SNPs across all samples. Since many sites are presented in multiple samples they are counted multiple times here
Distinct sites	4,899	Distinct variable positions across SARS-CoV2 genome
Distinct samples	1,093	Distinct SRA or ENA accessions in this dataset
Distinct studies	28	SRA bundles samples into studies
Distinct geographic locations	24	‘Germany’, ‘China:Shanghai’, None, ‘USA:Virginia’,‘USA: Hamilton, MT’, ‘USA’, ‘USA:Minnesota’, ‘USA: Washington’, ‘Peru: Lima’, ‘USA: New York’, ‘Germany: Dusseldorf’, ‘Bangladesh’, ‘USA:Los Angeles’, ‘China: Hubei’, ‘China: Wuhan’, ‘Malaysia’, ‘India: Rajkot’, ‘Israel’, ‘Germany: Heinsberg’, ‘USA: WA’,‘USA: Seattle, WA’, ‘Cambodia:Sihanoukville’,‘Malaysia: Kuala Lumpur’, ‘Egypt’, ‘Nepal: Kathmandu’

Types of changes

At first sight most variable sites involve non-synonymous changes and have wide range of alternative allele frequencies

We define three types of sites:

FIXED = alternative allele frequency (AAF) across all samples at a given site >= 95%
POLYMORPHIC = AAF across all samples at a given site > 5% and < 95%
MIXED = AAF across all samples at a given site > 5% (these are different from polymorphic in that some samples may have AAF > 95%).

In our data we have (here ‘.’ = non coding regions such as UTRs):

	FUNCLASS	N	type
0	.	5	fixed
1	MISSENSE	203	fixed
2	NONSENSE	1	fixed
3	SILENT	132	fixed
4	.	74	polymorphic
5	MISSENSE	3207	polymorphic
6	NONSENSE	192	polymorphic
7	SILENT	1243	polymorphic
8	.	5	mixed
9	MISSENSE	90	mixed
10	SILENT	58	mixed

or graphically:

These changes are distributed across SARS-CoV2 genes in the following way (here rlen, Y-axis,
is the number of changes with a gene divided by its length):

Where’s the data?

What’s up with stops?

There is a number of substitution leading to stops. The most prominent site, 29,039, is dominated by a single study.

The figure above shown a substantial number of nucleotide changes creating stop codons (red bars in the above plot). Our previous analysis showed that stops need to be treated with caution (the study SRP253798 was excluded from this analysis). Let’s see which studies have stops:

	study_accession	Library Strategy	Stops	Samples
0	SRP258105	AMPLICON	73	46
1	SRP253926	AMPLICON	70	38
2	SRP251618	RNA-Seq	61	27
3	ERP121228	AMPLICON	33	22
4	SRP261417	Targeted-Capture	31	19
5	SRP256957	AMPLICON	8	4
6	SRP248092	RNA-Seq	6	3
7	SRP255151	AMPLICON	3	3
8	SRP261949	OTHER	2	1
9	SRP258006	AMPLICON	1	1
10	SRP254688	RNA-Seq	1	1
11	SRP254488	AMPLICON	1	1
12	SRP242226	RNA-Seq	1	1

and which of the stops are shared across multiple studies:

	POS	change	Samples	Studies	Strategies	Gene	Function
0	29039	AT	29	5	3	N	nucleocapsid phosphoprotein
1	10718	CT	2	2	1	orf1ab	3C-like proteinase
2	11357	AT	14	2	2	orf1ab	nsp6
3	16672	AT	2	2	2	orf1ab	helicase
4	17865	TA	6	2	2	orf1ab	helicase
5	23511	TG	3	2	1	S	surface glycoprotein
6	28209	GT	2	2	1	ORF8	ORF8 protein
7	29120	CT	2	2	1	N	nucleocapsid phosphoprotein
8	29642	CT	2	2	2	ORF10	ORF10 protein
9	269	GT	1	1	1	orf1ab	leader protein
10	329	CT	1	1	1	orf1ab	leader protein
11	374	GT	1	1	1	orf1ab	leader protein
12	404	AT	1	1	1	orf1ab	leader protein
13	536	GT	1	1	1	orf1ab	leader protein
14	728	GT	1	1	1	orf1ab	leader protein

here is a genome view:

let’s look at position 29,039 (a subset of rows shown; see notebook for full dataset) – largest cross on the image above (below all columns are self explanatory with the exception of DP4, which contains strand-specific counts: (1) reference observations on forward, (2) reference observations on reverse, (3) alternative observations on forward, and (4) alternative observations on reverse) :

	Sample	DP	AF	SB	DP4	GENE	CODON	study_accession	library_strategy	library_source	library_selection
0	SRR11597150	52	0.057692	2	31,18,3,0	N	Aag/Tag	SRP258105	AMPLICON	VIRAL RNA	PCR
1	SRR11597171	62	0.096774	2	43,12,6,0	N	Aag/Tag	SRP258105	AMPLICON	VIRAL RNA	PCR
2	SRR11597195	53	0.056604	5	32,16,1,2	N	Aag/Tag	SRP258105	AMPLICON	VIRAL RNA	PCR
3	SRR11783573	2015	0.074442	146	834,1003,28,148	N	Aag/Tag	SRP261417	Targeted-Capture	VIRAL RNA	RT-PCR
4	SRR11783615	1095	0.204566	291	530,317,53,189	N	Aag/Tag	SRP261417	Targeted-Capture	VIRAL RNA	RT-PCR
5	SRR11783623	5373	0.051182	223	2603,2420,272,75	N	Aag/Tag	SRP261417	Targeted-Capture	VIRAL RNA	RT-PCR
6	SRR11771993	105	0.066667	0	50,48,4,3	N	Aag/Tag	SRP251618	RNA-Seq	VIRAL RNA	unspecified
7	SRR11772009	140	0.064286	0	64,67,4,5	N	Aag/Tag	SRP251618	RNA-Seq	VIRAL RNA	unspecified

it is interesting to note that most rows in the above table come from SRP251618 and this one study has a relatively tight distribution of alternative allele frequencies at this site:

in general, a tight distribution of allele frequencies within a single study may potentially suggest contamination (e.g., see Dickins 2014). Here is the distribution of alternative allele frequencies for all SNPs within SRP251618 where a site is shared across at least 5 samples (as you can see it is a little tight):

To gain another perspective let’s plot frequency of nucleotide changes leading to a stop versus that number of Samples, Studies, and Library preparation strategies they appear in:

as you can see the absolute majority of changes leading to stops are restricted to just one sample, study, or strategy.

Filtering strategy

We restrict downstream analyses to only those sites that are shared across at least 2 studies (e.g., SRP identifiers).

At this point we can afford being on a conservative side of things. For all subsequent analyses we will only use variants that are shared across at least two Studies. This reduces the number of 5,210 distinct sites to 458.

This charts highlight another interesting observations. The total number of C-to-T changes (together with complementary G-to-A) is 215, which is ~47% of the total (458). These changes can potentially be results of APOBEC3-mediated RNA editing of single stranded RNA. In addition, there are 110 A-to-G changes (along with complementary T-to-C), which is ~24% of the total. These can be results of ADAR-mediated A-to-I RNA editing. Below we try to make sense of these numbers:

Analysis of potential RNA editing

There is evidence of RNA editing by APOBEC3 and ADAR systems.

Here we look for signatures of APOBEC- and ADAR-mediated RNA editing.

APOBEC signatures

APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like) converts C in single stranded stretches of DNA or RNA to U resulting C → T transition. Members of APOBEC family target specific sequence motifs: TC , CCC , TTC where the last base mutated to T , or TCW ( W = A / T ) where the middle base mutates to T (see Chen and MacCarthy 2014).

The majority of mutations observed on this data are T to C :

Plus strand

Let’s plot upstream context (two upstream nucleotides plus the SNP) of C-to-T substitutions versus number of distinct sites with a C-to-T change:

most sites have context AAC which is not an APOBEC3 motif. At the same time the next most frequent motif is TTC which is an APOBEC3 motif.

Which motifs correspond to sites that are most frequently changed:

here, N is the number of Samples sharing a particular change. For example, a green triangle above ACC is a non-synonymous change shared by over 500 samples in our dataset. Here is a look at sites wihere N is above 100:

	POS	N	Samples	Studies	change	GENE	FUNCLASS	type	func	up	down	mid
0	241	413	413	12	5	.	.	5’UTR	.	TTC	CGT	TCG
1	1059	265	265	5	3	orf1ab	MISSENSE	mat_peptide	nsp2	CAC	CCT	ACC
2	3037	431	431	13	5	orf1ab	SILENT	mat_peptide	nsp3	TTC	CTA	TCT
3	8782	129	129	11	4	orf1ab	SILENT	mat_peptide	nsp4	AGC	CCA	GCC
4	14408	537	537	11	5	orf1ab	MISSENSE	mat_peptide	RNA-dependent RNA polymerase	ACC	CTA	CCT
5	18060	119	119	7	4	orf1ab	SILENT	mat_peptide	3’-to-5’ exonuclease	CTC	CTT	TCT

and the allele frequency distribution for these suggests that in most cases they are fixed:

Minus strand

For the opposite strand edits we will plot the downstream context (SNP base plus two downstream nucleotides):

here 3’-GAA-5’ (or 5’-TTC-3’ in the proper 5’-to-3’ orientation) is the most prominent motif. It is an APOBEC3 motif.

The motif corresponding to the sites that are most frequently changed is:

and the most prominent sites look like this:

	POS	N	change	GENE	FUNCLASS	type	func	up	down	mid
3	1730	23	GA	orf1ab	MISSENSE	mat_peptide	nsp2	TGG	GAA	GGA
26	28881	17	GA	N	MISSENSE	CDS	nucleocapsid phosphoprotein	TAG	GGG	AGG
27	28882	17	GA	N	SILENT	CDS	nucleocapsid phosphoprotein	AGG	GGG	GGG

with allele frequencies distributed in the following way:

There are several interesting things here:

Site 1,730 is always polymorphic despite the fact that it is present in 23 samples. Sites 28,881 and 28,882 are adjacent to each other and both have upstream context CCC (complement of GGG).

The sites appear to be linked (we need to look at alignments) based on the difference between alternative allele frequencies for site 28,881 and 28,882 for all samples that have that change as there the alternative allele frequencies are essentially equal between site 28,881 and 28,882:

ADAR editing

ADAR changes A to I within double stranded stretches, which reads as an A to G change. Bazak et al. 2014 report the following context for ADAR-edited sites:

Plus strand

For plus strand A-to-G changes have the following surrounding context:

Here the GA means G is -1 position and A in +1 position. In terms of A-to-G substitution context present in the highest number of samples, it is GT and TT:

at these sites:

	POS	N	change	GENE	FUNCLASS	type	func	mid	left	right	left-right
23	17858	93	AG	orf1ab	MISSENSE	mat_peptide	helicase	TAT	T	T	TT
30	23403	482	AG	S	MISSENSE	CDS	surface glycoprotein	GAT	G	T	GT

with allele frequencies distributed like this:

Minus strand

On the minus strand we would expect T-toC substitutions with the following contexts:

The distribution of most widely observed contexts:

where the most prominent sites are:

	POS	N	change	GENE	FUNCLASS	type	func	mid	left	right	left-right
1	833	24	TC	orf1ab	MISSENSE	mat_peptide	nsp2	CTT	C	T	CT
41	17247	28	TC	orf1ab	SILENT	mat_peptide	helicase	GTG	G	G	GG
61	28144	187	TC	ORF8	MISSENSE	CDS	ORF8 protein	TTA	T	A	TA

with the following distribution of allele frequencies:

A closer look at variants

There is a group of mixed sites with a wide allele frequency range shared across multiple studies. Such sites are fixed in some sample and polymorphic in others. Some of these sites are supported by multiple experimental strategies (highlighted by different colors; strategies are: RNA-Seq, AMPLICON, Targeted-Capture, WGS, OTHER.)

The following image shows nonsynonymous sites supported by at least two studies versus alternative allele frequency for individual samples. You can see a handful of sites where frequency is stretched across the entire 0 to 1 range:

Here we take a look at the three categories of sites – (1) fixed, (2) polymorphic, and (3) mixed. In the following graphs we shown missense and nonsense sites only. The size of the marker is proportional to the number of distinct studies supporting each site. The Y axis the range of alternative allele frequences across all samples at a given site.

Fixed sites

Here is a list of these sites (in all tables of this kind below Samples = # of SRR datasets supporting this site, Studies = # of SRP accessions supporting this site, Strategies = # of library preparation strategies such as RNA-Seq, AMPLICON, Targeted-Capture, WGS, OTHER, AFRange is the range of alternative allele frequency values across all samples at this site with this change):

	POS	change	FUNCLASS	func	Samples	Studies	Strategies	AFrange
0	1917	CT	MISSENSE	nsp2	22	2	2	0.021277
1	2018	AG	MISSENSE	nsp2	5	2	2	0.027568
2	8031	AG	MISSENSE	nsp3	2	2	2	3.5e-05
3	11916	CT	MISSENSE	nsp7	7	2	2	0.021583
4	12478	GA	MISSENSE	nsp8	5	2	2	0.035075
5	14786	CT	MISSENSE	RNA-dependent RNA polymerase	3	3	2	0.031151
6	25429	GT	MISSENSE	ORF3a protein	3	2	1	0.016583
7	28300	GT	MISSENSE	nucleocapsid phosphoprotein	2	2	2	0.027397

Polymorphic sites

There are many polymorphic sites, so here is a partial list containing only sites supported by >= 3 Studies (the NONSENSE call here is the same that we discussed above):

	POS	change	FUNCLASS	func	Samples	Studies	Strategies	AFrange
0	72	GA	.	.	5	3	1	0.11878
1	203	CT	.	.	3	3	2	0.072665
2	593	CT	MISSENSE	leader protein	3	3	2	0.211161
3	635	CT	MISSENSE	leader protein	3	3	2	0.166995
4	906	CT	MISSENSE	nsp2	7	3	2	0.125
5	1730	GA	MISSENSE	nsp2	23	3	2	0.35
6	3259	GT	MISSENSE	nsp3	5	3	1	0.162266
7	7869	CT	MISSENSE	nsp3	3	3	2	0.089
8	11083	GT	MISSENSE	nsp6	24	7	3	0.895846
9	11367	AT	MISSENSE	nsp6	5	3	1	0.022661
10	12578	GT	MISSENSE	nsp8	4	3	1	0.24249
11	13640	AT	MISSENSE	RNA-dependent RNA polymerase	4	3	3	0.087245
12	13693	AT	MISSENSE	RNA-dependent RNA polymerase	4	3	2	0.042662
13	15682	TA	MISSENSE	RNA-dependent RNA polymerase	14	3	3	0.167971
14	15685	TA	MISSENSE	RNA-dependent RNA polymerase	15	3	3	0.082486
15	16085	AT	MISSENSE	RNA-dependent RNA polymerase	3	3	2	0.296992
16	17288	CT	MISSENSE	helicase	4	3	1	0.425314
17	17523	GA	MISSENSE	helicase	4	3	2	0.072694
18	17549	TG	MISSENSE	helicase	9	3	2	0.028718
19	18508	CT	MISSENSE	3’-to-5’ exonuclease	5	3	2	0.139676
20	18881	CT	MISSENSE	3’-to-5’ exonuclease	3	3	1	0.078217
21	22062	CT	MISSENSE	surface glycoprotein	7	4	3	0.129441
22	24156	TC	MISSENSE	surface glycoprotein	3	3	3	0.024088
23	24966	AT	MISSENSE	surface glycoprotein	6	3	3	0.226931
24	25433	CT	MISSENSE	ORF3a protein	3	3	1	0.120255
25	26542	CA	MISSENSE	membrane glycoprotein	9	4	3	0.053911
26	26545	TG	MISSENSE	membrane glycoprotein	9	3	3	0.09785
27	29039	AT	NONSENSE	nucleocapsid phosphoprotein	29	5	3	0.153384
28	29187	CT	MISSENSE	nucleocapsid phosphoprotein	33	4	2	0.191402
29	29774	CT	.	.	5	4	2	0.865321

Mixed sites

This is perhaps the most interesting category:

Here is a partial list of sites supported by at least four Studies:

	POS	change	FUNCLASS	GENE	func	Samples	Studies	Strategies	AFrange
0	241	CT	.	.	.	413	12	5	0.92
1	490	TA	MISSENSE	orf1ab	leader protein	11	4	2	0.247869
2	1059	CT	MISSENSE	orf1ab	nsp2	265	5	3	0.946809
9	6696	CT	MISSENSE	orf1ab	nsp3	7	4	2	0.948255
10	10779	TA	MISSENSE	orf1ab	3C-like proteinase	10	3	1	0.90319
11	14408	CT	MISSENSE	orf1ab	RNA-dependent RNA polymerase	537	11	5	0.821792
12	15760	GA	MISSENSE	orf1ab	RNA-dependent RNA polymerase	3	3	2	0.924901
13	17285	CT	MISSENSE	orf1ab	helicase	6	4	2	0.938289
14	17747	CT	MISSENSE	orf1ab	helicase	93	4	3	0.880492
15	17858	AG	MISSENSE	orf1ab	helicase	93	4	3	0.844828
16	18736	TC	MISSENSE	orf1ab	3’-to-5’ exonuclease	9	4	2	0.898305
18	23403	AG	MISSENSE	S	surface glycoprotein	482	13	5	0.941878
20	25563	GT	MISSENSE	ORF3a	ORF3a protein	340	7	3	0.936221
21	26144	GT	MISSENSE	ORF3a	ORF3a protein	30	5	2	0.134199
22	28077	GC	MISSENSE	ORF8	ORF8 protein	11	4	2	0.109966
23	28144	TC	MISSENSE	ORF8	ORF8 protein	187	12	4	0.907761
24	28881	GA	MISSENSE	N	nucleocapsid phosphoprotein	17	8	4	0.803922
25	28883	GC	MISSENSE	N	nucleocapsid phosphoprotein	17	8	4	0.80198

Here is another view showing distribution of alternative allele frequencies for top mixed sites across sample collection locations:

.

Comparison with sites under selection

There is a good concordance between intrahost variable sites and sites under selection

Finally we wanted to see how our results compare with the selection analysis descrtibed here. The selection analysis ranks sites based on belonging to one of more of the following categories (see this for detailed description:

evidence of selection (pervasive, episodic, or negative)
multiple selective events
increase in the allele frequency
high frequency variants
evolutionary unexpected variants
non-neutral evolution at that site in other coronoviruses
belonging to CTL-epitopes

Based on these criteria each site is given a score between 1 and 7 (a TSV representation of this dataset is available here). Among 142 sites with score > 0 it is distributed in the following way:

We first intersected all selected sites with score > 0 with variants on our dateset. Among 286 nonsynonymous sites 27 (~10%) were under selection. We then restricted our analysis to 17 most reliable sites under selection with score > 4. Seven of these sites were shared with our dataset:

	POS	change	func	Samples	Studies	Strategies	score	multiple_clades	related_selection
0	6312	CA	nsp3	4	2	2	5	5	1
1	10377	CT	3Cpro	5	2	1	5	2	1
2	14408	CT	RdRp	537	11	5	7	15	1
3	14786	CT	RdRp	3	3	2	5	4	0
4	17126	TC	helicase	2	2	2	6	2	1
5	23403	AG	S	482	13	5	5	31	0
6	25563	GT	orf3a	340	7	3	6	7	1

and the variation in allele frequencies across these sites:

More to come …

NicolaDeMaio · June 5, 2020, 11:43am

I am quite surprised that you see 28,881 and 28,882 being often polymorphic, but not 28,883. These three sites appear in consensus sequences to be mutated in almost perfect linkage, so it would make sense that if polymorphisms are present at 28,881 and 28,882 , say, because of mixed infections, then 28,883 would also be found as polymorphic in the same samples and actually in perfect linkage with the others in the same reads.

anekrut · June 8, 2020, 8:35pm

@NicolaDeMaio - I do actually:

	POS	change	minAF	maxAF	rangeAF	Samples	Studies	Strategies
0	28881	GA	0.196078	1	0.803922	17	8	4
1	28882	GA	0.186275	1	0.813725	17	8	4
2	28883	GC	0.19802	1	0.80198	17	8	4

It looks like selection data is indexed by codon, while variants are indexed by nucleotide, so 28,883 is definitely there. I will fix this in the next run that incorporated indels