Gained stops in data from The Peter Doherty Institute for Infection and Immunity

anekrut · May 15, 2020, 3:44pm

We (https://covid19.galaxyproject.org) perform continuous analyses of all read-level data from short read archives (both SRA and ERA). At this point we have 1,343 distinct Illumina datasets from 22 studies. One study (SRP253798) stands out as highly problematic and likely cannot be used for reliable analysis of intra-host variation. So use caution. Here are details:

Absolute majority of gained stops are from a single study

(Variant list we used for this analysis can be found here).

After applying our variant calling workflow to these data we discovered an unusually high number of substitutions leading to the creation of stop codons – an obvious red flag. Here is a scatterplot of the relationship between total number of variants and variants leading to a stop creation colored by SRA study accessions:

stop_vs_var

you can see that it is dominated by SRP253798. Removing this one study accession cleans up this picture quite a bit:

stop_vs_var_nonoise

This particular study SRP253798 is also by far the biggest in terms of Sample (raw read datasets):

	study_accession	samples
0	SRP253798	876
1	ERP121228	227
2	SRP253926	118
3	SRP256957	54
4	SRP255151	15
5	SRP258006	10
6	SRP248092	7
7	SRP254688	6
8	SRP251618	5
9	SRP257865	4
10	SRP254488	4
11	SRP250294	4
12	SRP259532	2
13	SRP242226	2
14	SRP040661	2
15	SRP257463	1
16	SRP256770	1
17	SRP255769	1
18	SRP255672	1
19	SRP252988	1
20	SRP250653	1
21	SRP245409	1

Most stops are from T-to-C transitions

If one plots a very simple relationship between the number of C-to-T transitions (a possible signature of APOBEC-induced RNA modification) to the total number of changes for all study accessions we see this strange 1-to-1 relationship:

tc_to_all

again, it is dominated by SRP253798. Removing it produces a far more reasonable picture:

tc_to_all_nonoise

We are excluding these samples from analyses completely.

n_j_loman · May 15, 2020, 4:54pm

Thanks for this analysis.

I am curious about two aspects of your automated pipeline:

do you specifically handle amplicon data (i.e. correctly trim primer sequences) according to the primer scheme being employed? This may have an impact on the detection of mixed positions where there is a difference between the primer sequence and the template.
whether your pipeline filters reads according to Q score in a principled manner. E.g. some centres might deposit raw, untrimmed data without read quality filtering, and some might upload quality filtered data at Q30. This could have a big impact on your results and unfairly penalise the Doherty centre data (I am not sure what they do).

Nick

anekrut · May 15, 2020, 7:00pm

Nick:

Excellent points.

Point 1

do you specifically handle amplicon data (i.e. correctly trim primer sequences) according to the primer scheme being employed

Aside from removing adapter contamination and looking at over-represented sequences, no. However, most of the data is ampliconic:

	study_accession	library_strategy	samples
0	SRP253798	AMPLICON	876
1	ERP121228	AMPLICON	227
2	SRP253926	AMPLICON	118
3	SRP256957	AMPLICON	54
4	SRP255151	AMPLICON	15
5	SRP258006	AMPLICON	10
6	SRP248092	RNA-Seq	7
7	SRP254688	RNA-Seq	6
8	SRP251618	RNA-Seq	5
9	SRP257865	WGS	4
10	SRP254488	AMPLICON	4
11	SRP250294	WGS	4
12	SRP259532	AMPLICON	2
13	SRP242226	RNA-Seq	2
14	SRP040661	WGS	2
15	SRP257463	WGS	1
16	SRP256770	RNA-Seq	1
17	SRP255769	RNA-Seq	1
18	SRP255672	AMPLICON	1
19	SRP252988	RNA-Seq	1
20	SRP250653	RNA-Seq	1
21	SRP245409	RNA-Seq	1

If we look at the relationship between the number of variants introducing stop to all variants for the top three largest studies, SRP253798 still looks strange:

Point 2

whether your pipeline filters reads according to Q score in a principled manner

the strategy is:

map
remove duplicates
retain properly paired reads only (if data is PE)
use lofreq realign
use lofreq call for calling at sites with Q>=30 and mapping Q >= 20 and minimum coverage of 50

the strange thing is that from the standpoint of “common sense” metrics such as strand bias or read placement the variants in Doherty look good actually. Below is a stop in nsp3 (Run SRR11578269):

n_j_loman · May 15, 2020, 7:52pm

Thanks. It’s curious - and interesting. I don’t believe in that case that this is sequencing error, i.e. error arising from the sequencing instrument. But it would be interesting in that case to find out exactly what the cause of these high density runs of C>T mutations are. I think RNA editing is a good bet.

sergei.pond · May 15, 2020, 9:24pm

@n_j_loman : Indeed, it’s interesting, and there are a number of studies suggesting extensive opportunity for RNA editing.

I hear that are ~20,000 NGS UK datasets available. If you had a chance to look at them, would you be willing to share if any significant fraction show evidence of C>T or G>T hypermutations?

n_j_loman · May 15, 2020, 10:26pm

Definitely, we’ve started on variant frequency analysis and I’ll report back as soon as I can, and I am expecting most of these datasets should be deposited in SRA before very long for others to look at too.

n_j_loman · May 17, 2020, 12:47pm

Would it be possible to post the run accessions for these extreme outlier samples as represented in this plot?

torsten · May 18, 2020, 5:23am

Anton,

Thanks for taking the time to look at our data set. The data comes from a public health setting and was in no way designed for intra-host variation (or more specifically intra-sample variation). These samples travel from pathology labs to the RT-PCR lab to the WGS lab, often facing multiple freeze/thaw cycles, without any RNA protection buffer. This is not ideal but part of the reality of public health and clinical microbiology.

We build consensus genomes from the data. Currently we use ivar as our primary tool; it’s trim module correctly removes primer sequence based on the ARTIC .bed file used in the amplicon assay. I note you did not do this in your Galaxy workflow, which puts you at risk of calling false SNPs in the (many) primer regions due to the reads having non-template DNA in them which differs from the sample. This could cause false minor alleles like you report, but that said, I believe most of the ones you have found are real. We believe RNA degradation is the main cause, which your work has helped highlight.

We are currently calling consensus at 90% threshold, which relates nicely with your 10% minor allele fraction. For some of our samples we noticed a large number of “heterzygous” sites which are encoded using an ambiguous IUPAC code, and yes most of them were codes encompassing a “T” + A,G,C ambiguity, which is consistent with the RNA degradation theory. Other groups appear to be using thresholds lower than 90%, which would help mitigate this problem.

We will be submitting more data soon to SRA. It will have similar problems I suspect. I am endeavouring to add more metadata about each sample to help people infer patterns or batch effects too. This is an ongoing project for us, as COVID cases have not ceased for us yet. The FASTQ we submitted was all pairs that mapped to the reference genome. This was to ensure we did not submit any human DNA, but a few chimeras have gotten through. We did not filter in any other way. eg quality.

The main use case for our data is cluster detection and analysis - using genomics in combination with epidemiology to best understand transmission and sources. This is only minorly affected by the issues you highlight in your post fortunately! But we will improve our processes to ensure the minor alleles don’t impact our decision making. Here is our paper showing how we used the data:

I am happy you did this post, but I feel the post title is a bit alarmist, and it could be made more obvious that your criticism applies only to intra-sample analysis and minor alleles. I would be interested to see if your results change when you handle the non-template primer DNA correctly, but I think they will largely be the same as the virus has low diversity.

I think we can agree that this is an excellent example of open data and many eyes working for the greater good!

Torsten

anekrut · May 18, 2020, 1:42pm

Dear @n_j_loman:

Here is a table of 10 accessions from SRP253798 with the highest number of C->T changes (variants = total number of variants with AF > 5%, CT = number of variants that are C-to-T substitutions):

	Sample	study_accession	library_strategy	variants	CT	ratio
0	SRR11578269	SRP253798	AMPLICON	5179	5174	0.999035
1	SRR11578268	SRP253798	AMPLICON	3781	3775	0.998413
2	SRR11577865	SRP253798	AMPLICON	3181	3177	0.998743
3	SRR11578028	SRP253798	AMPLICON	2688	2681	0.997396
4	SRR11578029	SRP253798	AMPLICON	2367	2314	0.977609
5	SRR11577862	SRP253798	AMPLICON	2285	2196	0.96105
6	SRR11578392	SRP253798	AMPLICON	1517	1491	0.982861
7	SRR11578031	SRP253798	AMPLICON	1557	1449	0.930636
8	SRR11578030	SRP253798	AMPLICON	1259	1254	0.996029
9	SRR11578094	SRP253798	AMPLICON	1174	1060	0.902896

anekrut · May 18, 2020, 1:48pm

Dear @torsten:

Sorry if this came across critical. I just found these samples somewhat unusual. You guys are actually the ones who do release the data. Most GISAID assemblies are mysterious in nature as there are no corresponding read-level datasets. So this is great! My scanning over the data does not seem to indicate that it is primer-derived (e.g., see SRR11578269 from above) but I will incorporate primer scanning for ampliconic data. Thank you again and please continue setting the example for the entire community by releasing the data.

NicolaDeMaio · January 26, 2021, 4:16pm

Dear @torsten and @anekrut , what do you think of studies that do investigate within-host polymorphisms from this dataset? I have seen a few of them, the latest one is https://doi.org/10.3390/v13010133 . Should one be concerned?

anekrut · January 26, 2021, 7:09pm

I scanned it quickly, but intrahost variants (particularly from ampliconic data such as PRJNA613958) are tricky. There is large amount of noise. I also cannot comment on ViVarSeq.