Gained stops in data from The Peter Doherty Institute for Infection and Immunity

We perform continuous analyses of all read-level data from short-read archives (both SRA and ERA). At this point we have 1,343 distinct Illumina datasets from 22 studies. One study (SRP253798) stands out as highly problematic and likely cannot be used for reliable analysis of intra-host variation, so use caution. Here are the details:

The vast majority of gained stops come from a single study

(Variant list we used for this analysis can be found here).

After applying our variant-calling workflow to these data, we discovered an unusually high number of substitutions creating stop codons, an obvious red flag. Here is a scatterplot of the relationship between the total number of variants and the number of variants creating a stop codon, colored by SRA study accession:
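As a minimal illustration of what "gained stop" means here (a sketch only; the actual workflow relies on a proper variant annotator), a substitution gains a stop when it turns a sense codon into TAA, TAG, or TGA:

```python
# Sketch: does a single-base substitution turn a sense codon into a stop codon?
# Illustrative only; the real workflow uses a full variant-effect annotator.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def gains_stop(ref_codon: str, pos_in_codon: int, alt_base: str) -> bool:
    """True if substituting alt_base at pos_in_codon (0-2) creates a stop codon."""
    ref = ref_codon.upper()
    alt = ref[:pos_in_codon] + alt_base.upper() + ref[pos_in_codon + 1:]
    return ref not in STOP_CODONS and alt in STOP_CODONS
```

For example, a C-to-T change at the first position of CAA (glutamine) yields TAA, a stop.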


You can see that the plot is dominated by SRP253798. Removing this one study accession cleans up the picture quite a bit:


This particular study, SRP253798, is also by far the biggest in terms of samples (raw read datasets):

| study_accession | samples |
|---|---|
| SRP253798 | 876 |
| ERP121228 | 227 |
| SRP253926 | 118 |
| SRP256957 | 54 |
| SRP255151 | 15 |
| SRP258006 | 10 |
| SRP248092 | 7 |
| SRP254688 | 6 |
| SRP251618 | 5 |
| SRP257865 | 4 |
| SRP254488 | 4 |
| SRP250294 | 4 |
| SRP259532 | 2 |
| SRP242226 | 2 |
| SRP040661 | 2 |
| SRP257463 | 1 |
| SRP256770 | 1 |
| SRP255769 | 1 |
| SRP255672 | 1 |
| SRP252988 | 1 |
| SRP250653 | 1 |
| SRP245409 | 1 |

Most stops are from C-to-T transitions

If one plots a very simple relationship between the number of C-to-T transitions (a possible signature of APOBEC-induced RNA editing) and the total number of changes for all study accessions, we see this strange 1-to-1 relationship:


Again, it is dominated by SRP253798. Removing it produces a far more reasonable picture:


We are excluding these samples from analyses completely.


Thanks for this analysis.

I am curious about two aspects of your automated pipeline:

  1. do you specifically handle amplicon data (i.e. correctly trim primer sequences) according to the primer scheme being employed? This may have an impact on the detection of mixed positions where there is a difference between the primer sequence and the template.

  2. whether your pipeline filters reads according to Q score in a principled manner. E.g. some centres might deposit raw, untrimmed data without read quality filtering, and some might upload quality filtered data at Q30. This could have a big impact on your results and unfairly penalise the Doherty centre data (I am not sure what they do).



Excellent points.

Point 1

do you specifically handle amplicon data (i.e. correctly trim primer sequences) according to the primer scheme being employed

Aside from removing adapter contamination and looking at over-represented sequences, no. However, most of the data is ampliconic:

| study_accession | library_strategy | samples |
|---|---|---|
| SRP253798 | AMPLICON | 876 |
| ERP121228 | AMPLICON | 227 |
| SRP253926 | AMPLICON | 118 |
| SRP256957 | AMPLICON | 54 |
| SRP255151 | AMPLICON | 15 |
| SRP258006 | AMPLICON | 10 |
| SRP248092 | RNA-Seq | 7 |
| SRP254688 | RNA-Seq | 6 |
| SRP251618 | RNA-Seq | 5 |
| SRP257865 | WGS | 4 |
| SRP254488 | AMPLICON | 4 |
| SRP250294 | WGS | 4 |
| SRP259532 | AMPLICON | 2 |
| SRP242226 | RNA-Seq | 2 |
| SRP040661 | WGS | 2 |
| SRP257463 | WGS | 1 |
| SRP256770 | RNA-Seq | 1 |
| SRP255769 | RNA-Seq | 1 |
| SRP255672 | AMPLICON | 1 |
| SRP252988 | RNA-Seq | 1 |
| SRP250653 | RNA-Seq | 1 |
| SRP245409 | RNA-Seq | 1 |

If we look at the relationship between the number of stop-introducing variants and all variants for the three largest studies, SRP253798 still looks strange:

Point 2

whether your pipeline filters reads according to Q score in a principled manner

The strategy is:

  • map
  • remove duplicates
  • retain properly paired reads only (if the data is PE)
  • run lofreq realign
  • run lofreq call at sites with base quality >= 30, mapping quality >= 20, and minimum coverage of 50
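The site-level thresholds in the last step can be expressed as a simple predicate (a sketch with illustrative field names, not the actual lofreq interface):

```python
# Sketch of the site-level thresholds used with lofreq call (field names illustrative).
MIN_BASE_Q = 30  # minimum base quality
MIN_MAP_Q = 20   # minimum mapping quality
MIN_COV = 50     # minimum coverage

def passes_filters(site: dict) -> bool:
    """site: dict with 'baseq', 'mapq', and 'depth' keys (illustrative schema)."""
    return (site["baseq"] >= MIN_BASE_Q
            and site["mapq"] >= MIN_MAP_Q
            and site["depth"] >= MIN_COV)
```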

The strange thing is that, from the standpoint of "common sense" metrics such as strand bias or read placement, the variants in the Doherty data actually look good. Below is a stop in nsp3 (Run SRR11578269):

Thanks. It’s curious - and interesting. I don’t believe in that case that this is sequencing error, i.e. error arising from the sequencing instrument. But it would be interesting to find out exactly what the cause of these high-density runs of C>T mutations is. I think RNA editing is a good bet.


@n_j_loman: Indeed, it’s interesting, and there are a number of studies suggesting extensive opportunity for RNA editing.

I hear there are ~20,000 UK NGS datasets available. If you had a chance to look at them, would you be willing to share whether any significant fraction shows evidence of C>T or G>T hypermutations?

Definitely. We’ve started on variant frequency analysis and I’ll report back as soon as I can. I expect most of these datasets will be deposited in SRA before very long for others to look at too.

Would it be possible to post the run accessions for these extreme outlier samples as represented in this plot?



Thanks for taking the time to look at our data set. The data comes from a public health setting and was in no way designed for intra-host variation analysis (or more specifically, intra-sample variation analysis). These samples travel from pathology labs to the RT-PCR lab to the WGS lab, often through multiple freeze/thaw cycles, without any RNA protection buffer. This is not ideal, but it is part of the reality of public health and clinical microbiology.

We build consensus genomes from the data. Currently we use ivar as our primary tool; its trim module correctly removes primer sequences based on the ARTIC .bed file used in the amplicon assay. I note you did not do this in your Galaxy workflow, which puts you at risk of calling false SNPs in the (many) primer regions, because the reads contain non-template DNA that can differ from the sample. This could cause false minor alleles like the ones you report; that said, I believe most of the ones you have found are real. We believe RNA degradation is the main cause, which your work has helped highlight.
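For readers unfamiliar with primer trimming, the idea behind `ivar trim` can be sketched as follows: if a mapped read's 5' end falls inside a primer interval from the scheme's BED file, clip the primer-covered bases. A toy plus-strand version with made-up intervals (the real tool works on BAM alignments and handles both strands):

```python
# Toy sketch of amplicon primer trimming (0-based, half-open intervals).
# Intervals below are made up; real ones come from the ARTIC scheme BED file.
PRIMERS = [(30, 54), (385, 410)]

def trim_read(read_start: int, seq: str) -> str:
    """Clip leading bases of a plus-strand read whose 5' end lies in a primer."""
    for p_start, p_end in PRIMERS:
        if p_start <= read_start < p_end:
            return seq[p_end - read_start:]  # drop primer-derived bases
    return seq
```

Without this step, the clipped bases would instead be treated as evidence for the primer's sequence at those positions, which is how false minor alleles can arise.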

We currently call consensus at a 90% threshold, which aligns nicely with your 10% minor-allele fraction. For some of our samples we noticed a large number of "heterozygous" sites, which are encoded using ambiguous IUPAC codes, and yes, most of them were codes encompassing "T" plus an A, G, or C ambiguity, which is consistent with the RNA degradation theory. Other groups appear to be using thresholds lower than 90%, which would help mitigate this problem.
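For reference, the IUPAC encoding of a mixed site can be sketched like this (a toy lookup; ivar's actual consensus logic also weighs allele frequencies and depth):

```python
# Sketch: IUPAC ambiguity code for the set of bases observed at a mixed site.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def ambiguity_code(alleles) -> str:
    """Return the one-letter IUPAC code for a set of observed bases."""
    bases = frozenset(b.upper() for b in alleles)
    return next(iter(bases)) if len(bases) == 1 else IUPAC[bases]
```

The T-containing two-base codes are Y (C/T), W (A/T), and K (G/T); an excess of those is what the 90% threshold surfaces under the degradation theory.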

We will be submitting more data to SRA soon; I suspect it will have similar problems. I am also endeavouring to add more metadata about each sample to help people infer patterns or batch effects. This is an ongoing project for us, as COVID cases have not ceased here yet. The FASTQ we submitted comprised all read pairs that mapped to the reference genome; this was to ensure we did not submit any human DNA, although a few chimeras have gotten through. We did not filter in any other way, e.g. by quality.

The main use case for our data is cluster detection and analysis - using genomics in combination with epidemiology to best understand transmission and sources. Fortunately, this is only minimally affected by the issues you highlight in your post! But we will improve our processes to ensure the minor alleles don’t affect our decision making. Here is our paper showing how we used the data:

I am glad you wrote this post, but I feel the title is a bit alarmist, and it could be made more obvious that your criticism applies only to intra-sample analysis and minor alleles. I would be interested to see whether your results change when you handle the non-template primer DNA correctly, but I think they will largely be the same, as the virus has low diversity.

I think we can agree that this is an excellent example of open data and many eyes working for the greater good!


Dear @n_j_loman:

Here is a table of 10 accessions from SRP253798 with the highest number of C->T changes (variants = total number of variants with AF > 5%, CT = number of variants that are C-to-T substitutions):

| Sample | study_accession | library_strategy | variants | CT | ratio |
|---|---|---|---|---|---|
| SRR11578269 | SRP253798 | AMPLICON | 5179 | 5174 | 0.999035 |
| SRR11578268 | SRP253798 | AMPLICON | 3781 | 3775 | 0.998413 |
| SRR11577865 | SRP253798 | AMPLICON | 3181 | 3177 | 0.998743 |
| SRR11578028 | SRP253798 | AMPLICON | 2688 | 2681 | 0.997396 |
| SRR11578029 | SRP253798 | AMPLICON | 2367 | 2314 | 0.977609 |
| SRR11577862 | SRP253798 | AMPLICON | 2285 | 2196 | 0.961050 |
| SRR11578392 | SRP253798 | AMPLICON | 1517 | 1491 | 0.982861 |
| SRR11578031 | SRP253798 | AMPLICON | 1557 | 1449 | 0.930636 |
| SRR11578030 | SRP253798 | AMPLICON | 1259 | 1254 | 0.996029 |
| SRR11578094 | SRP253798 | AMPLICON | 1174 | 1060 | 0.902896 |
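The ratio column is simply CT / variants per run; a quick sketch reproducing the top row (numbers taken from the table above):

```python
# Sketch: C->T fraction per run, as in the table above.
def ct_fraction(ct: int, variants: int) -> float:
    """Fraction of a run's variants that are C-to-T substitutions."""
    return ct / variants if variants else 0.0

# Top row: SRR11578269 has 5174 C->T calls out of 5179 variants.
top = ct_fraction(5174, 5179)  # rounds to 0.999035, matching the table
```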

Dear @torsten:

Sorry if this came across as critical. I just found these samples somewhat unusual. You guys are actually the ones who do release the data; most GISAID assemblies are mysterious in nature, as there are no corresponding read-level datasets. So this is great! My scan of the data does not seem to indicate that the signal is primer-derived (e.g., see SRR11578269 above), but I will incorporate primer trimming for ampliconic data. Thank you again, and please continue setting the example for the entire community by releasing the data.


Dear @torsten and @anekrut, what do you think of studies that do investigate within-host polymorphisms from this dataset? I have seen a few of them; the latest one is . Should one be concerned?

I scanned it quickly, but intra-host variants (particularly from ampliconic data such as PRJNA613958) are tricky. There is a large amount of noise. I also cannot comment on ViVarSeq.