Anton,
Thanks for taking the time to look at our data set. The data comes from a public health setting and was in no way designed for studying intra-host variation (or, more specifically, intra-sample variation). These samples travel from pathology labs to the RT-PCR lab to the WGS lab, often going through multiple freeze/thaw cycles, without any RNA protection buffer. This is not ideal, but it is part of the reality of public health and clinical microbiology.
We build consensus genomes from the data. Currently we use ivar as our primary tool; its trim module correctly removes primer sequence based on the ARTIC .bed file used in the amplicon assay. I note you did not do this in your Galaxy workflow, which puts you at risk of calling false SNPs in the (many) primer regions, because the reads there contain non-template primer DNA that can differ from the sample. This could cause false minor alleles like those you report, but that said, I believe most of the ones you have found are real. We believe RNA degradation is the main cause, which your work has helped highlight.
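As a rough illustration of the primer concern, the sketch below flags variant positions that fall inside primer intervals from a BED file. It is a coarse check under stated assumptions, not a substitute for trimming reads with ivar: the function names `parse_bed` and `in_primer`, the reference name `ref`, and the coordinates are all made up for the example and are not real ARTIC primer positions.

```python
def parse_bed(text):
    """Parse BED text into (chrom, start, end) tuples.

    BED intervals are 0-based and half-open: [start, end).
    """
    intervals = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def in_primer(chrom, pos, intervals):
    """Return True if a 1-based position (as used in VCF) lies in any interval."""
    # 1-based pos p is 0-based p - 1, so start <= p - 1 < end
    # is equivalent to start < p <= end.
    return any(c == chrom and start < pos <= end for c, start, end in intervals)


# Hypothetical primer intervals, for illustration only.
example_bed = "ref\t30\t54\tprimer_1_LEFT\nref\t385\t410\tprimer_1_RIGHT\n"
primers = parse_bed(example_bed)

for pos in (30, 31, 54, 55, 400):
    print(pos, in_primer("ref", pos, primers))
```

A site flagged this way is not necessarily wrong, since overlapping amplicons also cover primer regions with genuine template sequence; the flag only marks where untrimmed reads could contribute primer-derived bases.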
We are currently calling consensus at a 90% threshold, which corresponds nicely to your 10% minor allele fraction. For some of our samples we noticed a large number of “heterozygous” sites, which are encoded using an ambiguous IUPAC code, and yes, most of them were codes encompassing a “T” plus an A, G or C, which is consistent with the RNA degradation theory. Other groups appear to be using thresholds lower than 90%, which would help mitigate this problem.
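To make the link between the consensus threshold and ambiguity codes concrete, here is a minimal sketch of threshold-based consensus calling at a single site. It assumes the common approach of adding bases in descending order of frequency until their combined fraction reaches the threshold; it is a simplification for illustration, not ivar's actual code, and the function name `consensus_base` and the read counts are invented.

```python
# IUPAC nucleotide codes, keyed by the alphabetically sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def consensus_base(counts, threshold=0.9):
    """Call a consensus symbol from per-base read counts at one site.

    Bases are added from most to least frequent until their combined
    fraction of reads reaches the threshold.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"  # no coverage
    chosen = []
    cumulative = 0
    # Sort by descending count, breaking ties alphabetically.
    for base, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n == 0:
            break
        chosen.append(base)
        cumulative += n
        if cumulative / total >= threshold:
            break
    return IUPAC["".join(sorted(chosen))]


# A site with 85% C and 15% T (made-up counts).
site = {"C": 85, "T": 15}
print(consensus_base(site, threshold=0.9))   # ambiguous call
print(consensus_base(site, threshold=0.75))  # unambiguous call
```

With these counts the 90% threshold yields the ambiguity code for C or T, while a 75% threshold calls a plain C, which is why a lower threshold hides minor alleles in the consensus.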
We will be submitting more data to SRA soon. I suspect it will have similar problems. I am endeavouring to add more metadata about each sample to help people infer patterns or batch effects too. This is an ongoing project for us, as COVID cases have not ceased for us yet. The FASTQ we submitted was all pairs that mapped to the reference genome. This was to ensure we did not submit any human DNA, but a few chimeras have gotten through. We did not filter in any other way, e.g. by quality.
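For readers unfamiliar with this kind of filter, the sketch below shows one way "pairs that mapped" can be decided from SAM flags. It assumes a pair is kept only when both mates are mapped, which may not match the rule actually used for the submitted FASTQ; the function name `pair_is_mapped` is hypothetical.

```python
# SAM flag bits, as defined in the SAM format specification.
PAIRED = 0x1
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def pair_is_mapped(flag):
    """Return True if the read is paired and both it and its mate are mapped.

    Assumption: a pair counts as "mapped" only when neither mate is unmapped.
    """
    if not flag & PAIRED:
        return False
    return not (flag & READ_UNMAPPED) and not (flag & MATE_UNMAPPED)


# Example flags: 99 is a mapped first mate in a proper pair,
# 77 is a first mate where both mates are unmapped,
# 73 is a mapped first mate whose mate is unmapped.
for flag in (99, 77, 73):
    print(flag, pair_is_mapped(flag))
```

A filter like this removes reads that do not align to the viral reference, but a chimeric read that is part viral and part human can still align and pass.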
The main use case for our data is cluster detection and analysis: using genomics in combination with epidemiology to best understand transmission and sources. Fortunately, this is only slightly affected by the issues you highlight in your post! But we will improve our processes to ensure the minor alleles don’t impact our decision making. Here is our paper showing how we used the data:
I am glad you wrote this post, but I feel the title is a bit alarmist, and it could be made more obvious that your criticism applies only to intra-sample analysis and minor alleles. I would be interested to see whether your results change when you handle the non-template primer DNA correctly, but I think they will largely be the same, as the virus has low diversity.
I think we can agree that this is an excellent example of open data and many eyes working for the greater good!
Torsten