Missing G21987A mutation in SARS-CoV-2 delta variants due to non-specific amplification by ARTIC v3 primers

lcerutti · October 20, 2021, 7:25am

Context

The G21987A mutation, which is found in the delta lineage of SARS-CoV-2, is often absent from consensi generated at the Health2030 Genome Center (Geneva), which led to the classification of the samples as AY* sublineage. However, the consensi generated at the Centre Hospitalier Universitaire Vaudois (CHUV) contain the mutation, showing a discrepancy between our 2 centres (Trestan Pillonel, CHUV, personal communication)

The protocol used at CHUV for sequencing SARS-CoV-2 is based on the PARAGON GENOMICS CleanPlex primer set, whereas we generate amplicons based on ARTIC v3 (Illumina COVIDSeq). In fact, it has recently been described by Alban Ramette’s lab (Institute for Infectious Diseases, IFIK) that the 21300-22300 region is not correctly amplified by the ARTIC v3 primer set (https://www.medrxiv.org/content/10.1101/2021.09.09.21262951v1.full-text).

The Illumina COVIDSeq amplification protocol uses 2 separate primer pools generating non-overlapping amplicons that are pooled in the following steps of the procedure to cover the entire genome. To understand why we do not observe the G21987A mutation, we prepared samples of the SARS-CoV-2 alpha (4 samples) and delta (4 samples) variants (courtesy provided by Hôpitaux Universitaires de Genève, HUG) using the Illumina COVIDSeq protocol, but sequenced the 2 pools separately. We focused on the region of interest, pos 21300-22300 of the SARS-CoV-2 genome, which is covered by 3 amplicons produced by ARTIC v3 primers.

Observations for alpha variants

The 3 amplicons of interest are correctly amplified. The coverage at position 21987 for pool 1 (mean=3931.3 std=526.2) is lower than for pool 2 (mean=10545.8 std=2188.7), but within the ranges expected by our protocol.

An IGV view of the region of interest for one alpha variant analyzed:

Observations for delta variants

Amplicon 21990-22324 is not efficiently amplified (coverage at pos 21987: mean=2224.0, sdt=1393.9), and the majority of the reads do not come from the amplicon itself, but from amplification between this amplicon and the previous one in the pool. Our hypothesis is that 1) at the initial PCR cycles there is some kind of suboptimal amplification of the 2 expected regions and 2) in the later stages of the PCR there is an amplification between the 2 produced amplicons, probably due to better affinity than the original primers. This would result in the integration of the G present in the original primer in the reads.

The 21658-22038 amplicon in pool 2 is barely amplified (coverage at pos 21987: mean=167.8, std=86.4), but includes the expected nucleotide A.

An IGV view of the region of interest for one delta variant analyzed:

Conclusion

The G21987A mutation is mostly absent from the sequencing of the SARS-CoV-2 delta lineage because the ARTIC v3 protocol produces a non-specific amplification between amplicons 21357-21743 and 21990-22324 which would incorporate the G at position 21987 from the primer sequence. This is not counterbalanced by amplicon 21658-22038 as it shows consistently a much lower amplification. The 5’ primer anchor of the 21990-22324 amplicon would be lost due to non-specific amplification, hence the ineffectiveness of standard tools in removing primers from reads in this region.

The problem described should be corrected by the ARTIC v4 version of the primers, which will soon be available as an add-on to the Illumina COVIDSeq kit (Illumina, personal communication).

j_quick · October 20, 2021, 11:16am

I think your explanation makes sense; an amplicon containing the primer base could be produced by the interaction of a ‘run though’ product and amplicon 73 by site-directed mutagenesis. As the pools are mixed then fragmented you have no way to trim the reads properly in this case. The only way I can think of avoiding this is to barcode the pools separately if you are using fragmentation.

connortr · October 20, 2021, 11:46am

I have a couple of other quick questions;

Which pipeline did you run the samples through - the output from the COVIDSEQ kit can be run through other pipelines, and this may generate consensus that is better, even with low coverage for particular amplicons.
Would be useful to know which instrument you sequenced on and which kit you used. We generally go for a 2x150bp kit on the NextSeqs here for samples that have been amplified using COVIDSEQ kits - as we had poor quality results when using the sequencing kits (with shorter readlengths) included with the COVIDSEQ kits themselves. If you have been using shorter readlengths then it might be worth seeing if you observe the same issue if you were to sequence the COVIDSEQ kit products on a longer readlength run.

arambaut · October 20, 2021, 11:54am

To follow up @connortr’s 1st point - we have seen genomes across the pandemic which may have apparent reversions to reference genome. Some of which we believe are due to the pipeline used (I.e., not trimming primers when the overlapping amplicon has dropped out). These dropout issues can occur in any lineage when the Ct is high so if there is an issue with the pipeline it may not be restricted to Delta.

n_j_loman · October 20, 2021, 1:12pm

Thanks for this useful report.

It is important when analysing amplicon data to try and account for non-specific amplification.

In our ARTIC ONT pipeline we attempt to assign full-length reads to individual amplicons. If the start and end coordinates are in conflict with the primer scheme we will discard those reads.

This can be trickier when using fragmented sequencing protocols e.g. Nextera in the case of COVIDSeq as it’s not always clear which amplicon region the read belongs to.

Having said that, I believe the iVar software (wrapped in the Connor Lab nextflow pipeline in use in the COG-UK project) has some specific support for detecting these type of problems, e.g. as documented here:

github.com/andersen-lab/ivar

Filter amplication artifacts

opened 01:36PM - 16 Oct 20 UTC

closed 10:27PM - 06 Nov 20 UTC

jts

**Is your feature request related to a problem? Please describe.** In multipl…exed PCR protocols, aberrant amplicons can be generated which cross the expected amplicon boundaries (for example a fragment may start in amplicon 5 and extend all the way to amplicon 7). In ligation-based sequencing protocols these are easy to identify based on the primers on either end of the paired reads. With Nextera-based protocols it is more difficult but populations of paired-end fragments that begin in one amplicon and end in another are fairly straightforward to identify. In my experience these artifacts can lower the allele frequency of SNPs in primer binding sites, occasionally by as much as 10-15%. **Describe the solution you'd like** I wrote a simple python prototype to filter these, the core of it is here: ``` # make an interval tree containing the amplicons tree = IntervalTree() for a in amplicons: tree.addi(amplicons[a].start, amplicons[a].end, a) bam_file = pysam.AlignmentFile(args.bam, "rb") bam_out = pysam.AlignmentFile("-", "w", template=bam_file) for alignment in bam_file.fetch(until_eof=True): fragment_start = alignment.reference_start fragment_end = 0 tlen = alignment.template_length if tlen < 0: fragment_start = alignment.reference_end fragment_end = fragment_start + tlen start_intervals = tree.at(fragment_start) end_intervals = tree.at(fragment_end) # make a set of amplicon IDs that are compatible with the fragment start/end start_id_set = set([x[2] for x in start_intervals]) end_id_set = set([x[2] for x in end_intervals]) id_intersection = start_id_set.intersection(end_id_set) if len(id_intersection) == 1: bam_out.write(alignment) ``` This creates an interval tree containing from the coordinates of each amplicon, then looks to see if the read pair is compatible with exactly one set of coordinates to ensure it does not cross the amplicon boundary. I'm considering adding this python script into our fork of the COG-UK nextflow (cc @m-bull) but it would be cleaner if this feature could be added to `ivar trim` (probably best to put it behind a flag that is defaulted to off) to avoid an extra pass through the BAM file. Thanks!

lcerutti · October 20, 2021, 1:29pm

Indeed, I agree with the idea of barcoding the pools separately, at least to detect an imbalance of mutations between the 2 pools. This is something we need to discuss with the lab to see if it is feasible under our current conditions.

vborges · October 20, 2021, 1:36pm

Very nice report. This finally explains why we see, after primer trimming, the G (reference) base in the middle of reads for Delta. Primer trimming is thus not enough to overcome this issue. Looking forward to having the V4 artic primers.

lcerutti · October 20, 2021, 2:01pm

Hello @connortr. Before I answer your questions, I need to set the context. We are doing sequencing for the surveillance programme in Switzerland and we have 14 days between sample collection and submission to GISAID (everything is designed to process 768 samples/week). The first week is dedicated to collecting samples from hospitals and private laboratories; on Tuesday of the second week, the samples arrive at our institute and the libraries are prepared. The run start on Thursday late afternoon and end on Friday morning. On Friday lunchtime, we hand over the consensi to the hospitals, which are responsible for sending them directly, or via the SPSP, to GISAID. So we are very limited in time.

So, to answer your questions:

We use an internal pipeline designed to process all samples as quickly as possible, but as accurately as possible (2nd place in the ESCMID assessment). But I think with the delta variant we have more problems and we might consider other pipelines.
We use a read length of 59 nt, as the run has to be performed overnight on a NovaSeq. Unfortunately, due to time constraints, we cannot plan for longer reads. The tests we did in early 2020 showed that the 59nt length produced results comparable to those of longer reads. But this was before the advent of deltas which may be a problem now due to the lower affinity of primers (at least in some regions).

lcerutti · October 20, 2021, 2:10pm

Indeed, you are right @arambaut. These events can also occur at other places in the genome if we have high Ct’s or if the quality of the initial RNA is not optimal. But I have to stress that these are sporadic events, whereas G21987A is a systematic problem that cannot be solved with primers trimming.

lcerutti · October 20, 2021, 2:18pm

Yes, we tested the iVar pipeline. We excluded it because of the execution time and time constraints we have. However, if ARTIC v4 does not solve the problem, we may consider switching to other solutions like iVar. Thanks for the feedback @n_j_loman.

arambaut · October 20, 2021, 3:55pm

I believe Illumina fixed the primer trimming issue last year as it was causing mis-calls. But what I mean is here not filtering out short or off target amplicons. The issue with Delta and the amplicon 72 dropout with V3 may be revealing a flaw in the bioinformatics pipeline that could otherwise be happening silently in other contexts.

How critical it is depends on what the genomes are being used for.

AMNiewiadomska · October 20, 2021, 4:29pm

There have been a coupe of recent preprints on this issue: