Missing G21987A mutation in SARS-CoV-2 delta variants due to non-specific amplification by ARTIC v3 primers

Context

The G21987A mutation, which is found in the delta lineage of SARS-CoV-2, is often absent from consensi generated at the Health2030 Genome Center (Geneva), which led to the classification of the samples as AY* sublineage. However, the consensi generated at the Centre Hospitalier Universitaire Vaudois (CHUV) contain the mutation, showing a discrepancy between our 2 centres (Trestan Pillonel, CHUV, personal communication)

The protocol used at CHUV for sequencing SARS-CoV-2 is based on the PARAGON GENOMICS CleanPlex primer set, whereas we generate amplicons based on ARTIC v3 (Illumina COVIDSeq). In fact, it has recently been described by Alban Ramette’s lab (Institute for Infectious Diseases, IFIK) that the 21300-22300 region is not correctly amplified by the ARTIC v3 primer set (Investigating the biological and technical origins of unknown bases in the S region of the SARS-CoV-2 Delta variant genome sequences | medRxiv).

The Illumina COVIDSeq amplification protocol uses 2 separate primer pools generating non-overlapping amplicons that are pooled in the following steps of the procedure to cover the entire genome. To understand why we do not observe the G21987A mutation, we prepared samples of the SARS-CoV-2 alpha (4 samples) and delta (4 samples) variants (courtesy provided by Hôpitaux Universitaires de Genève, HUG) using the Illumina COVIDSeq protocol, but sequenced the 2 pools separately. We focused on the region of interest, pos 21300-22300 of the SARS-CoV-2 genome, which is covered by 3 amplicons produced by ARTIC v3 primers.

G21987A_schema

Observations for alpha variants

The 3 amplicons of interest are correctly amplified. The coverage at position 21987 for pool 1 (mean=3931.3 std=526.2) is lower than for pool 2 (mean=10545.8 std=2188.7), but within the ranges expected by our protocol.

An IGV view of the region of interest for one alpha variant analyzed:


Observations for delta variants

Amplicon 21990-22324 is not efficiently amplified (coverage at pos 21987: mean=2224.0, sdt=1393.9), and the majority of the reads do not come from the amplicon itself, but from amplification between this amplicon and the previous one in the pool. Our hypothesis is that 1) at the initial PCR cycles there is some kind of suboptimal amplification of the 2 expected regions and 2) in the later stages of the PCR there is an amplification between the 2 produced amplicons, probably due to better affinity than the original primers. This would result in the integration of the G present in the original primer in the reads.

The 21658-22038 amplicon in pool 2 is barely amplified (coverage at pos 21987: mean=167.8, std=86.4), but includes the expected nucleotide A.

An IGV view of the region of interest for one delta variant analyzed:


Conclusion

The G21987A mutation is mostly absent from the sequencing of the SARS-CoV-2 delta lineage because the ARTIC v3 protocol produces a non-specific amplification between amplicons 21357-21743 and 21990-22324 which would incorporate the G at position 21987 from the primer sequence. This is not counterbalanced by amplicon 21658-22038 as it shows consistently a much lower amplification. The 5’ primer anchor of the 21990-22324 amplicon would be lost due to non-specific amplification, hence the ineffectiveness of standard tools in removing primers from reads in this region.

The problem described should be corrected by the ARTIC v4 version of the primers, which will soon be available as an add-on to the Illumina COVIDSeq kit (Illumina, personal communication).

1 Like

I think your explanation makes sense; an amplicon containing the primer base could be produced by the interaction of a ‘run though’ product and amplicon 73 by site-directed mutagenesis. As the pools are mixed then fragmented you have no way to trim the reads properly in this case. The only way I can think of avoiding this is to barcode the pools separately if you are using fragmentation.

I have a couple of other quick questions;

  1. Which pipeline did you run the samples through - the output from the COVIDSEQ kit can be run through other pipelines, and this may generate consensus that is better, even with low coverage for particular amplicons.

  2. Would be useful to know which instrument you sequenced on and which kit you used. We generally go for a 2x150bp kit on the NextSeqs here for samples that have been amplified using COVIDSEQ kits - as we had poor quality results when using the sequencing kits (with shorter readlengths) included with the COVIDSEQ kits themselves. If you have been using shorter readlengths then it might be worth seeing if you observe the same issue if you were to sequence the COVIDSEQ kit products on a longer readlength run.

To follow up @connortr’s 1st point - we have seen genomes across the pandemic which may have apparent reversions to reference genome. Some of which we believe are due to the pipeline used (I.e., not trimming primers when the overlapping amplicon has dropped out). These dropout issues can occur in any lineage when the Ct is high so if there is an issue with the pipeline it may not be restricted to Delta.

Thanks for this useful report.

It is important when analysing amplicon data to try and account for non-specific amplification.

In our ARTIC ONT pipeline we attempt to assign full-length reads to individual amplicons. If the start and end coordinates are in conflict with the primer scheme we will discard those reads.

This can be trickier when using fragmented sequencing protocols e.g. Nextera in the case of COVIDSeq as it’s not always clear which amplicon region the read belongs to.

Having said that, I believe the iVar software (wrapped in the Connor Lab nextflow pipeline in use in the COG-UK project) has some specific support for detecting these type of problems, e.g. as documented here:

Indeed, I agree with the idea of barcoding the pools separately, at least to detect an imbalance of mutations between the 2 pools. This is something we need to discuss with the lab to see if it is feasible under our current conditions.

Very nice report. This finally explains why we see, after primer trimming, the G (reference) base in the middle of reads for Delta. Primer trimming is thus not enough to overcome this issue. Looking forward to having the V4 artic primers.

Hello @connortr. Before I answer your questions, I need to set the context. We are doing sequencing for the surveillance programme in Switzerland and we have 14 days between sample collection and submission to GISAID (everything is designed to process 768 samples/week). The first week is dedicated to collecting samples from hospitals and private laboratories; on Tuesday of the second week, the samples arrive at our institute and the libraries are prepared. The run start on Thursday late afternoon and end on Friday morning. On Friday lunchtime, we hand over the consensi to the hospitals, which are responsible for sending them directly, or via the SPSP, to GISAID. So we are very limited in time.

So, to answer your questions:

  1. We use an internal pipeline designed to process all samples as quickly as possible, but as accurately as possible (2nd place in the ESCMID assessment). But I think with the delta variant we have more problems and we might consider other pipelines.

  2. We use a read length of 59 nt, as the run has to be performed overnight on a NovaSeq. Unfortunately, due to time constraints, we cannot plan for longer reads. The tests we did in early 2020 showed that the 59nt length produced results comparable to those of longer reads. But this was before the advent of deltas which may be a problem now due to the lower affinity of primers (at least in some regions).

Indeed, you are right @arambaut. These events can also occur at other places in the genome if we have high Ct’s or if the quality of the initial RNA is not optimal. But I have to stress that these are sporadic events, whereas G21987A is a systematic problem that cannot be solved with primers trimming.

Yes, we tested the iVar pipeline. We excluded it because of the execution time and time constraints we have. However, if ARTIC v4 does not solve the problem, we may consider switching to other solutions like iVar. Thanks for the feedback @n_j_loman.

I believe Illumina fixed the primer trimming issue last year as it was causing mis-calls. But what I mean is here not filtering out short or off target amplicons. The issue with Delta and the amplicon 72 dropout with V3 may be revealing a flaw in the bioinformatics pipeline that could otherwise be happening silently in other contexts.

How critical it is depends on what the genomes are being used for.

There have been a coupe of recent preprints on this issue: