Putative host origins of RNA insertions in SARS-CoV-2 genomes

Thomas P. Peacock1*, David L. V. Bauer2, Wendy S. Barclay1

1Department of Infectious Disease, St Mary's Medical School, Imperial College London, UK.

2The RNA Virus Replication Laboratory, The Francis Crick Institute, London, UK.

*corresponding author – thomas.peacock09@imperial.ac.uk


SARS-CoV-2 continues to evolve and adapt to humans. In this report, we describe RNA insertions, particularly those in the SARS-CoV-2 Spike protein, and show how they mostly cluster in the Spike N-terminal domain and at the S1/S2 cleavage site. While many insertion sequences appear to be viral in origin, we find that a subset of insertions show homology to RNA sequences from host transcripts, implying incorporation of short host RNA sequences during viral genome replication.


Over the course of the COVID-19 pandemic, SARS-CoV-2 has accumulated mutations that likely further adapt it to humans. These mutations arise through three major mechanisms: RNA substitutions, deletions and, less commonly, insertions. Insertions are less well characterised but are of particular interest because i) they are less ‘random’ than deletions or substitutions, as they require a source for the novel RNA sequence, and ii) they bear on the origin of SARS-CoV-2, which has a unique 12-nucleotide insertion in its Spike gene that creates a furin cleavage site.

Several variant lineages contain insertions in their Spike proteins, most notably the variant of interest Mu (B.1.621), which contains a 3-nucleotide insertion in the aa~140 region of Spike, as well as the A.2.5 and B.1.214.2 variants, which both have insertions in the aa~210 insertion hotspot region (Gerdol, Dishnica et al. 2021). In addition to circulating variants, insertions have also been identified in chronically infected individuals (Hoffman, Costales et al. 2021) and upon serial passage of SARS-CoV-2 in cell culture (Andreano, Piccini et al. 2021, Shiliaev, Lukash et al. 2021).

In this report we investigate the potential origins of both widespread and rare insertions in the SARS-CoV-2 genome, and relate this back to what is known about genome insertions in other RNA viruses.


Spike insertions are mostly found in the N-terminal domain and around the S1/S2 cleavage site.

The Spike protein of SARS-CoV-2 shows the greatest amount of variation amongst circulating strains. Mutations in Spike are associated with changes in SARS-CoV-2 transmissibility, pathogenicity and other viral properties. To investigate the distribution of insertions in the SARS-CoV-2 Spike protein, we searched for Spike insertions using GISAID and manually removed sequences that appeared most obviously artefactual (see Methods). We found that Spike insertions mostly fall within the Spike N-terminal domain or in the S1/S2 furin cleavage site region. Both of these regions are also known hotspots for deletions (McCarthy, Rennick et al. 2021), further confirming their plasticity.

Figure 1. Distribution of insertions in the Spike protein. Sequences with insertions in Spike were pulled from GISAID (as of 8th October 2021). Sequences were curated to remove low-quality sequences and likely artefacts, then plotted against the Spike protein sequence. Lineages of interest and insertions associated with high numbers of sequences are annotated.

Many nucleotide insertions in SARS-CoV-2 appear to be viral in origin

Next, we investigated the potential origins of the insertion sequences and expanded the analysis to several non-Spike insertions. Many insertions are clearly duplications of adjacent codons or codon pairs. For example, a recurrent Spike ins678QT insertion, encoded as CAG ACT, has arisen independently across multiple SARS-CoV-2 lineages. Another subset of insertions shows high homology to distal parts of the SARS-CoV-2 genome, often in the opposite sense (see Table 1 and Figure 2), indicating, as others have suggested, that they likely result from copy-choice recombination from the template genome during replication (Garushyants, Rogozin et al. 2021). One notable example is the Russian AT.1 lineage, which contains a 12-nucleotide insertion at the S1/S2 site: the AT.1 insertion shows high homology to the 3’ UTR of the SARS-CoV-2 genome, which is present in viral genomes and all sub-genomic RNAs and is therefore abundant within viral replication organelles.
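The two viral-origin patterns described above (duplication of adjacent sequence, and a match to a distal site on either strand) can be illustrated with a minimal Python sketch. It is not the method used in this report, which relied on BLASTN; it only looks for exact matches, and the function names and toy sequences are hypothetical.

```python
# Minimal sketch: classify an inserted sequence as a tandem duplication
# or as an exact copy of a distal site on either strand.
# Exact matching only; real analyses tolerate mismatches (e.g. via BLASTN).

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA-alphabet sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_tandem_duplication(upstream: str, insert: str, downstream: str) -> bool:
    """True if the insert repeats the sequence immediately before or after it."""
    insert = insert.upper()
    return upstream.upper().endswith(insert) or downstream.upper().startswith(insert)


def find_distal_matches(genome: str, insert: str) -> list[tuple[int, str]]:
    """Exact matches of the insert in the genome as (0-based start, strand)."""
    genome, insert = genome.upper(), insert.upper()
    hits = []
    for strand, query in (("+", insert), ("-", reverse_complement(insert))):
        start = genome.find(query)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(query, start + 1)
    return hits


# Toy example (made-up flanks and genome, not real SARS-CoV-2 coordinates).
print(is_tandem_duplication("ATTCAGACT", "CAGACT", "AATTCT"))
print(find_distal_matches("AAACAGACTGGGTTTAGTCTGCCC", "CAGACT"))
```

A "-" strand hit corresponds to the opposite-sense homology mentioned above. Note that a palindromic insert would be reported once per strand at the same position.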

| Insertion | Pango lineage | Monophyletic? | Location | GISAID IDs | Putative region of origin | E value |
|---|---|---|---|---|---|---|
| Spike ins654GAEGALNTP | B.1.617.2 | 2 clusters | India | 7x sequences | N 28,661-28,708 | 6.00E-09 |
| Spike ins159INTTC | B.1.2 cluster | Yes | Oregon, USA | EPI_ISL_1791466; EPI_ISL_1791465; EPI_ISL_1791473 | Spike 21,650-21,682 | 4.00E-06 |
| Spike ins214KAFKQ | B.1.1.7 cluster | Yes | Czech Republic | EPI_ISL_2228123; EPI_ISL_2228125 | NSP2 1,745-1,780 | 4.00E-05 |
| NSP3 ins153ENPHL | B.1.2 cluster | Yes | USA | 60x sequences | NSP12 15,256-15,294 | 4.00E-05 |
| Spike ins214KKLIRGD | B.1.638 cluster | Yes | South Africa | EPI_ISL_3451093; EPI_ISL_3451094 | Spike 22,745-22,789 | 8.00E-05 |
| Spike ins214W | B.1 cluster | Yes | New York | EPI_ISL_4096626; EPI_ISL_4096639 | N 29,218-29,247 | 2.00E-04 |
| Spike ins261DGSDK/A262S | B.1.617.2 cluster | No | India | EPI_ISL_2461980; EPI_ISL_2461981 | NSP3 7,666-7,707 | 8.00E-04 |
| Spike ins214ANRN | B.1.1.28 cluster | Yes | South America | 7x sequences | M 26,631-26,660 | 0.002 |
| NSP4 ins265ICFA/N266Y | B.1.2 cluster | Mostly | USA | 9x sequences | NSP7 11,998-12,036 | 0.003 |
| Spike N679K/ins679GIAL | AT.1 lineage | Yes | Worldwide ex Russia | 174x sequences | 3’ UTR 29,820-29,846 | 0.006 |
| Spike ins108AV | B.1.617.2 clusters | multiple clusters | Various | 108x sequences | NSP16 21,415-21,447 | 0.006 |
| ORF6 M58I/ins58FRSL | B.1.1.7 cluster | multiple clusters | Europe | 43x sequences | 5’ UTR 29-64 | 0.01 |
| Spike ins214QAS | B.1 cluster | Yes | North America | 19x sequences | N 1,156-1,182 | 0.017 |
| Spike ins214KRI | B cluster | Mostly | Denmark | EPI_ISL_972296; EPI_ISL_1880746; EPI_ISL_970811; EPI_ISL_929396 | NSP5 10,849-10,878 | 0.046 |
| Spike 247FKT | B.1.258.14 cluster | Yes | Italy | EPI_ISL_1229272; EPI_ISL_1359440; EPI_ISL_1359445; EPI_ISL_1558361 | NSP13 17,986-18,018 | 0.046 |

Table 1. Insertions with a high confidence of being viral in origin (derived from distal sites, implying template switching). Insertions are shown in ascending order of BLASTN E value.

Figure 2. Examples of SARS-CoV-2 clusters or lineages with strong support for viral RNA template switching. Wobble base pairing shown in red (sense) or green (antisense). E values from BLASTN shown.

Several nucleotide insertions in SARS-CoV-2 appear to originate from host mRNAs.

Although an appreciable number of insertion sequences showed high similarity to proximal or distal regions of the SARS-CoV-2 genome, many did not. We speculated that the other major source of RNAs that could serve as templates for insertions would be host mRNAs. We therefore used BLASTN to look for homology between these RNA inserts and the host transcriptome. Several hits with high homology to host mRNAs were found (see Table 2, Figure 3). For example, a cluster of B.1.1.519 viruses from the USA with a 12-nucleotide insertion in the N protein showed very high homology to the human ZBTB20 transcript. Another highly relevant, but more tentative, example is the insertion seen in the variant of interest Mu/B.1.621, which shows a degree of homology to the mRNA of human TRIM28.

| Insertion | Pango lineage | Monophyletic? | Location | GISAID IDs/references | Putative origin | E value |
|---|---|---|---|---|---|---|
| N ins390EMPV | B.1.1.519/B.1.1.7 clusters | No | West Virginia | EPI_ISL_2615927; EPI_ISL_2184872; EPI_ISL_2184874 | ZBTB20 9,337-9,372 | 5.00E-04 |
| ORF6 ins32TIIL | B.1.1.7 cluster | Yes | Scotland | EPI_ISL_1103220; EPI_ISL_1103106; EPI_ISL_1103136; EPI_ISL_1103154; EPI_ISL_1247705 | GOLIM4 4,194-4,226 | 0.011 |
| Spike ins214GLTSKRN | N/A | N/A | Laboratory mutant | Shiliaev et al | RASSF6 (Chlorocebus sabaeus) | 0.02 |
| Spike ins214KFH | B.1.1.7 cluster | Yes | Scotland | EPI_ISL_1190578; EPI_ISL_1063373; EPI_ISL_1123191 | HYDIN 9,463-9,492 | 0.029 |
| Spike ins216ADL | B.1.2 cluster | Yes | USA | EPI_ISL_1016594; EPI_ISL_1532440; EPI_ISL_1037192; EPI_ISL_1234589; EPI_ISL_2797236 | CTRC 139-168 | 0.029 |
| Spike ins214HSG | AY.4 cluster | Yes | UK | 7x sequences | PTPRB 4,412-4,444 | 0.029 |
| Spike ins214EGAE | AY.4 cluster | Yes | Germany | EPI_ISL_4223414; EPI_ISL_4223419 | ADIPOR1 18-50 | 0.029 |
| Spike Y144T/Y145S/ins145N | B.1.621 lineage | Yes | Worldwide ex Colombia | >5000 sequences | TRIM28 1,805-1,834 | 0.086 |

Table 2. Insertions with highest confidence of originating from the host. Insertions shown in ascending order of BLASTN E values.

Figure 3. Examples of SARS-CoV-2 clusters or lineages with support for host origin of RNA insertions. Wobble base pairing shown by full stops. Mismatches in the insertions shown in red. E values from BLASTN shown.

Final thoughts

Naturally occurring insertions of foreign RNA into viral genomes are rare but can have high consequence. Coronaviruses have previously incorporated genes from their hosts or other viruses. For example, the phosphodiesterases, NS2a of the Embecovirus subgenus (which includes OC43, HKU1 and MHV), and NS4b of MERS-CoV, are thought to be two independent acquisitions of vertebrate AKAP7. Furthermore, Embecovirus Haemagglutinin-Esterase glycoproteins share a high degree of structural homology with the same proteins from Orthomyxoviridae. However, in both these examples whole proteins or domains are used for a function similar to that in the host or other virus. Flaviviridae and Orthomyxoviridae have been shown to acquire host RNA sequences that confer phenotypes unrelated to their original function, more like the insertions proposed in this study. Firstly, the Pestivirus bovine viral diarrhea virus (BVDV) has been shown to repeatedly insert host RNA into its NS2 coding region, which can alter polyprotein cleavage and lead to a cytopathic phenotype in cell culture. Interestingly, some cytopathic BVDV achieved the same phenotype using viral-derived RNA sequences. Finally, and perhaps most relevant to the origin of SARS-CoV-2, is the example of the 2012 Mexican highly pathogenic avian influenza H7N3 outbreak (Maurer-Stroh, Lee et al. 2013). Avian influenza viruses of the H7 and H5 subtypes can exist as either low or high pathogenicity depending on the presence or absence of a polybasic furin cleavage site in their haemagglutinin proteins. It has been proposed that the 2012 Mexican H7N3 virus gained its furin cleavage site from heterologous recombination with host 28S ribosomal RNA, thought to be possible because genomic RNA replication occurs in the host nucleus/nucleolus, where nascent rRNA is synthesised.

Insertions in the SARS-CoV-2 genome are also of particular interest as they may have the potential for much greater phenotypic change than substitution or deletion alone, a prime example being the original insertion generating the furin cleavage site, which likely contributed to the pandemic potential of the virus. Likewise, insertion of additional loops in the Spike NTD, or further insertions at the S1/S2 site, may change the antigenicity or cleavability of these regions, respectively, and alter the phenotype of the viruses that emerge. These types of mutations can therefore act as ‘wildcard’ mutations that are hard to predict, and special care should be taken so that they can be accurately identified and characterised.


Methods

Spike insertions were identified using the search term “Spike_ins” in the GISAID database. Insertions were then curated to confirm that they were truly present and not obvious artefacts, using the following criteria: i) insertions were found in more than one sequence (i.e. were not unique); ii) insertions did not contain any unresolved nucleotides (i.e. N); iii) insertions were in frame and did not contain stop codons; iv) virus genomes did not contain multiple insertions; v) when sequences were aligned they still contained the insertions (i.e. were not artefacts from the metadata); vi) insertions had some degree of phylogenetic clustering (as shown using UShER) and, if they did not show much clustering, did not all come from a single uploading laboratory; and vii) sequences did not appear to show cross-lineage contamination. An example of the latter is the large number of Delta isolates with identical insertions and surrounding mutations to B.1.621/Mu, which is more likely due to the corresponding sequencing tile (known to drop out in Delta with the ARTIC V3 primers) picking up low levels of Mu contamination than to true recombination or convergent evolution.
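The purely sequence-based curation criteria (i to iii) lend themselves to a short sketch. The Python below is illustrative only, not the authors' pipeline: the record layout, function names and example IDs are hypothetical, and it assumes each insert sits between two codons, so that reading the insert from its first base reflects the codons it adds. Criteria iv to vii need alignments, phylogenetic placement and metadata, and are not covered.

```python
# Minimal sketch of curation criteria i-iii:
#   i)   the insertion is seen in more than one sequence
#   ii)  no unresolved nucleotides
#   iii) in frame, with no stop codon
# Assumption: the insert lies on a codon boundary (see lead-in).

from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def passes_sequence_checks(insert_nt: str) -> bool:
    """Apply criteria ii and iii to a single inserted nucleotide sequence."""
    seq = insert_nt.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:  # must be in frame
        return False
    if set(seq) - set("ACGT"):  # unresolved or ambiguous bases
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return not any(codon in STOP_CODONS for codon in codons)


def curate(records):
    """Keep (sequence_id, position, insert_nt) records passing criteria i-iii."""
    counts = Counter((pos, ins.upper()) for _, pos, ins in records)
    return [
        (seq_id, pos, ins)
        for seq_id, pos, ins in records
        if counts[(pos, ins.upper())] > 1 and passes_sequence_checks(ins)
    ]


# Hypothetical records: only the first two survive curation.
example = [
    ("seq_a", 214, "GAGCCAGAA"),
    ("seq_b", 214, "GAGCCAGAA"),
    ("seq_c", 214, "GAGNCAGAA"),  # unresolved base
    ("seq_d", 100, "GCCGCA"),     # seen only once
]
print(curate(example))
```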

To assess whether insertions in SARS-CoV-2 showed homology to the viral genome or host transcriptome, insertions and their flanking regions were searched using BLASTN against a reference SARS-CoV-2 genome (NC_045512.2) and the refseq_select Homo sapiens database (or Chlorocebus sabaeus in the case of the insertion from Shiliaev, Lukash et al. 2021). E values from BLASTN were reported with a cutoff of <0.05 (with select E values above 0.05 included for lineages of particular interest). To confirm Pango lineages and test whether insertion-containing clusters were monophyletic, sequences were analysed using Ultrafast Sample placement on Existing tRee (UShER; UCSC UShER: Upload) (Turakhia, Thornlow et al. 2021).
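As an illustration of the E-value filtering step, the sketch below parses BLASTN results in the standard 12-column tabular layout (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and keeps hits below a threshold of 0.05. It is an assumption that the searches were exported in this layout; the function names and example rows are hypothetical, with made-up values. In this format a minus-strand nucleotide hit is reported with sstart greater than send, which is how opposite-sense matches can be flagged.

```python
# Minimal sketch: filter BLASTN tabular output (-outfmt 6) by E value
# and record the strand of each hit. Example values are made up.

from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    s_start: int  # lower subject coordinate
    s_end: int    # higher subject coordinate
    evalue: float
    strand: str   # "plus" or "minus"


def parse_hits(tabular_text: str, max_evalue: float = 0.05) -> list[Hit]:
    """Return hits with E value below max_evalue, sorted by ascending E value."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        s_start, s_end, evalue = int(fields[8]), int(fields[9]), float(fields[10])
        if evalue >= max_evalue:
            continue
        strand = "plus" if s_start <= s_end else "minus"
        hits.append(Hit(fields[0], fields[1], min(s_start, s_end),
                        max(s_start, s_end), evalue, strand))
    return sorted(hits, key=lambda hit: hit.evalue)


# Hypothetical rows for three query inserts against a toy reference.
rows = [
    ["ins_A", "toy_ref", "100.0", "27", "0", "0", "1", "27", "500", "474", "0.006", "40.1"],
    ["ins_B", "toy_ref", "96.0", "25", "1", "0", "1", "25", "1000", "1024", "0.2", "30.0"],
    ["ins_C", "toy_ref", "100.0", "33", "0", "0", "1", "33", "210", "242", "4e-06", "55.0"],
]
for hit in parse_hits("\n".join("\t".join(row) for row in rows)):
    print(hit)
```

Because the queries are short, E values in this kind of search remain modest even for perfect matches, which is worth keeping in mind when interpreting hits near the threshold.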


The authors would like to thank Dr Ada Yan and Dr Daniel Goldhill for their help with the analysis and drafting of this report and Professor Julian Hiscox, and other members of the UK-G2P for their invaluable insights into the mechanisms of viral replication and recombination.


Andreano, E., G. Piccini, D. Licastro, L. Casalino, N. V. Johnson, I. Paciello, S. Dal Monego, E. Pantano, N. Manganaro, A. Manenti, R. Manna, E. Casa, I. Hyseni, L. Benincasa, E. Montomoli, R. E. Amaro, J. S. McLellan and R. Rappuoli (2021). “SARS-CoV-2 escape from a highly neutralizing COVID-19 convalescent plasma.” Proceedings of the National Academy of Sciences 118(36): e2103154118.

Garushyants, S. K., I. B. Rogozin and E. V. Koonin (2021). “Insertions in SARS-CoV-2 genome caused by template switch and duplications give rise to new variants of potential concern.” bioRxiv: 2021.04.23.441209.

Gerdol, M., K. Dishnica and A. Giorgetti (2021). “Emergence of a recurrent insertion in the N-terminal domain of the SARS-CoV-2 spike glycoprotein.” bioRxiv: 2021.04.17.440288.

Hoffman, S. A., C. Costales, M. K. Sahoo, S. Palanisamy, F. Yamamoto, C. Huang, M. Verghese, D. A. Solis, M. Sibai, A. Subramanian, L. S. Tompkins, P. Grant, R. W. Shafer and B. A. Pinsky (2021). “SARS-CoV-2 Neutralization Resistance Mutations in Patient with HIV/AIDS, California, USA.” Emerg Infect Dis 27(10): 2720-2723.

Maurer-Stroh, S., R. T. C. Lee, V. Gunalan and F. Eisenhaber (2013). “The highly pathogenic H7N3 avian influenza strain from July 2012 in Mexico acquired an extended cleavage site through recombination with host 28S rRNA.” Virology Journal 10(1): 139.

McCarthy, K. R., L. J. Rennick, S. Nambulli, L. R. Robinson-McCarthy, W. G. Bain, G. Haidar and W. P. Duprex (2021). “Recurrent deletions in the SARS-CoV-2 spike glycoprotein drive antibody escape.” Science: eabf6950.

Shiliaev, N., T. Lukash, O. Palchevska, D. K. Crossman, T. J. Green, M. R. Crowley, E. I. Frolova and I. Frolov (2021). “Natural isolate and recombinant SARS-CoV-2 rapidly evolve in vitro to higher infectivity through more efficient binding to heparan sulfate and reduced S1/S2 cleavage.” bioRxiv: 2021.06.28.450274.

Turakhia, Y., B. Thornlow, A. S. Hinrichs, N. De Maio, L. Gozashti, R. Lanfear, D. Haussler and R. Corbett-Detig (2021). “Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic.” Nature Genetics 53(6): 809-816.

Equivalent analysis on Omicron: the insertion has a high degree of homology to human TMEM245 mRNA.

Hey Tom - great analysis, as always! As for the insertion in Omicron, I think we have to be careful what we can and cannot conclude. A lot of people take this as meaning that Omicron has an “insertion of human origin”, which I don’t think we can conclude with any sort of confidence (although I think it’s a very plausible hypothesis).

What I think we can reasonably conclude is that the insertion is more likely to be host-derived than SARS-CoV-2 derived. However, given how short the stretch of homology is, what that host is exactly is unknowable IMO. Could certainly be human, but could also be a range of other things - including virus, bacteria, mammalian, etc.


I concur with Kristian’s cautionary note as to source for the Omicron insert, that actually seems from the alignment to be GCCAGAAGA (not that highlighted). In the -1 reading frame this would code for ARR, a VERY familiar tripeptide motif to those of us who study proteolytic, especially furin, cleavage sites in lots of viruses.

A near identical version 28621 GCCAGAAGc even appears downstream in the WT SARS-CoV-2 nucleocapsid protein around aa115, albeit out of frame. We have seen transposition of sequence by copy-choice, out of frame, already in SARS-CoV-2, with respect to the N gene. Obviously, in infected cells a lot of independent mRNAs for both of these proteins are being synthesized virtually side-by-side and simultaneously, while host-cell specific mRNAs are being suppressed.

The assertion that Omicron contains this host RNA insert is the kind of thing that can easily go “viral” in the internet sense, with all the usual agenda-driven misinterpretations that have plagued discussion of this virus. While I am a strong proponent of promiscuous copy-choice indels and mutations in Coronaviruses, anything that presents even a whiff of extra-Coronavirus RNA in a variant of concern could only engender yet more public mischief. Even though 17/17 is very suggestive, it contains common codons in multiple frames for common peptide motifs, and may not be all that unique among viral sources. Tread softly on the pedal, would be my opinion.

Bill Gallaher