Identification of a common deletion in the spike protein of SARS-CoV-2

Zhe Liu1,2, Huanying Zheng2, Runyu Yuan1,2, Mingyue Li3, Huifang Lin1,2, Jingju Peng1,2, Qianlin Xiong1,2, Jiufeng Sun1,2, Baisheng Li2, Jie Wu2, Ruben J.G. Hulswit4, Thomas A. Bowden4, Andrew Rambaut5, Nick Loman6, Oliver G Pybus4, Changwen Ke2, Jing Lu1,2

1 Guangdong Provincial Institution of Public Health, Guangzhou, China;
2 Guangdong Provincial Center for Disease Control and Prevention, Guangzhou, China;
3 Department of Rehabilitation Medicine, The Third Affiliated Hospital, Sun Yat-sen University, Guangzhou, China;
4 Department of Zoology, University of Oxford, Oxford, UK;
5 University of Edinburgh, UK;
6 Institute of Microbiology and Infection, University of Birmingham, UK

Correspondence to:
Jing Lu
Guangdong Provincial Institution of Public Health, 160 Qunxian Rd, Dashi Town, Panyu District, Guangdong Province, Guangzhou 514300, China; email:

In comparison with other betacoronaviruses, two notable features are identified in the SARS-CoV-2 genome: (1) the receptor binding domain (RBD) of SARS-CoV-2 is distinct from the most closely-related bat-origin virus (RaTG13) and is demonstrated to have a high affinity to human ACE2 receptor; (2) a unique insertion of 12 nucleotides (or four amino acids, PRRA) at the S1 and S2 boundary results in a polybasic (furin) cleavage site and three predicted O-linked glycans around the cleavage site 4.

With respect to the first feature, the similar RBD identified in a SARS-like virus from a pangolin suggests that the RBD in SARS-CoV-2 may already have existed in its potential animal host(s) before it was transmitted to humans 5. The remaining question is the history and function of the insertion at the S1/S2 boundary, which is uniquely identified in SARS-CoV-2.

Identification of deletions in SARS-CoV-2 spike protein
The first COVID-19 clinical case in Guangdong was reported on 19th January, with illness onset on 1st January 6. A bronchoalveolar lavage fluid (BALF) sample from this patient was collected and inoculated into Vero-E6 cells. The cell-isolated viral strain was obtained after three rounds of passage. Multiple sequencing methods were used for whole genome sequencing and validation of variants (Figure A), including multiplex PCR on the MiSeq platform (PE150), direct cDNA sequencing on the Nanopore platform and Sanger sequencing. After mapping to the SARS-CoV-2 reference genome (MN908947.3), we found two variants in the cell-isolated viral strain, with deletions at (1) 23585–23599, flanking the polybasic cleavage site, resulting in a QTQTN deletion in the spike protein (one amino acid before the polybasic cleavage site) and (2) 23596–23617, including the polybasic cleavage site and the 6 nucleotides 5’ of the cleavage site, resulting in an NSPRRAR deletion that includes the polybasic cleavage site (Figure A). To exclude possible errors caused by PCR amplification, both deletion variants were verified through direct cDNA sequencing on the ONT nanopore platform. Sanger sequencing with specific primers also identified heterozygous peaks, with distinct double peaks starting at position 23585 and triple peaks after that, highlighting the existence of multiple variants caused by the above two deletions (Figure B).
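The amino acid consequence of the first deletion (23585–23599) can be checked with a short script. The following is only an illustrative sketch: the 39-nucleotide fragment is assumed to correspond to positions 23582–23620 of MN908947.3 and should be verified against the deposited reference before reuse, and the helper names and the partial codon table are hypothetical, introduced here for illustration.

```python
# Assumed reference fragment, positions 23582-23620 of MN908947.3 (1-based).
# Verify against the deposited reference sequence before relying on it.
FRAGMENT_START = 23582
FRAGMENT = "TATCAGACTCAGACTAATTCTCCTCGGCGGGCACGTAGT"

# Partial codon table: only the codons that occur in this fragment.
CODONS = {
    "TAT": "Y", "CAG": "Q", "ACT": "T", "AAT": "N", "TCT": "S",
    "CCT": "P", "CGG": "R", "GCA": "A", "CGT": "R", "AGT": "S",
}


def delete_interval(seq, seq_start, del_start, del_end):
    """Remove reference positions del_start..del_end (1-based, inclusive)."""
    i = del_start - seq_start
    j = del_end - seq_start + 1
    return seq[:i] + seq[j:]


def translate(seq):
    """Translate an in-frame sequence using the partial table above."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(CODONS[seq[k:k + 3]] for k in range(0, len(seq), 3))


wild_type = translate(FRAGMENT)
var1 = translate(delete_interval(FRAGMENT, FRAGMENT_START, 23585, 23599))
print(wild_type)  # YQTQTNSPRRARS
print(var1)       # YSPRRARS
```

Under these assumptions, removing the 15 nucleotides leaves the reading frame intact and drops QTQTN, while the downstream PRRAR motif is retained.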

Figure. Deletion variants identified in SARS-CoV-2 cell strains. (A) High-throughput sequencing of the cell isolated strain (014) from the first SARS-CoV-2 patient (EPI 403934) in Guangdong, China. Representative reads mapping to the SARS-CoV-2 genome (MN908947.3 used as reference genome) showed two deletion variants. (B) Sanger sequencing of the 014 cell strains. The heterozygous peaks are highlighted with a red box and the sites with three distinct peaks are marked with *. (C) High-throughput sequencing showed the ratio of deletion variants in original clinical sample SF014 (P0) and cell strains after 3 rounds of cell passage (P1-3). (D) Phylogenetic tree of genome sequences of all 22 SARS-CoV-2 cell strains. The size of red dots is proportional to the ratio of Var1 (deletion at 23585–23599).

The deletion is commonly identified in cell isolated strains
To investigate whether the deletions described above are random mutations occasionally identified in a single strain or commonly arise after cell passage, we performed whole genome sequencing on the other 21 SARS-CoV-2 viral strains collected after 2 rounds of cell passage in Vero-E6 or Vero cells (Supplemental Table). The corresponding original samples for these strains were collected between 19th January and 28th February 2020. Multiplex PCR combined with nanopore sequencing was used, following the general protocol as described in ( The ARTIC pipeline was applied to trim primers and generate the bam files, which included all reads mapping to the SARS-CoV-2 reference genome (MN908947.3). Variant sites were called using iVar 7 with depth >=20 as a threshold. With this method, 10 of the 21 cell isolate strains carried variants at different ratios (>10%) with the deletion flanking the polybasic cleavage site (deletion at 23585–23599) (Figure C). One carried the variant with the deletion covering the polybasic cleavage site (deletion at 23596–23617). To find out whether the deletion at 23585–23599 was restricted to a specific genetic lineage, we next investigated the phylogenetic relationship of these strains and the first 014 strain described above. As shown in Figure D, the strains with a relatively higher ratio of this deletion were dispersed across the phylogenetic tree, suggesting that the deletion was not restricted to a specific genetic lineage of SARS-CoV-2.
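To illustrate how a deletion at 23585–23599 appears in mapped reads and how a ratio with a depth threshold can be derived, here is a minimal sketch in plain Python. It is not the ARTIC or iVar implementation. The function names and the example reads are hypothetical, and the only inputs assumed are each read's 1-based mapping position and CIGAR string as found in a SAM/BAM record.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
TARGET = (23585, 23599)  # deletion interval, 1-based inclusive


def reference_span_and_deletions(pos, cigar):
    """Walk a CIGAR string from the 1-based start position `pos`.

    Returns (ref_end, deletions), where deletions is a list of
    (start, end) reference intervals, 1-based inclusive.
    """
    ref = pos
    dels = []
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "D":
            dels.append((ref, ref + n - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += n
    return ref - 1, dels


def deletion_ratio(reads, target=TARGET, min_depth=20):
    """Fraction of spanning reads carrying exactly the target deletion.

    Returns None when fewer than min_depth reads span the target.
    """
    spanning = 0
    deleted = 0
    for pos, cigar in reads:
        ref_end, dels = reference_span_and_deletions(pos, cigar)
        if pos <= target[0] and ref_end >= target[1]:
            spanning += 1
            if target in dels:
                deleted += 1
    if spanning < min_depth:
        return None
    return deleted / spanning


# Hypothetical reads as (start, CIGAR) pairs: 6 with the deletion, 18 without.
reads = [(23500, "85M15D60M")] * 6 + [(23500, "160M")] * 18
ratio = deletion_ratio(reads)
print(ratio)  # 0.25
```

A real pipeline would also need to handle alternative placements of the same deletion in the alignment and base or mapping quality, which this sketch ignores.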

Screening for deletion variants in original clinical samples
To identify whether these deletions also occur in original clinical samples, we screened the high-throughput sequencing data from 149 clinical samples, which were collected between 6th February and 20th March in Guangdong, China. These samples were sequenced using multiplex PCR combined with nanopore sequencing. There were 68 SARS-CoV-2 genomes with average sequencing depth >=20 at the sites neighboring 23585. As shown in Table 1, variants with the deletion at 23585–23599 were found in 3 of these 68 clinical samples (4.4%), with ratios ranging from 8.8% to 32.8%, indicating that this deletion may also occur during in vivo infection, even though the rate was extremely low compared with the in vitro results (Figure D). To date, no genome sequences deposited in public datasets carry this deletion. However, this does not mean that the variant is absent from currently released sequences, since most variants present at a low ratio would be discarded when generating the final consensus sequence.

Table 1: Deletion variant (23585–23599) identified in clinical samples
Samples     REF_depth   ALT_depth   Del Variant Ratio
20SF5645    104         25          19.4%
ST-N3-D     82          8           8.8%
SZ-N16-D    256         125         32.8%
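The ratios in Table 1 appear to follow from the two depth columns as ALT_depth / (REF_depth + ALT_depth). That formula is an assumption inferred from the tabulated values rather than stated in the text, and the function name below is hypothetical. A minimal sketch:

```python
# Depths copied from Table 1 as (REF_depth, ALT_depth).
TABLE_1 = {
    "20SF5645": (104, 25),
    "ST-N3-D": (82, 8),
    "SZ-N16-D": (256, 125),
}


def deletion_variant_ratio(ref_depth, alt_depth):
    """Percentage of reads supporting the deletion (assumed formula)."""
    return 100.0 * alt_depth / (ref_depth + alt_depth)


ratios = {name: deletion_variant_ratio(ref, alt)
          for name, (ref, alt) in TABLE_1.items()}
for name, value in ratios.items():
    print(f"{name}\t{value:.2f}%")
```

With this formula the computed values agree with the tabulated ratios to within 0.1 percentage points.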

The spike protein of coronaviruses plays an important role in viral infectivity, transmissibility and antigenicity. Therefore, the genetic character of the spike protein in SARS-CoV-2 should shed light on its origin and evolution. For SARS-CoV-1, strong positive selection has been identified in the spike coding sequence 8, and deletions in another gene segment 9, at the early stage but not the late stage of the epidemic, suggesting that adaptive pressures operated on the SARS-CoV-1 genome at the beginning of the epidemic. This result also indicates that SARS-CoV-1 may not have been well established in the human population at the early stage, when it was first transmitted from an intermediate animal host. For SARS-CoV-2, the virus has shown high infectivity and efficient transmission among the human population since it was first identified 1. Genetic changes related to the viral fitness of SARS-CoV-2 require further epidemiological investigation and functional experiments.

Here, we used different sequencing methods to identify and verify a deletion at sites flanking the polybasic cleavage site. The deletion variants were detected in 3 of 68 clinical samples, but in half of the 22 in vitro isolated viral strains tested in this study. These data indicate that (1) the deletion of QTQTN may benefit SARS-CoV-2 replication or infection in vitro (Vero-E6 cells) but is likely to be under strong purifying selection in vivo, since it is rarely identified in clinical samples, and (2) there could be an efficient mechanism for deleting this region from the viral genome, as variants with the 23585–23599 deletion are commonly detected after two rounds of cell passage. Notably, a recently reported SARS-like strain, RmYN02, which is phylogenetically related to SARS-CoV-2, also has a deletion at the QTQT site 10. This raises another possible scenario: SARS-CoV-2-like viruses in animals may not have QTQTN in their spike protein, and a variant with this insertion arose upon virus transmission into humans. The mechanistic explanation and functional significance of these genomic changes in SARS-CoV-2 require further work. Nonetheless, this study provides valuable clues to aid further investigation of this remarkable evolutionary tale. The deletion mutation identified in vitro should also be noted for current vaccine development.

Data Availability
Metagenomic sequencing, multiplex PCR sequencing and direct cDNA sequencing data, after mapping to the SARS-CoV-2 reference genome (MN908947.3), have been deposited in the Genome Sequence Archive 11 in the BIG Data Center 12, Beijing Institute of Genomics (BIG), Chinese Academy of Sciences, under project accession number CRA002500, publicly accessible at The sample information and corresponding accession number for each sample are listed in the Supplemental Table.

This work was supported by grants from the Guangdong Provincial Novel Coronavirus Scientific and Technological Project (2020111107001) and the Science and Technology Planning Project of Guangdong (2018B020207006).


  1. Wu, F. et al. A new coronavirus associated with human respiratory disease in China. Nature 1–8 (2020) doi:10.1038/s41586-020-2008-3.
  2. WHO, (2020). Coronavirus disease (COVID-2019) situation reports.
  3. Cui, J., Li, F. & Shi, Z.-L. Origin and evolution of pathogenic coronaviruses. Nat Rev Microbiol 17, 181–192 (2019).
  4. Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C. & Garry, R. F. The proximal origin of SARS-CoV-2. Nat Med (2020) doi:10.1038/s41591-020-0820-9.
  5. Lam, T. T.-Y. et al. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature (2020) doi:10.1038/s41586-020-2169-0.
  6. Kang, M. et al. Evidence and characteristics of human-to-human transmission of SARS-CoV-2. medRxiv 2020.02.03.20019141 (2020) doi:10.1101/2020.02.03.20019141.
  7. Grubaugh, N. D. et al. An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biology 20, 8 (2019).
  8. The Chinese SARS Molecular Epidemiology Consortium. Molecular Evolution of the SARS Coronavirus During the Course of the SARS Epidemic in China. Science 303, 1666–1669 (2004).
  9. Muth, D. et al. Attenuation of replication by a 29 nucleotide deletion in SARS-coronavirus acquired during the early stages of human-to-human transmission. Scientific Reports 8, 1–11 (2018).
  10. Zhou, H. et al. A novel bat coronavirus reveals natural insertions at the S1/S2 cleavage site of the Spike protein and a possible recombinant origin of HCoV-19. bioRxiv 2020.03.02.974139 (2020) doi:10.1101/2020.03.02.974139.
  11. Wang, Y. et al. GSA: Genome Sequence Archive. Genomics Proteomics Bioinformatics 15, 14–18 (2017).
  12. National Genomics Data Center Members and Partners. Database Resources of the National Genomics Data Center in 2020. Nucleic Acids Res. 48, D24–D33 (2020).

Supplemental Table. Sample information and accession numbers for all sequencing data.
(Raw data are under review by the database and will be released on 1st April)
Sample name  Sequencing method     Cell strain  Passage   Accession no.
20SF5645     PCR+Nanopore          -            original  SAMC150972
ST-N3-D      PCR+Nanopore          -            original  SAMC150973
SZ-N16-D     PCR+Nanopore          -            original  SAMC150974
20SF014      Metagenomic           -            original  SAMC151281
029          PCR+Nanopore          Vero-E6      2         SAMC150975
112          PCR+Nanopore          Vero-E6      2         SAMC150976
107          PCR+Nanopore          Vero-E6      2         SAMC150977
115          PCR+Nanopore          Vero-E6      2         SAMC150978
1676         PCR+Nanopore          Vero-E6      2         SAMC150979
252          PCR+Nanopore          Vero-E6      2         SAMC150980
262          PCR+Nanopore          Vero-E6      2         SAMC150981
265          PCR+Nanopore          Vero-E6      2         SAMC150982
263          PCR+Nanopore          Vero-E6      2         SAMC150983
272          PCR+Nanopore          Vero-E6      2         SAMC150984
F2           PCR+Nanopore          Vero-E6      2         SAMC150985
F4           PCR+Nanopore          Vero-E6      2         SAMC150986
F5           PCR+Nanopore          Vero-E6      2         SAMC150987
028          PCR+Nanopore          Vero         2         SAMC150988
107          PCR+Nanopore          Vero         2         SAMC150989
115          PCR+Nanopore          Vero         2         SAMC150990
025          PCR+Nanopore          Vero-E6      2         SAMC150991
028          PCR+Nanopore          Vero-E6      2         SAMC150992
108          PCR+Nanopore          Vero-E6      2         SAMC150993
112          PCR+Nanopore          Vero         2         SAMC150994
108          PCR+Nanopore          Vero         2         SAMC150995
014/MiSeq    PCR+MiSeq             Vero-E6      3         SAMC150996
014/cDNA     Nanopore direct cDNA  Vero-E6      3         SAMC150997


The database says the sequences won’t be available until May 31, 2020.

basicInfo: PRJCA002455; CRA002500 ; The data under accession CRA002500 will be available on 2020-05-31

Hi, Brian,
My colleagues need to check that all the bam files are correctly matched to the right samples, but they will be released today as we said.

Excellent. Thank you very much! I am not sure I am reading the paper right, but it seems that you found deletions not only in the first patient, but also in samples (after culture on Vero cells) from other patients. I am trying to figure out if you think (or know) that there was heterogeneity of viruses in the early epidemic, or if you are sure this is repeated deletions occurring independently in cell cultures.

I don’t think the heterogeneity of viruses from different stages of the epidemic is related to this deletion. The deletion was commonly identified after culture on Vero-E6 cells, and the phylogeny shows it is not restricted to a specific genetic lineage.

The presence of inserts or deletions in consensus sequences or as variants of SARS-like coronaviruses is also observed in bovine coronavirus, also a member of betacoronavirus ( . For example, after passing 3 different naturally infected bovine nasal samples in different cell lines we observed the consensus sequences of many viral samples acquired a 12-nucleotide insert encoding 4 amino acids (Ser, Arg, Arg, Arg) located at nt 2737 of the spike gene (S2 subunit), whereas none of the unpassaged samples contained this insert at the consensus level. Further analysis identified other nonsynonymous mutations that were part of the insert genotype/variant sequence. Deep sequencing revealed that the insert genotype was present but very rare in the unpassaged samples but quickly became consensus after passage in cell culture.

Potential effects of the BCoV insert in the S2 subunit region of the spike gene include increased host cell range via trypsin-independent fusion host cell entry due to the creation of a furin cleavage site, and enhanced binding to heparan sulfate on the host cell surface due to the addition of a multibasic region.

Interestingly, a SARS-CoV construct (Watanabe et al., 2008) with a furin site at the SARS-CoV S2 position (793-KPTKR-797 to 793-KRRKR-797) shows SARS-CoV S activation at the cell surface in a trypsin-independent manner. Multiple sequence alignment of the BCoV insert region with that of other coronaviruses shows that the two arginines (R794 and R795) from the SARS-CoV construct overlap with arginines from the insert 912-SRRR-915 identified in passaged BCoV. This may indicate that multi-basic insertions commonly play a role in coronavirus host tropism, and that patterns in coronavirus evolution can be detected by study (including deep sequencing) of many different zoonotic coronaviruses.

One thought, without a closely related outgroup whether an InDel is considered an insertion or a deletion depends on the reference sequence used in the comparison. As sequencing of unpassaged viruses becomes more frequent as compared to sequencing of highly passaged reference strains this may become more evident.

In this case it is definitely a deletion. The earliest SARS-CoV-2 genomes have the poly-basic insertion as do the 1000s sequenced subsequently (from original sample, not isolate). The closest bat virus doesn’t have it, so it was an insertion at some point prior to the common ancestor of sampled SARS-CoV-2 genomes, with a deletion in these samples.

Have you seen any heterogeneity in the location of the deletion in any of the 10/21 cell culture isolates where you observed the deletion? We observed a cell culture-associated variant offset by +2 nucleotides (beginning at 23583) that nonetheless retains the proximal tyrosine.


Dave and Shelby O’Connor

Yes, you pointed out the error in this preliminary draft. The major deletion (var1) should be 23585-23599 (CAGACTCAGACTAAT; corresponding to amino acids QTQTN). You may get an alignment with the deletion 2 nt in front of this, since the two deletions give the same nucleotide alignment: CAGACTCAGACTAAT or ATCAGACTCAGACTA. However, the first one is more likely the case since it will not cause a frameshift mutation. Many thanks for pointing this out.

It seems I can not edit the post anymore. The correct position for these deletions should be:

Thanks Jing,

I don’t think that the +2 version encodes a frameshift either, as the resulting codon sequence is also ‘TAT’ encoding a tyrosine (thought that was pretty cool) when splicing the first nucleotide upstream of the deletion with the two nucleotides downstream of the deletion.
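The equivalence of the two placements can be checked directly. The sketch below assumes the same local reference fragment as quoted in this thread, taken to start at position 23582 of MN908947.3 (an assumption to verify against the reference). The helper name is hypothetical.

```python
# Assumed reference fragment starting at position 23582 (1-based).
FRAGMENT_START = 23582
FRAGMENT = "TATCAGACTCAGACTAATTCTCCTCGGCGGGCACGTAGT"


def delete_interval(seq, seq_start, del_start, del_end):
    """Remove reference positions del_start..del_end (1-based, inclusive)."""
    i = del_start - seq_start
    j = del_end - seq_start + 1
    return seq[:i], seq[i:j], seq[j:]


# Placement 1: CAGACTCAGACTAAT at 23585-23599.
left1, removed1, right1 = delete_interval(FRAGMENT, FRAGMENT_START, 23585, 23599)
# Placement 2: ATCAGACTCAGACTA at 23583-23597.
left2, removed2, right2 = delete_interval(FRAGMENT, FRAGMENT_START, 23583, 23597)

result1 = left1 + right1
result2 = left2 + right2

print(removed1, removed2)   # the two 15-nt strings quoted in the thread
print(result1 == result2)   # True: both placements give the same sequence
print(result2[:3])          # TAT, spliced from T + AT in placement 2
```

Because the bases flanking the deletion repeat, an aligner can place the same 15-nt gap at either position, and both placements leave a TAT codon in frame.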


dave and shelby

Thanks dave. It would be more clear if you can show the alignment figure.

Sure! How is this:


Published Version of this post:

In my “Tackling…” thread here on virological, I have identified this spot as a breakpoint for recombination or hypervariability, involving the CAGAC repeat that appears just prior to the S1/S2 boundary. I would note that this is a very unusual area in being replete with palindromic sequences in the original sense, of being readable as the same in both directions on the same strand.

ATCAGACTCAGACTA is a 15-mer palindrome, containing within it two CAGAC palindromes. Likewise, the insert itself that creates the furin site also contains two palindromes end to end. CTCCTC and GGCGG. This strains credulity as a coincidence.
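These palindromes, in the same-strand sense used here, are easy to confirm. A minimal sketch (function names are hypothetical), which also shows that the sequences are not palindromes in the more common reverse-complement sense:

```python
def is_same_strand_palindrome(seq):
    """True if the sequence reads identically in both directions on one strand."""
    return seq == seq[::-1]


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


sequences = ["ATCAGACTCAGACTA", "CAGAC", "CTCCTC", "GGCGG"]
same_strand = {s: is_same_strand_palindrome(s) for s in sequences}
rev_comp = {s: s == reverse_complement(s) for s in sequences}

for s in sequences:
    print(s, same_strand[s], rev_comp[s])

# Number of CAGAC copies inside the 15-mer.
cagac_copies = "ATCAGACTCAGACTA".count("CAGAC")
print(cagac_copies)  # 2
```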

I have a paper under review now, describing the “breakpoint sequence” hypothesis I have put forward over on the “Tackling…” thread here. There are certain places in the genome predisposed, by sequence or RNA structure or both, to be preferred sites of variability. Since coronaviruses have a proofreading function to protect their very long genomes from the destructive effects of mutation, they appear to adapt by copy-choice recombination - a lot – and there must be a way of exchanging out blocks of genetic information in a nonrandom way to protect functional domains. This S1/S2 site is the site of some of the greatest hypervariability in the virus family, in spite of it being a key site for endoproteolytic maturation of the S protein complex. However ironic it may seem, adapting to differences in host endoproteolytic enzymes may be a way of quickly adapting to new hosts. If so, it would not be surprising if adaptation to cell culture would rapidly select for such variants as these. It is a “hot spot” with a purpose.

Bill Gallaher

Hi Gallaher,
This is very enlightening. I think it is the most reasonable explanation for the quick and frequent emergence of the deletion variants. The ATCAGACTCAGACTA deletion is one amino acid ahead of the furin site, and I don’t know whether it will affect furin cleavage efficiency or not. I think the adaptation of this deletion variant may not be related to endoproteolytic enzymes. We saw the ratio of this deletion variant rise and fall in an animal infection model. Hope we can find the mechanism behind it. Thanks for sharing your idea.