SARS-CoV-2: don't ignore non-canonical genes

Zachary Ardern1*, Xinzhu Wei2, Chase W Nelson3*

  1. Institute for Biological Interfaces 5, Karlsruhe Institute of Technology, Karlsruhe, Germany
  2. Departments of Computer Science, Human Genetics, and Computational Medicine, University of California, Los Angeles, Los Angeles, CA, USA
  3. Institute for Comparative Genomics, American Museum of Natural History, New York City, NY, USA


Non-canonical genes have been largely ignored in emerging viruses. The genomes of viruses closely related to SARS-CoV-2 vary in both accessory and out-of-frame (i.e., overlapping) genes. However, unwarranted methodological assumptions often exclude these genes from consideration, despite their importance for virology, evolution, zoonosis, antigenic potential, vaccines, and therapeutics (Firth and Brierley 2012; Ho et al. 2021; Pavesi 2021).

Known or putative non-canonical genes in SARS-CoV-2 include out-of-frame genes ORF2b overlapping Spike; ORF3c, ORF3d, and ORF3b overlapping ORF3a; and ORF9b and ORF9c overlapping N (Firth 2020; Jungreis, Nelson, et al. 2021). Each of these genes shows evidence of translation for one or more isoforms from ribosome profiling (Finkel et al. 2021), mass spectrometry (Zecha et al. 2020), or HLA-I presentation (Nagler et al. 2021; Weingarten-Gabbay et al. 2021).

Among out-of-frame genes in SARS-CoV-2, ORF3b is an interferon antagonist (Konno et al. 2020) that elicits a substantial immunoglobulin G (IgG) response (Li et al. 2021) and is dramatically truncated in SARS-CoV-2 compared to SARS-CoV. Another, ORF3d (Nelson, Ardern, Goldberg, et al. 2020), elicits one of the strongest antibody responses observed in patient sera (Hachim et al. 2020), although it has been truncated in some lineages (Jungreis, Sealfon, and Kellis 2021). Confusingly, ORF3d has erroneously been referred to as ORF3b in many studies (documented in Jungreis, Nelson, et al. 2021). A third example, ORF9b (Figure 1), also elicits a strong IgG response (Li et al. 2021) and may contribute to the increased transmissibility of the SARS-CoV-2 Alpha (B.1.1.7) variant (Thorne et al. 2021).

Figure 1 | Translation of proteins N and ORF9b from alternative reading frames of the same locus in the SARS-CoV-2 genome. Non-canonical out-of-frame (i.e., overlapping) genes occur when one nucleotide sequence is translated in different reading frames to yield distinct proteins (same locus, different product). One example in SARS-CoV-2, shown here, is the translation of protein ORF9b from an alternative (+1) reading frame of the N (nucleocapsid) gene. Protein structures show the N-terminal domain of N (Peng et al. 2020; Protein Data Bank 7CDZ) and the ORF9b homodimer (Weeks et al. 2020; Protein Data Bank 6Z4U), visualized using Mol* Viewer (Sehnal et al. 2021). The nucleotides shown correspond to coordinates 28282-98 in the reference genome Wuhan-Hu-1 (NC_045512.2), where N begins at 28274 and ORF9b begins at 28284 (for full coordinates of overlapping genes in SARS-CoV-2, see Jungreis, Nelson, et al. 2021).
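
The frame relationship described in the Figure 1 caption can be checked with a few lines of Python. This is an illustrative sketch only. The two start coordinates come from the caption; the 25-nucleotide string is assumed to be the first 25 bases of N in Wuhan-Hu-1 and should be verified against NC_045512.2; and the names frame_offset and translate are hypothetical helpers, not part of any published tool.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Start coordinates quoted in the Figure 1 caption (1-based, Wuhan-Hu-1).
N_START = 28274
ORF9B_START = 28284

# ASSUMPTION: first 25 nt of N (positions 28274-28298); verify against NC_045512.2.
N_PREFIX = "ATGTCTGATAATGGACCCCAAAATC"


def frame_offset(reference_start, overlapping_start):
    """Reading frame of the overlapping gene relative to the reference gene (0, 1, or 2)."""
    return (overlapping_start - reference_start) % 3


def translate(seq, start=0):
    """Translate complete codons of seq, beginning at 0-based index start."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


if __name__ == "__main__":
    print(frame_offset(N_START, ORF9B_START))            # 1, i.e. the +1 frame
    print(translate(N_PREFIX))                           # MSDNGPQN (N frame)
    print(translate(N_PREFIX, ORF9B_START - N_START))    # MDPKI (ORF9b frame)
```

The same nucleotides therefore yield two unrelated peptides depending only on where translation begins, which is why an annotation that records a single reading frame per locus cannot represent ORF9b.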

Given the above, it is unfortunate that, to date, not a single out-of-frame gene has been annotated in the SARS-CoV-2 reference genome, Wuhan-Hu-1 (NC_045512.2). As a consequence, these genes are generally excluded from genomic, laboratory, and clinical analyses. Other frequently neglected accessory genes in SARS-CoV-2 include ORF6, ORF7a, ORF7b, ORF8, and the disputed ORF10.

Non-canonical genes are also documented in other pandemic viruses. This includes HIV-1, where the out-of-frame asp is expressed and integrated into the viral envelope (Affram et al. 2019) and is associated with pandemic spread (Cassan et al. 2016). Other examples come from such disparate viruses as influenza (Machkovech et al. 2019), betaherpesvirus (Finkel et al. 2020), and Zika virus (Irigoyen et al. 2017). One powerful approach for detecting such genes is ribosome profiling, which identifies actively translated mRNA fragments protected by ribosomes (i.e., ribosome footprints) (Stern-Ginossar 2015). Such new techniques for studying gene function provide opportunities for more inclusive studies of gene repertoire, particularly when characterizing newly emerged viruses.

Non-canonical genes demand a rethink of viral genome annotation and molecular biology. For example, requiring evolutionary conservation between virus lineages (Jungreis, Sealfon, and Kellis 2021) necessarily dismisses genes unique to one lineage (Nelson, Ardern, Goldberg, et al. 2020). Evolutionary and translatomic analyses of individual lineages (e.g., SARS-CoV-2 vs. SARS-CoV) together enable a more comprehensive understanding than standard methods based on codon usage, ORF length, or deep conservation (Nelson, Ardern, and Wei 2020). Indeed, non-canonical gene products interact with host cells and contribute to clinical outcomes, as demonstrated by ORF3b and ORF9b. We must stop neglecting non-canonical genes.


We thank Noam Stern-Ginossar for feedback on the text and Ming-Hsueh (Mitch) Lin for feedback on the figure.


Affram Y, Zapata JC, Gholizadeh Z, Tolbert WD, Zhou W, Iglesias-Ussel MD, Pazgier M, Ray K, Latinovic OS, Romerio F. 2019. The HIV-1 antisense protein ASP is a transmembrane protein of the cell surface and an integral protein of the viral envelope. J. Virol. 93:e00574-19.

Cassan E, Arigon-Chifolleau A-M, Mesnard J-M, Gross A, Gascuel O. 2016. Concomitant emergence of the antisense protein gene of HIV-1 and of the pandemic. Proc. Natl. Acad. Sci. U. S. A. 113:11537–11542.

Finkel Y, Mizrahi O, Nachshon A, Weingarten-Gabbay S, Morgenstern D, Yahalom-Ronen Y, Tamir H, Achdout H, Stein D, Israeli O, et al. 2021. The coding capacity of SARS-CoV-2. Nature 589:125–130.

Finkel Y, Schmiedel D, Tai-Schmiedel J, Nachshon A, Winkler R, Dobesova M, Schwartz M, Mandelboim O, Stern-Ginossar N. 2020. Comprehensive annotations of human herpesvirus 6A and 6B genomes reveal novel and conserved genomic features. eLife 9:e50960.

Firth AE. 2020. A putative new SARS-CoV protein, 3c, encoded in an ORF overlapping ORF3a. J. Gen. Virol. 101:1085–1089.

Firth AE, Brierley I. 2012. Non-canonical translation in RNA viruses. J. Gen. Virol. 93:1385–1409.

Hachim A, Kavian N, Cohen CA, Chin AWH, Chu DKW, Mok CKP, Tsang OTY, Yeung YC, Perera RAPM, Poon LLM, et al. 2020. ORF8 and ORF3b antibodies are accurate serological markers of early and late SARS-CoV-2 infection. Nat. Immunol. 21:1293–1301.

Ho JSY, Zhu Z, Marazzi I. 2021. Unconventional viral gene expression mechanisms as therapeutic targets. Nature 593:362–371.

Irigoyen N, Dinan AM, Meredith LW, Goodfellow I, Brierley I, Firth AE. 2017. The translational landscape of Zika virus during infection of mammalian and insect cells. bioRxiv.

Jungreis I, Nelson CW, Ardern Z, Finkel Y, Krogan NJ, Sato K, Ziebuhr J, Stern-Ginossar N, Pavesi A, Firth AE, et al. 2021. Conflicting and ambiguous names of overlapping ORFs in the SARS-CoV-2 genome: a homology-based resolution. Virology 558:145–151.

Jungreis I, Sealfon R, Kellis M. 2021. SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nat. Commun. 12:2642.

Konno Y, Kimura I, Uriu K, Fukushi M, Irie T, Koyanagi Y, Sauter D, Gifford RJ, USFQ-COVID19 Consortium, Nakagawa S, et al. 2020. SARS-CoV-2 ORF3b is a potent interferon antagonist whose activity is increased by a naturally occurring elongation variant. Cell Rep. 32:108185.

Li Y, Xu Z, Lei Q, Lai D-Y, Hou H, Jiang H-W, Zheng Y-X, Wang X-N, Wu J, Ma M-L, et al. 2021. Antibody landscape against SARS-CoV-2 reveals significant differences between non-structural/accessory and structural proteins. Cell Rep. 36:109391.

Machkovech HM, Bloom JD, Subramaniam AR. 2019. Comprehensive profiling of translation initiation in influenza virus infected cells. PLoS Pathog. 15:e1007518.

Nagler A, Kalaora S, Barbolin C, Gangaev A, Ketelaars SLC, Alon M, Pai J, Benedek G, Yahalom-Ronen Y, Erez N, et al. 2021. Identification of presented SARS-CoV-2 HLA class I and HLA class II peptides using HLA-peptidomics. Cell Rep. 35:109305.

Nelson CW, Ardern Z, Goldberg TL, Meng C, Kuo C-H, Ludwig C, Kolokotronis S-O, Wei X. 2020. Dynamically evolving novel overlapping gene as a factor in the SARS-CoV-2 pandemic. eLife 9:e59633.

Nelson CW, Ardern Z, Wei X. 2020. OLGenie: Estimating natural selection to predict functional overlapping genes. Mol. Biol. Evol. 37:2440–2449.

Pavesi A. 2021. Origin, evolution and stability of overlapping genes in viruses: a systematic review. Genes 12:809.

Peng Y, Du N, Lei Y, Dorje S, Qi J, Luo T, Gao GF, Song H. 2020. Structures of the SARS-CoV-2 nucleocapsid and their perspectives for drug design. EMBO J. 39:e105938.

Sehnal D, Bittrich S, Deshpande M, Svobodová R, Berka K, Bazgier V, Velankar S, Burley SK, Koča J, Rose AS. 2021. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res. 49:W431–W437.

Stern-Ginossar N. 2015. Decoding viral infection by ribosome profiling. J. Virol. 89:6164–6166.

Thorne LG, Bouhaddou M, Reuschl A-K, Zuliani-Alvarez L, Polacco B, Pelin A, Batra J, Whelan MVX, Ummadi M, Rojc A, et al. 2021. Evolution of enhanced innate immune evasion by the SARS-CoV-2 B.1.1.7 UK variant. bioRxiv.

Weeks SD, De Graef S, Munawar A. 2020. X-ray crystallographic structure of Orf9b from SARS-CoV-2.

Weingarten-Gabbay S, Klaeger S, Sarkizova S, Pearlman LR, Chen D-Y, Gallagher KME, Bauer MR, Taylor HB, Dunn WA, Tarr C, et al. 2021. Profiling SARS-CoV-2 HLA-I peptidome reveals T cell epitopes from out-of-frame ORFs. Cell 184:3962-3980.e17.

Zecha J, Lee C-Y, Bayer FP, Meng C, Grass V, Zerweck J, Schnatbaum K, Michler T, Pichlmair A, Ludwig C, et al. 2020. Data, reagents, assays and merits of proteomics for SARS-CoV-2 research and testing. Mol. Cell. Proteomics 19:1503–1522.

Hi Zachary, Xinzhu and Chase.

First of all I wanted to thank you for bringing up this important point. While it is unlikely that all of these genes are functional/expressed, I agree that there is both computational and experimental evidence that some do have a function.

This is why we (Virus Pathogen Resource) decided to add our own annotation to the SARS-CoV-2 genomes in our database, in addition to that provided by GenBank. Our annotation program (VIGOR4) also annotates ORF3c, ORF3d, ORF3b, ORF9b, and ORF9c. See image below.

We have not added ORF2b as we did not feel there was enough evidence (at the time) to support it, but we can revisit that if there is user demand.

You can find VIGOR4 annotated proteins at

We are still a little behind NCBI as data keeps growing exponentially, but should be caught up by the end of the month. Obviously we can only annotate publicly available genomes, but if anyone would like to annotate their own, they can either download our annotation program here or use the web interface here.

Any and all feedback is welcome!


Hi Anna,
thank you for this! VIGOR4 is easy to use, which is great, and the output table is very clear. I’m glad you’ve annotated all of these.

I tested it and noticed that the result for ORF3d says “no evidence of expression”, but I think there is evidence of expression at least for the shorter form of the ORF, which we term “ORF3d-2” - see Finkel et al. “The coding capacity of SARS-CoV-2” Supp. Tables 4 & 5. Likewise for ORF10. In any case, judging from these data, the evidence for expression of the excluded ORF2b (S.iORF1 or 2) appears to be stronger than for these ORFs.

Of course, what counts as sufficient evidence will be debatable, and all of these ORFs deserve further study! For such ORFs inferred to be weakly expressed on the basis of ribosome profiling data (or other lines of evidence such as immunological or evolutionary data, as e.g. mentioned in our eLife paper), I would personally recommend a weaker claim than “no evidence”, such as “evidence for expression ambiguous” or “expression status ambiguous”.



Greetings @AMNiewiadomska! Thanks very much for sharing this tremendous resource, VIGOR4. I too tried it on Wuhan-Hu-1 and was delighted by its ease of use, in general — and its accurate identification of ORF3b, in particular! I’m excited to employ this in the future.

I agree with @ZacharyArdern about the language regarding expression (but, of course, I’m biased!). Taking ORF3d/-2 as an example, even beyond the ribosome profiling data of Finkel et al. which we re-analyzed in the eLife paper, I personally feel that the presence of a strong antibody response in convalescent sera documented by Hachim et al. (where it is called ORF3b) constitutes at least some degree of evidence for expression.

For similar reasons, I’d also vote for inclusion of all proposed non-canonical genes in your software. It could certainly be noted where evidence is dubious, but in my view it would be maximally beneficial to use your tool without needing to subsequently find any ORFs manually (e.g., ORF2b). The user, who should presumably have some familiarity with SARS-CoV-2 genes, could then choose which if any to exclude.

One issue I’m sure you’ve encountered is that, even with such a great resource as this, many (most?) researchers understandably still rely on the gene annotations of the most-used reference genome, Wuhan-Hu-1 (NC_045512), without further investigation of gene repertoire. As one example, I recently reviewed a study which failed to consider ORF9b, with the understandable justification that it is not annotated in the aforementioned reference genome. This then led to interpreting certain evidence relevant to ORF9b as instead pertaining to the more dubious ORF10, simply because the latter is annotated. Hopefully we can inch toward the best of both worlds, with people adopting resources like yours, but also with the primary reference genome being updated!

Thanks a ton!


Thank you both for the input. I can make a quick edit to the SARS-CoV-2 annotation database to add in ORF2b and ORF3d/-2 for the CLI; however, changes to the database and GUI would take longer.
I agree that overall annotation in public databases may need to be revisited, but that may be more difficult to get consensus on.

I’ll do a little more reading over the next few days and discuss with my team members. I haven’t seen any new literature on transcription/expression for most of these ORFs (other than what is cited), so if you know of anything else please let me know. Data on ORF2b seem to be particularly scarce.

And while I’m here, what do you think of this paper? It’s the only one I’ve seen that shows any evidence of ORF10 expression: “A plasmid DNA-launched SARS-CoV-2 reverse genetics system and coronavirus toolkit for COVID-19 research”.


Thanks very much for your kind response and help, Anna! I’m so excited that ORF2b might be incorporated into the annotator.

Regarding a list of non-canonical overlapping ORFs, the latest of which I’m aware is the one we compiled in Jungreis, Nelson, et al. (2021) Table 1 (i.e., ORF2b, ORF3c, ORF3d/-2, ORF3b, ORF9b, and ORF9c).

Regarding ORF2b, I’m aware of the following evidence:

  1. ribosome profiling (Finkel et al. 2021)
  2. HLA-I presentation (Weingarten-Gabbay et al. 2021; Nagler et al. 2021)
  3. purifying selection between early human isolates of SARS-CoV-2 (Nelson, Ardern, Goldberg, et al. 2020)

Thanks for sharing this study by Rihn et al., which I had not seen. I do not have the expertise to evaluate its molecular biology aspects, but if correct, it would constitute the first compelling evidence that I, like you, have seen for ORF10 expression. (We didn’t detect purifying selection between human SARS-CoV-2 isolates for ORF10, but were probably underpowered to do so.) They also make the good point that other coronaviruses (e.g., SARS-CoV) lack ORF10, so cross-reactivity is unlikely to explain the finding. Although it is not evidence of expression, you might check out the supplement of Gordon et al. (2020), which lists host proteins that interact with ORF10. (However, note that this paper refers to ORF3d as ORF3b.)

Hope that’s helpful, and thanks again!
