We determined which of the proposed accessory ORFs in SARS-CoV-2 are true conserved protein-coding regions, using evolutionary signatures in 44 Sarbecovirus genomes.
Key conclusions:
- ORFs 3a, 6, 7a, 7b, and 8 are conserved protein-coding genes (even though ORF8 has little nucleotide-level conservation).
- ORFs 10 and 14 have not been under protein-coding constraint and are unlikely to produce functional proteins.
- ORF9b is ambiguous.
- There is strong additional evidence for a novel overlapping gene, ORF3c, near the 5’ end of ORF3a (proposed by Cagliani et al., who called it ORF3h).
- We use Sarbecovirus conservation to classify mutations within the SARS-CoV-2 population. In particular, D614G, the spike-protein mutation that has been increasing in frequency in several geographic regions, lies in a stretch of 11 amino acids that are perfectly conserved among the 44 Sarbecovirus strains.
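
To make the notion of a "perfectly conserved stretch" concrete, the following is a minimal sketch of how one might find the maximal run of identical, gap-free columns around a given position in a protein alignment. It is an illustration only, not the authors' pipeline: the function name `conserved_run` is hypothetical, and the three toy sequences are invented rather than taken from real Sarbecovirus spike proteins.

```python
from typing import List, Optional, Tuple


def conserved_run(alignment: List[str], pos: int) -> Optional[Tuple[int, int]]:
    """Return 0-based inclusive (start, end) of the maximal run of perfectly
    conserved, gap-free columns that contains column `pos`.

    Returns None if column `pos` itself is not perfectly conserved.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    if not 0 <= pos < length:
        raise IndexError("pos is outside the alignment")

    def is_conserved(col: int) -> bool:
        # A column counts as conserved if every row has the same residue
        # and that residue is not a gap character.
        residues = {row[col] for row in alignment}
        return len(residues) == 1 and "-" not in residues

    if not is_conserved(pos):
        return None

    # Extend left and right while neighbouring columns stay conserved.
    start = pos
    while start > 0 and is_conserved(start - 1):
        start -= 1
    end = pos
    while end < length - 1 and is_conserved(end + 1):
        end += 1
    return start, end


# Invented toy alignment (NOT real spike sequences).
toy_alignment = [
    "MKTAYQDVNCTEL",
    "MRTAYQDVNCTEV",
    "MKSAYQDVNCTEL",
]

run = conserved_run(toy_alignment, 6)
if run is not None:
    start, end = run
    print(f"column 6 lies in a conserved run of {end - start + 1} residues")
```

Applied to a real alignment of the 44 strains, the same check at the spike position of interest would report the length of the conserved stretch surrounding it.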
We have provided genome browser track hubs showing evolutionary protein-coding potential (PhyloCSF), synonymous constraint elements (FRESCo), and classification of SNVs by Sarbecovirus conservation.