Sarbecovirus comparative genomics elucidates gene content of SARS-CoV-2 and functional impact of COVID-19 pandemic mutations

We determined which of the proposed accessory ORFs in SARS-CoV-2 are true conserved protein-coding regions, using evolutionary signatures in 44 Sarbecovirus genomes.

Key conclusions:

  • ORFs 3a, 6, 7a, 7b, and 8 are conserved protein-coding genes (even though ORF8 has little nucleotide-level conservation).

  • ORFs 10 and 14 have not been under protein-coding constraint and are unlikely to produce functional proteins.

  • ORF9b is ambiguous.

  • We find strong additional evidence for a novel overlapping gene, ORF3c, near the 5’ end of ORF3a (proposed by Cagliani et al., who called it ORF3h).

  • We use Sarbecovirus conservation to classify mutations within the SARS-CoV-2 population. Notably, D614G, the spike-protein mutation that has been increasing in frequency in several geographies, lies in a stretch of 11 amino acids that are perfectly conserved among the 44 Sarbecovirus strains.
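
To illustrate what "a stretch of amino acids perfectly conserved among the strains" means operationally, here is a minimal Python sketch. It is not the classification procedure used in this work: it assumes a gap-free-or-gapped protein alignment given as equal-length strings, and the function names (is_conserved, conserved_run) and the toy sequences are hypothetical, made up for illustration only.

```python
from typing import List, Optional, Tuple


def is_conserved(alignment: List[str], col: int) -> bool:
    """Return True if every sequence has the same non-gap residue at `col`."""
    residues = {seq[col] for seq in alignment}
    return len(residues) == 1 and "-" not in residues


def conserved_run(alignment: List[str], col: int) -> Optional[Tuple[int, int]]:
    """Return (start, end), 0-based inclusive, of the maximal run of perfectly
    conserved columns containing `col`, or None if `col` is not conserved."""
    if not is_conserved(alignment, col):
        return None
    start = col
    while start > 0 and is_conserved(alignment, start - 1):
        start -= 1
    end = col
    last = len(alignment[0]) - 1
    while end < last and is_conserved(alignment, end + 1):
        end += 1
    return start, end


# Toy alignment: invented sequences, NOT real Sarbecovirus data.
toy_alignment = [
    "MKTAYLQDVNCT",
    "MRTAYLQDVNCS",
    "MKTAYLQDVNCT",
]

run = conserved_run(toy_alignment, 7)  # column 7 holds 'D' in every sequence
if run is not None:
    run_length = run[1] - run[0] + 1
    print(f"conserved run: columns {run[0]}-{run[1]}, length {run_length}")
```

Applied to a real alignment of the spike protein across the 44 strains, the same check at the alignment column corresponding to position 614 would report the surrounding conserved stretch; mapping SARS-CoV-2 coordinates to alignment columns is left out of this sketch.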

We have provided genome browser track hubs showing evolutionary protein-coding potential (PhyloCSF), synonymous constraint elements (FRESCo), and classification of SNVs by Sarbecovirus conservation.