Response to “On the origin and continuing evolution of SARS-CoV-2”

We are pleased to see the addendum added to the Tang et al. manuscript in National Science Review, which we have copied below for reference. However, we note that the online abstract, which will be by far the most widely read part of the paper, is unchanged in the strength of its claims: “On the other hand, the S type, which is evolutionarily older and less aggressive”.

On the specific methodological comment on our purifying selection analysis:
To investigate the origin of the estimates of nonsynonymous to synonymous site ratios given by Tang et al. above, we ran the PAML software they cited (Yang 2007) using two different models. These two models differ in how they estimate codon frequencies: the 1x4 model uses average base frequencies across sites, whereas the codon frequency model uses the observed counts of each codon in the alignment. The estimates of the ratio of nonsynonymous to synonymous sites that these models produced ranged from 2.76 to 3.75. All models for these count estimates are fundamentally wrong, as they are approximations using limited data. Even the gold standard for generating these data, mutation accumulation experiments, is limited by the difficulty of observing lethal mutations. However, given that PAML uses a more powerful maximum likelihood framework and has been cited over 5,000 times (as of 16 March 2020), we will happily use these estimated ratios rather than our own (2.43). Both of these ratios from PAML would produce a significant chi-squared test on our count data in Table 1 of the original post (P<0.036).
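To illustrate how the choice of site ratio feeds into such a test, here is a minimal sketch of a two-class chi-squared goodness-of-fit test. It is an assumption about the form of the test, not a reproduction of our analysis: the function name, and above all the observed counts, are hypothetical placeholders and are not the values from Table 1. Only the three site ratios (2.43, 2.76, 3.75) are taken from the text.

```python
import math


def chi2_site_ratio_test(obs_nonsyn, obs_syn, site_ratio):
    """Chi-squared goodness-of-fit test with 1 degree of freedom.

    Under neutrality, mutations are expected to fall into the
    nonsynonymous and synonymous classes in proportion to the number
    of available sites, i.e. site_ratio : 1.
    """
    total = obs_nonsyn + obs_syn
    exp_nonsyn = total * site_ratio / (1.0 + site_ratio)
    exp_syn = total - exp_nonsyn
    stat = ((obs_nonsyn - exp_nonsyn) ** 2 / exp_nonsyn
            + (obs_syn - exp_syn) ** 2 / exp_syn)
    # For 1 degree of freedom the upper-tail probability of the
    # chi-squared distribution is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical mutation counts, for illustration only
# (NOT the counts from Table 1 of the original post).
observed_nonsyn, observed_syn = 80, 50

# Site ratios quoted in the text: our own estimate and the PAML range.
for ratio in (2.43, 2.76, 3.75):
    stat, p_value = chi2_site_ratio_test(observed_nonsyn, observed_syn, ratio)
    print(f"site ratio {ratio}: chi2 = {stat:.2f}, P = {p_value:.4g}")
```

For a fixed set of counts with fewer nonsynonymous mutations than expected, a larger assumed site ratio raises the expected nonsynonymous count and so strengthens the evidence for purifying selection, which is why the estimated ratio matters for the conclusion.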

We thank the authors for bringing this to our attention. We therefore agree it is a fair conclusion that there is significant evidence of purifying selection, which is filtering out nonsynonymous mutations before they can be observed in the outbreak. We would note that this is a subtly different result from that in the original Tang et al. paper, which suggested evidence of purifying selection suppressing the frequency of the mutations observed in the outbreak. Our criticism of that analysis remains unchanged.

Tang et al.’s addendum, for reference:
“In our recent publication, we showed that among circulating SARS-CoV-2 (with 103 genomes analyzed) two different viral genomes co-exist. We identified them as lineages L and S. The concerned amino acid we used to define the L and S lineages is located in ORF8 (open reading frame 8), which plays a yet undefined role in the viral life cycle. Based on the finding that “L” lineage has a higher frequency than lineage S, we described the L lineage as aggressive. We now recognize that within the context of our study the term “aggressive” is misleading and should be replaced by a more precise term “a higher frequency”. In short, while we have shown that the two lineages naturally co-exist, we provided no evidence supporting any epidemiological conclusion regarding the virulence or pathogenicity of SARS-CoV-2. By saying so, corrections will be made in the print version of this paper to avoid being misleading.”

Reference
Yang, Ziheng. “PAML 4: phylogenetic analysis by maximum likelihood.” Molecular Biology and Evolution 24.8 (2007): 1586-1591.