nCoV-2019 codon usage and reservoir (not snakes v2)

A paper by Ji et al. suggesting that snakes might serve as a likely reservoir for the novel nCoV-2019 virus was recently published after accelerated review and widely circulated by the news media (https://onlinelibrary.wiley.com/doi/abs/10.1002/jmv.25682).

The authors' claim was based on the observation that the codon usage of nCoV-2019 was more similar to that of snakes than to that of the other potential hosts they investigated. However, this premise is incorrect: only rarely is the codon usage of a virus most closely matched to that of a known reservoir host.

To investigate the claim by Ji et al., and to build on the argument put forward by @david.l.robertson (nCoV's relationship to bat coronaviruses & recombination signals (no snakes) - no evidence the 2019-nCoV lineage is recombinant) that there is no data to support snakes as a likely reservoir for nCoV-2019, I calculated the codon usage of SARS-CoV (likely/known reservoir: bats), MERS-CoV (known reservoir: camels), and nCoV-2019 (reservoir: unknown), and investigated how closely each matched a range of different species (including known reservoir hosts).

I used the same codon tables as Ji et al., taken from the commonly used Kazusa codon table database (http://www.kazusa.or.jp/codon/) - however, several of these codon tables are out of date and, for some of the species investigated, severely undersampled. I also obtained codon tables for a larger set of species using the more comprehensive tables from “HIVE-CUT”: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1793-7. Raw “codon adaptation index” (CAI) was calculated for each virus sequence from SARS-CoV, MERS-CoV, and nCoV-2019 against each of the potential reservoir species using the command line version of CAICal (https://biologydirect.biomedcentral.com/articles/10.1186/1745-6150-3-38). The codon adaptation index is effectively a measure of how well the codon usage of a particular sequence matches that of a putative host. To normalize for GC content and AA usage, an expected CAI was also calculated for each species (note: it is unclear to me if Ji et al. also normalized their values).
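For readers unfamiliar with the measure, here is a minimal Python sketch of a raw CAI calculation as it is usually defined: the geometric mean, over the codons of a sequence, of each codon's count in the reference table relative to the most-used synonymous codon. It is not CAICal and does not reproduce the expected-CAI normalization mentioned above. The function names, the pseudocount of 0.5 for unseen codons, and the toy reference table are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))


def relative_adaptiveness(ref_counts, pseudocount=0.5):
    """Weight each codon by its count relative to the most-used synonym."""
    by_aa = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        by_aa[aa].append(codon)
    weights = {}
    for aa, synonyms in by_aa.items():
        if aa == "*" or len(synonyms) == 1:
            continue  # stop codons, Met and Trp offer no synonymous choice
        # Assumption: codons absent from the table get a small pseudocount
        # so that the logarithm below is always defined.
        counts = {c: ref_counts.get(c, 0) or pseudocount for c in synonyms}
        top = max(counts.values())
        for codon, n in counts.items():
            weights[codon] = n / top
    return weights


def cai(sequence, weights):
    """Geometric mean of codon weights over the informative codons."""
    seq = sequence.upper().replace("U", "T")
    logs = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        w = weights.get(seq[i:i + 3])
        if w is not None:  # skips Met, Trp, stops and ambiguous codons
            logs.append(math.log(w))
    if not logs:
        raise ValueError("no informative codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference table: every codon seen once, except the two
# lysine codons, where AAA is used twice as often as AAG.
toy_host = {codon: 1 for codon in CODONS}
toy_host.update({"AAA": 10, "AAG": 5})
toy_weights = relative_adaptiveness(toy_host)
toy_cai = cai("AAAAAG", toy_weights)  # one preferred and one less-used codon
```

A real analysis would read the reference counts from a Kazusa or HIVE-CUT table and compare the raw value against an expected CAI, as described above.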

As can be seen in Figure 1, while nCoV-2019 does indeed have a high CAI to several different snake species, the same is also true for both MERS-CoV and SARS-CoV, which have camels and bats as known reservoir species, respectively. In fact, both MERS-CoV and SARS-CoV sequences have higher normalized CAI values than nCoV-2019. In addition, nCoV-2019 (as well as MERS-CoV and SARS-CoV) also has high CAI values to hosts that are even more unlikely than snakes to serve as the reservoir, including several fungi. And no, fungi are not likely to have started the outbreak in Wuhan.

Figure 2 shows the most relevant host species, giving a clearer picture of the issues noted above.

In conclusion, the study by Ji et al. is flawed and there is no evidence for snakes being the reservoir for nCoV-2019. This does not mean that snakes couldn’t be the reservoir; however, there is currently no data to support this claim, and I find that hypothesis unlikely given that nCoV-2019 is closely related to SARS-CoV-like viruses circulating in bats. At this stage, we still do not know what the reservoir for nCoV-2019 is and how widespread it is (although bats seem likely). Finally, the premise that one would be able to use a single simple measure such as codon adaptation to identify reservoir hosts for novel viruses is incorrect.

A somewhat interesting finding from these analyses is that nCoV-2019 overall has a lower CAI than MERS-CoV and SARS-CoV to almost all species tested. I wouldn’t really read too much into that though - we have seen that for other virus genera in the past.

Codon tables as well as raw and normalized CAI values can be downloaded below.

codon_tables.hive_cut.zip (44.9 KB) codon_tables.kazusa.zip (20.3 KB) csv_files.zip (8.6 KB)
Fig1.pdf (3.1 MB) Fig2.pdf (1.8 MB)

Great analysis. I look forward to the newspaper headlines proclaiming a fungal origin for nCoV-2019 :rofl:

Bo Xu (http://evolve.zoo.ox.ac.uk/Evolve/Bo_Xu.html) has run the raw data from the JMedVirol paper through a PCA, which helps visualise what’s going on. If anyone wants the R code just shout.

Some notes from Bo on the interpretation of the correlation circle:
(1) Positively correlated codons are grouped together.
(2) Negatively correlated codons are on opposite sides of the plot origin.
(3) The distance between a codon and the origin measures the quality of the representation of that codon on the principal components (PCs). A codon that is far from the origin (i.e. close to the circle circumference) is important for interpreting those components; a poorly represented codon lies close to the center of the circle.
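The three notes above can be checked numerically with a small Python sketch. This is not Bo's R code: it is a standard-library-only illustration that standardizes each codon variable, takes the two leading eigenvectors of the correlation matrix by power iteration, and reports each codon's correlation with PC1 and PC2 (its coordinates on the correlation circle). The function names and the toy data are hypothetical, and every column is assumed to vary across species.

```python
import math
import random


def standardize(column):
    """Centre a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def top_eigenpair(matrix, iterations=500, seed=0):
    """Leading eigenvalue and eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    p = len(matrix)
    v = [rng.uniform(0.5, 1.5) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # nothing left to extract
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(a * b for a, b in zip(v, mv)), v


def correlation_circle(columns):
    """Map each variable to its correlations with PC1 and PC2."""
    names = list(columns)
    z = [standardize(columns[name]) for name in names]
    n, p = len(z[0]), len(names)
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    coords = {name: [] for name in names}
    for _ in range(2):
        lam, v = top_eigenpair(corr)
        lam = max(lam, 0.0)
        for i, name in enumerate(names):
            # Correlation of variable i with this PC: loading * sqrt(eigenvalue).
            coords[name].append(v[i] * math.sqrt(lam))
        # Deflate so the next pass finds the following component.
        corr = [[corr[i][j] - lam * v[i] * v[j] for j in range(p)]
                for i in range(p)]
    return {name: tuple(xy) for name, xy in coords.items()}


# Hypothetical usage values for four codons across five species.
toy_usage = {
    "AAA": [1, 2, 3, 4, 5],
    "GAA": [2, 4, 6, 8, 10],  # rises with AAA
    "AAG": [5, 4, 3, 2, 1],   # falls as AAA rises
    "GCC": [1, 3, 2, 5, 4],
}
circle = correlation_circle(toy_usage)
```

In this toy set AAA and GAA land on the same point, AAG lands diametrically opposite them, and no codon falls outside the unit circle, matching notes (1) to (3).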

Oli

PCA_species_contribution.pdf (6.3 KB) CorreCircle_codon&PC_contribution.pdf (8.5 KB) Euclidean distance.pdf (25.2 KB) Contribution_species_PC1&PC2.pdf (4.8 KB)

I completely agree that no information about possible hosts of a virus can emerge from any analyses purporting to show that the codon usage of a particular virus most closely resembles that of a particular vertebrate species.

All of these analyses are ill-founded, for a simple reason that has not yet been made clear.

These analyses do not recognise that there is no single set of values that represent human (or snake) codon usage. Take human genes as an example: the G+C content at third positions of codons in human genes ranges from around 30% to around 90%. In A+T-rich genes A- and T-ending codons are used predominantly for all amino acids where there is a choice; in G+C-rich genes G- and C-ending codons are used predominantly for all amino acids where there is a choice (1). This variation is not related in any simple way to the tissue(s) in which a gene is expressed, nor to the level of gene expression; rather, it seems to be largely correlated with the base composition of the region of chromosome in which the gene is located. The favoured theory is that G+C content reflects the overall rate of recombination in that chromosome region, and in particular that gene conversion events are biased towards generating Gs and Cs (2).

[As an example, the alpha and beta globin genes are expressed at similar levels, at the same time, in the same tissue. Values of G+C content at silent sites in the alpha and beta globin genes are 92% and 65%, respectively, meaning the two genes have very different codon usage patterns (3). The alpha globin gene is on chromosome 16, the beta globin gene is on chromosome 11.]

Thus, simple codon usage compilations (such as those at the Kazusa Codon Usage Database) do not capture this enormous within-species heterogeneity of codon usage patterns, and present only an average pattern that applies to few genes. There is far more difference in codon usage between different human genes than there is between the average human pattern and the average pattern for any other mammal (or indeed any other vertebrate). The apparent difference, in this case, between compiled mammal and snake codon usage values likely reflects the relatively small number of genes used for the snake compilation.
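To make the within-species point concrete, here is a minimal Python sketch that computes G+C content at third codon positions (GC3) gene by gene, rather than from a pooled table. GC3 counts every third position and is only a rough stand-in for the silent-site values quoted above. The function name and the two toy sequences are hypothetical; both toy sequences encode the same four amino acids (Ala-Lys-Leu-Asp).

```python
def gc3(cds):
    """Fraction of codons in a coding sequence that end in G or C."""
    seq = cds.upper().replace("U", "T")
    # Third base of every complete codon; a trailing partial codon is ignored.
    thirds = [seq[i + 2] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not thirds:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in thirds) / len(thirds)


# Two hypothetical genes encoding the same peptide with opposite codon choices.
toy_genes = {
    "at_rich_gene": "GCTAAATTAGAT",
    "gc_rich_gene": "GCCAAGCTGGAC",
}
gc3_by_gene = {name: gc3(seq) for name, seq in toy_genes.items()}
```

Averaging the two toy genes gives a GC3 of 0.5, a value that describes neither of them, which is the problem with a single per-species table.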

There is no evidence that codon usage in any vertebrate virus has been under selection pressure to adapt to its host. But even if it was thought that this might happen, we have no clear expectation as to which, among the enormously diverse patterns of codon usage seen within any vertebrate, the virus should be selected to resemble.

(1) Sharp et al. (1988) Nucleic Acids Research 16:8207-8211. doi.org/10.1093/nar/16.17.8207

(2) Duret & Galtier (2009) Annual Review of Genomics and Human Genetics 10:285-311. doi.org/10.1146/annurev-genom-082908-150001

(3) Sharp et al. (1993) Biochemical Society Transactions 21:835-841. doi.org/10.1042/bst0210835

Paul Sharp

Follow-up on the flawed logic of codon bias linking snakes to 2019-nCoV

In a similar vein to what Kristian Andersen’s group pointed out, we conducted related analyses illustrating that codon bias analyses do not provide evidence that snakes were hosts of the 2019-nCoV virus. The conclusion by Ji and colleagues that snakes were the most plausible reservoir for 2019-nCoV was based on their finding that the codon bias of snake genomes is more similar to that of this virus than is the codon bias of genomes from bats or other plausible hosts. As many of you on this forum have already pointed out, this is not a robust way to infer the host reservoir of a virus. By analyzing their data together with additional genomic codon bias data from other eukaryotes and coronaviruses, we further emphasize why this is a flawed inference.

We show that the inherent AT-bias of coronavirus genomes in general leads to spurious inferences linking snakes (which also exhibit particularly AT-rich genomes) as hosts of 2019-nCoV. We conducted expanded analyses of codon biases from multiple coronaviruses isolated from bats, together with 2019-nCoV, and from multiple additional snakes and other eukaryotes. These analyses underscore the lack of evidence from codon bias analyses for snakes as a host of 2019-nCoV:

Synonymous codon usage patterns in genomes of coronaviruses isolated from bat hosts are nearly identical to those of 2019-nCoV, indicating that, even if codon usage similarity were a valid method to determine the likely host of the virus, there is no evidence for 2019-nCoV having a pattern of codon usage divergent from that of viruses isolated from bat hosts (panel A). PCA of codon biases among eukaryotes and coronaviruses further illustrates that codon bias alone is uninformative for linking eukaryotic hosts to particular viruses (panels B and C), and fails to show that snakes are any more similar to 2019-nCoV than to other coronaviruses that were isolated from bats (panel C).

Coronaviruses tend to have relatively AT-rich genomes, which is linked to their highly AT-rich codon biases. Accordingly, there is a linear relationship in which AT-rich eukaryote genomes exhibit codon usage more similar to that of the AT-rich coronaviruses than do more GC-rich species. Many of these highly AT-rich genomes belong to implausible eukaryote hosts, but they are included to illustrate the point that codon bias simply links coronaviruses with highly AT-rich eukaryote genomes. As snake genomes are particularly AT-rich compared to those of other vertebrates, they inherently exhibited codon usage more similar to that of 2019-nCoV than did the more GC-rich genomes of mammals, which led to Ji et al.’s incorrect conclusion about snakes.
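The effect described above can be reproduced with toy numbers. The Python sketch below is not the analysis behind the attached figure: it computes relative synonymous codon usage (RSCU, a codon's observed count divided by the mean count of its synonyms) and the Euclidean distance between two RSCU profiles. The function names and all three codon-count tables are hypothetical, chosen only so that one "host" prefers A/T-ending codons and the other G/C-ending codons.

```python
import math
from collections import defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TO_AA = dict(zip(CODONS, AMINO))


def rscu(counts):
    """Observed codon count divided by the mean count of its synonyms."""
    by_aa = defaultdict(list)
    for codon, aa in CODON_TO_AA.items():
        by_aa[aa].append(codon)
    values = {}
    for aa, synonyms in by_aa.items():
        total = sum(counts.get(c, 0) for c in synonyms)
        if aa == "*" or len(synonyms) == 1 or total == 0:
            continue  # uninformative or unobserved amino acids are skipped
        for codon in synonyms:
            values[codon] = counts.get(codon, 0) * len(synonyms) / total
    return values


def rscu_distance(a, b):
    """Euclidean distance over codons present in both RSCU profiles."""
    shared = a.keys() & b.keys()
    if not shared:
        raise ValueError("profiles share no codons")
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in shared))


# Hypothetical counts for the Lys (AAA/AAG) and Asp (GAT/GAC) codons only.
toy_virus = {"AAA": 8, "AAG": 2, "GAT": 7, "GAC": 3}    # AT-rich virus
toy_at_host = {"AAA": 7, "AAG": 3, "GAT": 6, "GAC": 4}  # AT-rich genome
toy_gc_host = {"AAA": 2, "AAG": 8, "GAT": 3, "GAC": 7}  # GC-rich genome

d_at = rscu_distance(rscu(toy_virus), rscu(toy_at_host))
d_gc = rscu_distance(rscu(toy_virus), rscu(toy_gc_host))
```

Here `d_at` comes out smaller than `d_gc` purely because of base composition: nothing in the toy tables encodes any host relationship.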

Lastly, the snake species implicated by Ji et al. are also unlikely to naturally prey on bats. Both snake species are classical dietary specialists that prey almost exclusively on other snakes, and not on bats (as implied by Ji et al.).

-Todd Castoe & Blair Perry

__MultiFigDraft_v2_01.24.20.pdf (1.1 MB)
