The initial variant counts in the three datasets are:
Table 4. Initial variant counts
|Dataset||Total variants||Total sites||Total samples|
The variant lists and corresponding counts were generated with a minimum allele-frequency threshold of 0.05 and a minimum number of variant-supporting reads of 10. For a variant to be listed in the reports it has to surpass these thresholds in at least one sample of the respective dataset.
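The reporting rule above can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: the record layout `(sample, variant, af, alt_reads)` and the function name are hypothetical, and the thresholds are treated as inclusive minimums.

```python
from collections import defaultdict

MIN_AF = 0.05        # minimum allele frequency (from the text)
MIN_ALT_READS = 10   # minimum number of variant-supporting reads (from the text)

def reportable_variants(calls):
    """Map each listed variant to the samples in which it passes both thresholds.

    `calls` is an iterable of (sample, variant, af, alt_reads) tuples;
    this layout is an assumption made for illustration.
    """
    passing = defaultdict(set)
    for sample, variant, af, alt_reads in calls:
        if af >= MIN_AF and alt_reads >= MIN_ALT_READS:
            passing[variant].add(sample)
    return dict(passing)

# Toy records, not real data.
calls = [
    ("s1", "23403:A>G", 0.98, 410),
    ("s1", "7507:A>C", 0.04, 25),   # AF below threshold in s1
    ("s2", "7507:A>C", 0.07, 12),   # passes in s2, so the variant is listed
    ("s2", "100:C>T", 0.30, 6),     # too few supporting reads
]
listed = reportable_variants(calls)
```

A variant is listed as soon as one sample passes, which is why `7507:A>C` is kept despite failing in `s1`.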
We estimate that for variant calls with an allele-frequency at the chosen threshold of 0.05 our analyses have a false-positive rate of < 15% for both Illumina RNAseq and Illumina Artic data, while the true-positive rate of calling such very low-frequency variants is around 80% and approaches 100% for variants with an AF >= 0.15. This estimate is based on an initial application of the Illumina RNAseq and Illumina Artic workflows to two samples for which data of both types had been obtained at the virology department of the University of Freiburg and the assumption that variants supported by both sets of sequencing data are true variants. The second threshold of ten variant-supporting reads is applied to ensure that calculated allele-frequencies are reliable for all variants.
Because a fraction of our called variants is undoubtedly erroneous, we wanted to be conservative and eliminate questionable sites based on their frequency of occurrence in each dataset. As a simple model, we assume that a fraction of low AF variants are random errors, modeled by a simple Poisson distribution with per-site error rate λ. We then tabulate, for each position in the genome, the number of samples that contain a variant with 5% ≤ AF ≤ 50%, infer λ using a closed form ML estimator (which is simply the mean of per-base counts), and plot the observed number of genome positions with N=0,1,2... low frequency variants (red) vs the Poisson prediction (black). In all three datasets, the observed distributions have “fat tails”, and the point where the predicted Poisson distribution clearly diverges from the observation can be taken as the error-vs-real threshold.
"COG-pre" Error threshold is 2 or fewer samples (estimated error rate 3.39e-6 per base)
"COG-post" Error threshold is 2 or fewer samples (estimated error rate 5.05e-6 per base)
Figure 3. Observed number of genome positions for low frequency variants (red) versus the Poisson prediction (black).
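The Poisson comparison described above can be sketched as follows, under stated assumptions: the input is a hypothetical mapping from genome position to the number of samples carrying a variant with 5% ≤ AF ≤ 50% at that position, and positions absent from the mapping count as zero. The toy numbers are not taken from any of the three datasets.

```python
import math
from collections import Counter

def poisson_pmf(n, lam):
    """Probability of observing exactly n events under Poisson(lam)."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

def observed_vs_poisson(per_site_counts, genome_length):
    """Return (lam, rows), where rows are (N, observed positions, expected positions)."""
    nonzero = {pos: c for pos, c in per_site_counts.items() if c > 0}
    # ML estimate of lambda: the mean count over all genome positions.
    lam = sum(nonzero.values()) / genome_length
    observed = Counter(nonzero.values())
    observed[0] = genome_length - len(nonzero)
    rows = [
        (n, observed.get(n, 0), genome_length * poisson_pmf(n, lam))
        for n in range(max(observed) + 1)
    ]
    return lam, rows

# Toy genome of 1000 positions: three singletons and one site seen in 5 samples.
genome_length = 1000
counts = {10: 1, 20: 1, 30: 1, 40: 5}
lam, rows = observed_vs_poisson(counts, genome_length)
for n, obs, exp in rows:
    print(n, obs, f"{exp:.3g}")
```

In the toy case the site seen in five samples sits far out in the tail where the Poisson expectation is essentially zero, which is the kind of divergence used to set the error-vs-real threshold.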
After this filtering the final number of variants for subsequent analysis was as follows:
Table 5. Variant counts after applying above thresholds
|Dataset||Total variants||Total sites||Total samples|
All subsequent analyses were performed on variants occurring in ≥ 3 samples in the “Boston” dataset and in ≥ 2 samples in the COG-Pre and COG-Post datasets.
Categories of variants
Variants can be broadly divided into three categories based on their allele frequency (AF) in the host.
- Fixed (AF≥80%). These are the variants that are (nearly) fixed in the within-host population, and would appear in the consensus genome as differences from the reference. These variants are not of primary interest here. These are the most abundant and common variants in our samples (Tables 4 and 5), and every sample has at least 2 such variants.
- Rare (AF<10%). These variants appear within a host at low frequencies and would not propagate to whole genome assemblies. They could represent genuine intra-host variation, i.e., positions in the genome that are subject to selection, indel hotspots, etc., or sequencing and experimental artifacts or errors. Low frequency variants are of particular interest here, because they can only be detected via NGS analyses. These variants are relatively common, with a large degree of heterogeneity between samples: some have no low frequency variants, while others have >20.
- Intermediate (10%≤AF<80%). This is perhaps the most interesting group of variants, because they might arise during selective sweeps within the host, or during multiple infections, if occurring in combination with others.
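The three AF categories above translate directly into a small classifier. The sketch below assumes AF is given as a fraction in [0, 1]; the `(sample, variant, af)` record layout and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def af_category(af):
    """Classify an intra-host allele frequency given as a fraction in [0, 1]."""
    if af >= 0.80:
        return "fixed"
    if af >= 0.10:
        return "intermediate"
    return "rare"

def per_sample_tally(calls):
    """Count variants per category for each sample.

    `calls` is an iterable of (sample, variant, af) tuples (assumed layout).
    """
    tally = defaultdict(Counter)
    for sample, _variant, af in calls:
        tally[sample][af_category(af)] += 1
    return {sample: dict(counts) for sample, counts in tally.items()}

# Toy records, not real data.
calls = [
    ("s1", "a", 0.95), ("s1", "b", 0.07), ("s1", "c", 0.06),
    ("s2", "a", 0.99), ("s2", "d", 0.40),
]
tally = per_sample_tally(calls)
```

Per-sample tallies of this kind are what a per-category summary (number of samples, variants per sample, minimum and maximum) would be computed from.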
Table 6. Summary statistics of three categories of intra-host variants. In each cell the values are:
Unique samples ( variants per sample [min variants per sample, max variants per sample] )
|Dataset||[0%; 10%)||[10%; 80%)||[80%; 100%]|
|“Boston”||578 (5.5 [1, 29])||204 (2.7 [1, 17])||639 (7.2 [1, 13])|
|COG-Pre||303 (4.4 [1, 35])||472 (3.9 [1, 24])||499 (6.4 [1, 14])|
|COG-Post||569 (3.2 [1, 36])||1,797 (3.8 [1, 28])||1,818 (14.9 [3, 34])|
We further classify variants into five types of sequence changes they create in the viral genome:
- Stop changes that introduce premature stop codons
- Non-coding changes outside the coding region (3’ and 5’ regions of the genome)
Figure 4. Distribution of variant counts by allele frequency. Numbers in parentheses are counts of distinct sites.
For all datasets, non-synonymous variants are the most common, followed by synonymous variants, non-coding, and stops.
In terms of the “kind” of sequence change, the variants were SNPs (no MNPs) and indels:
Table 7. Types of variants (# of distinct sites)
Distribution of variant AFs across samples
We quantify the degree of AF heterogeneity (does the same variant occur with high AF in some samples, but low AF in others?) using the Coefficient of Variation (CoV) of the AF distribution. Variants that occur in only a small fraction of the samples (low population frequency, PF) can occur at variable intra-host AF (high CoV), whereas variants with higher PF tend to occur at similar AF (either high or low) in different samples.
Panels (one row per dataset): mean vs. CoV; CoV vs. PF.
Figure 5. The relationship between intra-host AF mean and CoV and population frequency. Colors are the same as in Fig. 4
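For reference, the two quantities plotted in Fig. 5 can be computed as in the sketch below. It rests on two assumptions not stated in the text: the CoV uses the population standard deviation, and only samples in which the variant was called contribute an AF value. The function names are hypothetical.

```python
from statistics import mean, pstdev

def af_cov(afs):
    """Coefficient of variation (standard deviation / mean) of a variant's AFs."""
    m = mean(afs)
    return pstdev(afs) / m if m > 0 else float("nan")

def population_frequency(n_samples_with_variant, n_samples_total):
    """Fraction of samples in a dataset that carry the variant (PF)."""
    return n_samples_with_variant / n_samples_total

# Toy AF vectors, not real data.
stable_high = [0.98, 0.99, 1.00]   # similar AF everywhere: low CoV
bimodal = [0.05, 0.06, 0.95]       # low in some samples, high in others: high CoV

print(round(af_cov(stable_high), 4), round(af_cov(bimodal), 2))
```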
For context, we provide examples of individual variants that illustrate different combinations of AF and PF in Fig. 6 below. This pattern of a much higher number of intra-host variants that do not become segregating mutations at the population level is common in viruses, and is generally consistent with largely neutral intra-host evolutionary dynamics.
23403 A→G, S/D614G Common fixed variant (high AF, high PF, low variance in AF)
7507 A→C, nsp3/K1596N Common low frequency variant (low AF, high PF, high variance in AF)
23086 C→T, S/Y508Y Rare bimodal variant (low/high AF, low PF, high variance in AF)
Figure 6. Examples of individual variants in different PF and AF classes.
Spatial distribution of variants across the genome
A graphical summary of variant density across the genome and genes/products shows that synonymous and non-synonymous variants are dispersed across the entire genome with some cold spots and some hot spots, shown in Fig. 7 below:
Figure 7. Spatial distribution of variants by functional class (top) and variant type (bottom). The height of each marker is Coefficient of Variation for Alternative Allele Frequency at a given site.
Across all datasets, several accessory genes (ORF3a, ORF7a, ORF8) had a higher than genome-average density of non-synonymous variants.
Figure 8. Spatial density of variants per gene/product
When considering variants at all allele frequencies, the dominant patterns of co-occurrence are clade-segregating sites in the data, i.e. high frequency variants that exist in strong linkage disequilibrium (e.g. the 241/3037/14408/23403/25563 set seen as thick vertical lines in the plots below).
Figure 9. Dot-plot of observed variants in the “Boston” dataset; rows – samples, columns – genomic coordinates; samples are arranged by hierarchical clustering. Limited to variants that occur in at least 4 samples.
Figure 10. Dot-plot of observed variants in the COG-UK Post dataset; rows – samples, columns – genomic coordinates; samples are arranged by hierarchical clustering. Limited to variants that occur in at least 3 samples.
A more interesting pattern may be observed if we restrict our attention to relatively common low frequency variants, among which there are several groups that co-occur in multiple samples (all exclusively at low frequencies).
Figure 11. Dot-plot of observed variants in the “Boston” dataset, restricted to variants that appear only at AF≤10% and occur in at least 4 samples each. Variants are partitioned into 10 clusters by K-medoids with the Hamming distance on AF vectors; the cluster with 8 variants is highlighted.
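The clustering step named in the caption can be illustrated with a small, dependency-free K-medoids sketch. It is not the code used for the figure: it assumes each variant is represented by a presence/absence vector across samples (a binarized AF vector), uses simple alternating assignment and medoid updates, and all names are hypothetical.

```python
import random

def hamming(u, v):
    """Number of samples in which two presence/absence vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def k_medoids(vectors, k, seed=0, max_iter=100):
    """Partition vectors into k clusters; returns {medoid index: member indices}."""
    n = len(vectors)
    dist = [[hamming(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    medoids = random.Random(seed).sample(range(n), k)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid (a medoid stays with itself).
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters

# Toy data: six variants (rows) across six samples (columns), two obvious groups.
variants = [
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 1, 1],
]
clusters = k_medoids(variants, k=2)
groups = sorted(sorted(members) for members in clusters.values())
```

Alternating updates of this kind can settle in a local optimum on harder data, so in practice several random restarts (different `seed` values) would be compared.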
A cluster of eight low frequency variants occurred in 8 samples (the probability of this occurring by chance is < 10⁻⁸). These variants were:
|Nucleotide Variant||Sample count||Effect|
No similar low-AF clusters were detected in the COG-UK data, but a cluster of two medium-AF variants (9096:C→T, 29692:G→T) co-occurred 3 times (expected < 0.01).
One possible explanation for the co-occurrence of low frequency mutations would be multiple infection, but it is not entirely clear why these “groups” of mutations would occur only at low frequency.
Variants of concern in intrahost context
The emergence of N501Y lineages, starting with the B.1.1.7 lineage in the UK, raised intriguing questions about the genesis of this lineage, including the hypothesis that the variant arose in a chronically infected immunocompromised host. We were interested in how many of the clade-defining mutations were detectable at subconsensus allele frequencies in the three datasets.
Specifically, we analyzed the overlap between our data and five distinct mutation sets:
- Receptor binding domain mutations from Greaney et al. 2020 (called “Bloom” after the last author in the rest of this document).
Below is the graphical summary in the three datasets. The most interesting sites here are the ones that have high AF CoV and are present in multiple samples:
Figure 12. Variants of concern in intrahost context. Size of marker is proportional to the number of samples containing the variant. Two big circles in the COG-Post dataset correspond to del3 from B.1.1.7 and L18F from P.1. Horizontal red bars delineate the five different mutation sets.
Table Overlap with VOC sites
Sites under selection in intrahost context
We wanted to see whether any intrahost variants identified in this study are also shown to be under persistent or episodic positive selection. We defined sites under positive selection as those identified with the FEL and MEME methods at a significance cutoff of 0.0001. There was a total of 306 such sites. Because selection analysis identifies codons (not individual genome positions) responsible for potential selective amino acid changes, we considered all nucleotide substitutions falling within the boundaries of codons showing the signature of selection.
First, we considered all sites. There were 47, 197, and 428 variants overlapping with codons under selection in the “Boston”, COG-Pre, and COG-Post datasets, respectively:
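The codon-to-nucleotide mapping implied above can be sketched as follows. The helper names are hypothetical, coordinates are 1-based, and the sketch ignores the ribosomal frameshift in ORF1ab. The S gene start of 21563 is the coordinate in the NC_045512.2 reference; with it, codon 614 of S spans positions 23402 to 23404, which contains the 23403 A→G (S/D614G) site.

```python
def codon_span(gene_start, codon_index):
    """Genome coordinates (first, last) of a codon; both arguments are 1-based."""
    first = gene_start + 3 * (codon_index - 1)
    return first, first + 2

def variants_in_selected_codons(variant_positions, selected_codons, gene_starts):
    """Keep variant positions that fall inside any codon flagged as selected.

    `selected_codons` is a list of (gene, codon_index) pairs and `gene_starts`
    maps gene name to the genome coordinate of its first base (assumed inputs).
    """
    spans = [codon_span(gene_starts[gene], codon) for gene, codon in selected_codons]
    return [p for p in variant_positions if any(lo <= p <= hi for lo, hi in spans)]

gene_starts = {"S": 21563}          # NC_045512.2 coordinate of the S gene start
selected = [("S", 614)]             # illustrative codon, not a result of the selection analysis
hits = variants_in_selected_codons([23403, 23086], selected, gene_starts)
```

Matching on the whole codon span rather than a single position is what lets any of the three nucleotide positions of a selected codon count as an overlap.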
Figure 13. Variants overlapping with codons displaying signature of positive selection. Size of the marker corresponds to the proportion of samples in each dataset carrying a particular variant. Colors: green = synonymous, orange = non-synonymous, magenta = indels.
Next, we considered only sites with low or intermediate allele frequencies, thus avoiding fixed variants. There were 10, 130, and 150 such variants in the “Boston”, COG-Pre, and COG-Post datasets, respectively:
A number of potentially interesting sites were identified in this analysis:
Deletion cluster within nsp6
There is a cluster of deletions in the vicinity of site 11,071 (nsp6/37) showing evidence for pervasive positive selection (see DataMonkey COVID-19 portal). While deletion variants are much more frequent in the COG (ampliconic) datasets, this cluster is also present in the “Boston” dataset derived from RNAseq library preparations:
High PF/low AF sites in S and ORF3a in “Boston” dataset
There are two sites present in a large fraction of samples (high population frequency, PF) at low allele frequencies within the “Boston” dataset. Both sites (22,254 S/I231M and 25,842 ORF3a/T151P) are present in > 75% of “Boston” samples, with maximum AFs of 17% and 24%, respectively.
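A screen for sites of this kind (present in most samples but never at high AF) could look like the sketch below. The cutoffs `min_pf` and `max_af` are illustrative parameters rather than values used in the analysis, the input layout is assumed, and the AF values are toy numbers.

```python
def high_pf_low_af(afs_by_variant, n_samples, min_pf=0.75, max_af=0.50):
    """Return {variant: (PF, max AF)} for variants seen in many samples at low AF.

    `afs_by_variant` maps a variant to the AFs observed in the samples
    that carry it (assumed layout).
    """
    hits = {}
    for variant, afs in afs_by_variant.items():
        pf = len(afs) / n_samples
        if pf > min_pf and max(afs) < max_af:
            hits[variant] = (pf, max(afs))
    return hits

# Toy data for a dataset of 4 samples, not real measurements.
afs_by_variant = {
    "siteA": [0.06, 0.08, 0.17, 0.05],   # in every sample, always at low AF
    "siteB": [0.99, 0.98, 1.00, 0.97],   # in every sample, but fixed
    "siteC": [0.06],                     # low AF, but in a single sample
}
hits = high_pf_low_af(afs_by_variant, n_samples=4)
```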