Masking strategies for SARS-CoV-2 alignments
Nicola De Maio*, Conor Walker, Rui Borges, Lukas Weilguny, Greg Slodkowicz, Nick Goldman
*[email protected]
In recent weeks, we have seen many analyses of SARS-CoV-2 genome alignments being published or posted as preprints. Some authors, including us (Issues with SARS-CoV-2 sequencing data), have raised questions regarding the trustworthiness of some of the columns of these alignments.
We would like to offer this space for open discussion of existing and new strategies for masking columns of SARS-CoV-2 genome alignments. As far as we can tell, how best to do this is still an open question, and we welcome suggestions and comments.
We also want to propose a common format for sharing and more easily using and combining such filtering strategies. We include in this post our proposed masking sites (and those so far recommended by @matthew.parker in our post) in VCF format. This format (see below) succinctly summarizes which positions of the genome are of relevance, their associated variants, and the reasons why such entries are in the file.
Suggestions and further additions of putatively problematic sites are very gratefully received. We will update our VCF with contributions from other groups and from our own further analyses. We also invite other groups to propose their own masking files or strategies, where these differ significantly from ours, so that others can use them in downstream analyses and so that third parties interested in testing different masking strategies can more easily replicate results.
Our most recent VCF file can be accessed here. (Archive versions are kept here.)
Additionally, a human-friendly (markdown) version is available here.
The content of the VCF file is described below:
Comment lines within the VCF file begin with “##”.
Non-comment lines have the following columns:
- #CHROM: chromosome (in this case MN908947.3, the name of the reference genome used).
- POS: position within the reference (in our case MN908947.3) to which the line refers.
- ID: included to comply with the VCF standard but unused for now (all entries are “.”).
- REF: reference allele at this position.
- ALT: alternative alleles at this position, separated by commas (given as IUPAC ambiguity codes).
- QUAL: for now unused.
- FILTER: masking recommendation.
  - “mask”: sites that we recommend masking.
  - “caution”: sites that we do not recommend masking, but for which we advise caution due to preliminary analyses showing high homoplasy or other characteristics that may mislead phylogenetic/phylodynamic analyses.
- INFO: semicolon-separated key=value pairs of metadata for each site, including:
  - SUB: the person/lab who proposed the filtering. Currently these are:
    - “NDM”: Nicola De Maio (our suggestions)
    - “RCD”: Russell Corbett-Detig ([email protected])
    - “MP”: @matthew.parker
  - EXC: list of reasons for including the entry. Current reasons for recommending mask/caution are:
    - “seq_end”: alignment ends are affected by low coverage and high error rates.
    - “ambiguous”: sites which show an excess of ambiguous basecalls relative to the number of alternative alleles, often emerging from a single country or sequencing laboratory.
    - “amended”: previous sequencing errors which now appear to have been fixed in the latest versions of the GISAID sequences, at least in sequences from some of the sequencing laboratories.
    - “highly_ambiguous”: sites with a very high proportion of ambiguous characters, relative to the number of alternative alleles.
    - “highly_homoplasic”: positions which are extremely homoplasic; it is not always clear whether these are hypermutable sites or sequencing artefacts.
    - “homoplasic”: homoplasic sites, with many mutation events needed to explain a relatively small alternative allele count.
    - “interspecific_contamination”: cases (so far only one instance) in which the known sequencing issue is due to contamination from genetic material that does not have SARS-CoV-2 origin.
    - “nanopore_adapter”: cases in which the known sequencing issue is due to the adapter sequences in nanopore reads.
    - “narrow_src”: variants which are found in sequences from only a few sequencing labs (usually two or three), possibly as a consequence of the same artefact reproduced independently.
    - “neighbour_linked”: proximal variants displaying near perfect linkage.
    - “single_src”: only observed in samples from a single laboratory.
  - SRC_COUNTRY: source country/countries of samples with the variant.
  - SRC_LAB: source laboratory/laboratories of samples with the variant, ordered to match the respective values in SRC_COUNTRY.
  - GENE: gene within which the variant occurs (if any).
  - AA_POS: amino acid position within the gene of the considered variant.
  - AA_REF: reference amino acid.
  - AA_ALT: alternative amino acid alleles separated by commas (one entry for each value in column “ALT”; IUPAC ambiguity code).
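As a rough illustration of how a file in this format might be consumed, here is a minimal Python sketch that reads the columns described above and replaces the flagged positions of a sequence with “N”. It is not part of our pipeline: the names `MaskSite`, `parse_mask_vcf` and `mask_sequence` are hypothetical, the two example records are invented for the demonstration, and the code assumes tab-separated fields and a sequence already aligned to MN908947.3 coordinates (no insertions relative to the reference).

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class MaskSite:
    pos: int              # 1-based position in the reference (MN908947.3)
    ref: str              # reference allele
    alt: List[str]        # alternative alleles (IUPAC codes)
    filter_tag: str       # "mask" or "caution"
    info: Dict[str, str]  # INFO key=value pairs (SUB, EXC, GENE, ...)


def parse_mask_vcf(text: str) -> List[MaskSite]:
    """Parse the non-comment lines of a masking VCF (assumed tab-separated)."""
    sites = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skips "##" comment lines and the "#CHROM" header line
        _chrom, pos, _id, ref, alt, _qual, filter_tag, info = line.split("\t")[:8]
        # INFO is a semicolon-separated list of key=value pairs.
        info_dict = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        sites.append(MaskSite(int(pos), ref, alt.split(","), filter_tag, info_dict))
    return sites


def mask_sequence(seq: str, sites: Sequence[MaskSite],
                  filters: Sequence[str] = ("mask",), mask_char: str = "N") -> str:
    """Replace every position whose FILTER value is in `filters` with `mask_char`.

    Assumes `seq` is in reference coordinates, so POS maps directly to an index.
    """
    chars = list(seq)
    for site in sites:
        if site.filter_tag in filters and 1 <= site.pos <= len(chars):
            chars[site.pos - 1] = mask_char  # VCF positions are 1-based
    return "".join(chars)


# Invented toy records, for illustration only (not taken from the real file).
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "MN908947.3\t2\t.\tT\tN\t.\tmask\tSUB=NDM;EXC=seq_end\n"
    "MN908947.3\t5\t.\tA\tG\t.\tcaution\tSUB=NDM;EXC=homoplasic\n"
)

sites = parse_mask_vcf(EXAMPLE_VCF)
masked = mask_sequence("ATTAAAGG", sites)
print(masked)  # ANTAAAGG
```

By default only “mask” sites are replaced; passing `filters=("mask", "caution")` would also mask the sites flagged for caution, which may be useful when testing how sensitive an analysis is to the stricter choice.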