Masking strategies for SARS-CoV-2 alignments
Nicola De Maio*, Conor Walker, Rui Borges, Lukas Weilguny, Greg Slodkowicz, Nick Goldman
In recent weeks, we have seen many analyses of SARS-CoV-2 genome alignments being published or posted as preprints. Some authors, including us (Issues with SARS-CoV-2 sequencing data), have raised questions regarding the trustworthiness of some of the columns of these alignments.
We would like to propose this space for open discussion of new and possible strategies for masking SARS-CoV-2 genome alignment columns. As far as we can tell, this is still an open question, and we welcome suggestions and comments.
We also want to propose a common format for sharing and more easily using and combining such filtering strategies. We include in this post our proposed masking sites (and those so far recommended by @matthew.parker in our post) in VCF format. This format (see below) succinctly summarizes which positions of the genome are of relevance, their associated variants, and the reasons why such entries are in the file.
Suggestions and further additions of purported problematic sites are very gratefully received. We will update our VCF with contributions from other groups and from our own further analyses. We also invite other groups to propose their own masking files or strategies, where they differ significantly from ours, so that others can use them in downstream analyses and so that third parties interested in testing different masking strategies can more easily replicate results.
The content of the VCF file is described below:
Comment lines within the VCF file begin with “##”.
Non-comment lines have the following columns:
- “#CHROM”: chromosome (in this case MN908947.3, the name of the reference genome used).
- “POS”: position within the reference (in our case MN908947.3) to which the line refers.
- “ID”: included to comply with the VCF standard but unused for now (all entries are “.”).
- “REF”: reference allele at this position.
- “ALT”: alternative alleles at this position, separated by commas (IUPAC ambiguity code).
- “QUAL”: for now unused.
- “FILTER”: masking recommendation.
  - “mask”: sites that we recommend masking.
  - “caution”: sites that we do not recommend masking, but for which we advise caution due to preliminary analyses showing high homoplasy or other characteristics that may mislead phylogenetic/phylodynamic analyses.
- “INFO”: the person/lab who proposed the filtering. Current entries are:
  - NDM: Nicola De Maio (our suggestions)
  - MP: @matthew.parker
- “EXC”: comma-separated list of reasons for including the entry. Current reasons for exclusion (and masking recommendation) are:
  - seq_end: low-reliability alignment ends
  - no_sig: homoplasy had no phylogenetic signal
  - single_src: the homoplasy mostly originates from one sequencing lab
  - highly_homoplasic: the homoplasy seemed to occur an extreme number of times
  - highly_ambiguous: high levels of ambiguity relative to the prevalence of alternative alleles, with the minor allele and the ambiguity predominantly coming from one or a few sources
  - MNM: sites seemingly affected by a multinucleotide mutation
  - neighbour_linked: proximal variants displaying near-perfect linkage
  - seq_err: systematic sequencing errors, as suggested by @matthew.parker
  - homoplasic: the mutation seemed to occur multiple times
- “SRC”: in case of lab-enriched mutations/artefacts, this represents the particular labs where enrichment is observed.
- “GENE”: gene within which the variant occurs (if any).
- “AAPOS”: amino acid position within the gene of the considered variant.
- “REFAA”: reference amino acid.
- “ALTAA”: alternative amino acid alleles separated by commas (one entry for each value in column “ALT”; IUPAC ambiguity code).
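As a rough illustration of how a file in this format could be used, here is a minimal Python sketch that collects the positions flagged in the “FILTER” column and replaces them with “N” in a sequence. It relies only on the “POS” and “FILTER” columns described above. The function names (`read_mask_positions`, `mask_sequence`), the choice of “N” as the masking character and the toy VCF rows are our own illustrative assumptions and are not taken from the actual file. It also assumes that the sequences are already aligned in MN908947.3 coordinates, so that alignment column i corresponds to reference position i.

```python
from typing import AbstractSet, Iterable, Set


def read_mask_positions(
    vcf_lines: Iterable[str],
    wanted: AbstractSet[str] = frozenset({"mask"}),
) -> Set[int]:
    """Return the 1-based reference positions whose FILTER value is in `wanted`."""
    positions: Set[int] = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Skip blank lines, "##" comment lines and the "#CHROM" header line.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        pos = int(fields[1])   # POS column (1-based)
        filt = fields[6]       # FILTER column: "mask" or "caution"
        if filt in wanted:
            positions.add(pos)
    return positions


def mask_sequence(seq: str, positions: AbstractSet[int], mask_char: str = "N") -> str:
    """Replace the given 1-based positions of an aligned sequence with `mask_char`.

    Assumes `seq` is in reference coordinates (hypothetical helper, not part
    of any existing tool).
    """
    chars = list(seq)
    for pos in positions:
        if 1 <= pos <= len(chars):
            chars[pos - 1] = mask_char
    return "".join(chars)


# Toy input: these rows are invented for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "MN908947.3\t2\t.\tT\tC\t.\tmask\t.",
    "MN908947.3\t5\t.\tA\tG\t.\tcaution\t.",
]

to_mask = read_mask_positions(example_vcf)
masked = mask_sequence("ATTAAAGG", to_mask)
```

Passing `wanted={"mask", "caution"}` would also mask the sites flagged as “caution”, which may be useful when testing how sensitive a downstream analysis is to a stricter masking strategy.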