Masking strategies for SARS-CoV-2 alignments

Masking strategies for SARS-CoV-2 alignments

Nicola De Maio*, Conor Walker, Rui Borges, Lukas Weilguny, Greg Slodkowicz, Nick Goldman
*demaio@ebi.ac.uk

In recent weeks, we have seen many analyses of SARS-CoV-2 genome alignments being published or posted as preprints. Some authors, including us (Issues with SARS-CoV-2 sequencing data), have raised questions regarding the trustworthiness of some of the columns of these alignments.

We would like to propose this space for open discussion of new and possible strategies for masking SARS-CoV-2 genome alignment columns. As far as we can tell, this is still an open question, and we welcome suggestions and comments.

We also want to propose a common format for sharing and more easily using and combining such filtering strategies. We include in this post our proposed masking sites (and those so far recommended by @matthew.parker in our post) in VCF format. This format (see below) succinctly summarizes which positions of the genome are of relevance, their associated variants, and the reasons why such entries are in the file.

Suggestions and further additions of purported problematic sites are very gratefully received. We will update our VCF with contributions from other groups and from our own further analyses. We also invite other groups to propose their own masking files/strategy, where they significantly differ from ours, for others to use in their own downstream analyses and to ease replication of results/analyses from third parties interested in testing different masking strategies.

Our most recent VCF file can be accessed here. (Archive versions are kept here.)
Additionally, a human-friendly (markdown) version is available here.

The content of the VCF file is described below:

Comment lines within the VCF file begin with “##”.

Non-comment lines have the following columns:

  • #CHROM”: chromosome (in this case MN908947.3, the name of the reference genome used).
  • “POS”: position within the reference (in our case MN908947.3) to which the line refers to.
  • “ID”: column is included to comply with vcf standards but is unused for now (all entries are “.”).
  • “REF”: reference allele at this position.
  • “ALT”: alternative alleles at this position, separated by commas (IUPAC ambiguity code).
  • “QUAL”: for now unused.
  • “FILTER”: masking recommendation.
    • “mask”: sites that we recommend masking
    • “caution” sites that we do not recommend masking, but for which we advise caution due to preliminary analyses showing high homoplasy or other characteristics that may mislead phylogenetic/phylodynamic analyses.
  • “INFO”: the person/lab who proposed the filtering. Current entries are:
  • “EXC”: comma-separated list of reasons for including the entry. Current reasons for exclusion (and masking recommendation) are:
    • seq_end: low-reliability alignment ends
    • no_sig: homoplasy had no phylogenetic signal
    • single_src: the homoplasy mostly originates from one sequencing lab
    • highly_homoplasic: the homoplasy seemed to occur an extreme number of times
    • highly_ambiguous: high levels of ambiguity relative to the prevalence of alternative alleles, minor allele and ambiguity prevalently from one or few sources
    • MNM: sites seemingly affected by a multinucleotide mutation
    • neighbour_linked: proximal variants displaying near perfect linkage
    • seq_err: systematic sequencing errors, as suggested by @matthew.parker
    • homoplasic: means that the mutation seemed to occur multiple times
  • “SRC”: in case of lab-enriched mutations/artefacts, this represents the particular labs where enrichment is observed.
  • “GENE”: gene within which the variant occurs (if any).
  • “AAPOS”: amino acid position within the gene of the considered variant.
  • “REFAA”: reference amino acid.
  • “ALTAA”: alternative amino acid alleles separated by commas (one entry for each value in column “ALT”; IUPAC ambiguity code).
2 Likes

UPDATE: Post above mildly edited on 27/5/20, to reflect updated results from new analyses and more data. The most recent VCF will always be here (full VCF) and here (human-friendly), and the archive will remain here.

@goldman, @NicolaDeMaio thanks for sharing this! I’m interested in apply this mask in an analysis that I’m currently working on. Do you have recommendations for software that can be used to apply a VCF mask like this to an alignment? I’m imagining providing your VCF and my aligned fasta file as input, and getting a fasta file as output with the masked positions removed.

Hi @gregcaporaso, the following Python 3 script should do this for you:
https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/src/mask_alignment_using_vcf.py

It requires an input alignment in FASTA format (specified with -i), an input VCF used for masking (specified with -v), and an output filename (specified with -o). By default, only “mask” recommendation sites are masked with “N”, but sites we tag as “caution” can also be included with either --both or --caution. For example:

python mask_alignment_using_vcf.py -i input_msa.fasta -o output_masked_msa.fasta -v problematic_sites_sarsCov2.vcf

Optionally, you can also specify alternate mask characters (with -n, default “N”), or that mask sites should be removed (with -d).

Please note that the reference SARS-CoV-2 genome sequence needs to be included in the alignment, and will be used to assign masking annotations to the correct alignment columns. The reference sequence record is identified from the FASTA headers. By default we assume that “MN908947” is contained within one of the headers and we use this string to identify the reference sequence record. If this string is not in the header for the reference sequence, you can specify an alternate identifying string with --reference_id.

All options should be self explanatory (with -h), hope that helps!

1 Like

Thanks @conorwalker! I’ll try this out today and let you know if I run into any issues. Is your issue tracker the best way to follow up if so?

@goldman initiated a DM with me this morning and I replied with some thoughts about the proposal to use VCF as a common format for alignment masks. I’ll move them over to this thread though in case you or others have thoughts and since the initial post in this topic mentions wanting to initiate a discussion.

I’ve been poking around a little bit this morning to see if there are any existing formats for defining masks and I’m not seeing much so I think you might be onto something. I’m see some information about this in the Geneious manual here (I’ve never used Geneious myself). They’re talking about defining masks as “Nexus-style CharSets”, which seems like it can be be specified as ranges (e.g., 1-3 6 8-9) or a binary vector (e.g., 1110010110) where in both of those examples, alignment positions 4, 5, 7, and 10 would be retained post-masking. This is not nearly as informative as your proposed VCF format but it has the benefit of being easier to parse than VCF (not that that’s a good reason to use it, just trying to think through pros and cons).

The only mask format I’ve used before is the “Lanemask” for the Greengenes 16S reference database. This is simply a vector of 1s and 0s with length equal to the number of positions in the alignment. The Lanemask is represented as a single line in a file. So this seems like a variant of the Nexus CharSet concept (though the meaning of 1 versus 0 is reversed I think).

A big drawback to both of these is that the positions to mask are defined as alignment positions, which is extremely fragile. Your approach of specifying a reference sequence is much better.

I also like the idea of multiple levels proposed here (caution versus mask), and it’s useful to record information describing why a position is being masked. I can imagine a software interface where you define a level that you want to mask at, and that level and all “higher” levels are masked. For example, a parameter like --mask-level caution would strip all positions that are tagged as caution or mask, while --mask-level mask would strip only the positions tagged as mask. This could be really helpful for benchmarking purposes. I think it might make sense for the multiple levels idea to be generalized to numbers, so additional levels are easy to define.

Overall I like the proposal here. Two real questions about it though:

  1. Is anyone aware of an existing format that should be used (avoiding defining yet another bioinformatics file format is always good if possible)?
  2. Is there a simpler format that could be used to facilitate parsing by different tools and creation of these files? It seems like some information isn’t necessarily needed here (e.g., ID and QUAL aren’t used, things like REF and ALT don’t seem essential for masking, and it seems like #CHROME will always have the same value). If different groups end up creating these files I could see people starting to fill in some of the non-essential fields with placeholder values, which can get confusing for users. A simple tab-separated text file that just contains the essential information (which I realize would be a new file format) might be easier. I see the essential information as what you have in the #CHROME, POS, FILTER, EXC, and maybe INFO.

If we had a good format for this I could see it being used in other places (e.g., I work on the QIIME 2 microbiome bioinformatics platform - a good format defining masks for marker sequence alignments such as 16S or ITS could be really helpful for the work that we do).

Hi @gregcaporaso.

First off, any issues with the code that @conorwalker posted can indeed be addressed through the github issue tracker.

Regarding formats for masking: we don’t know of any formats that cover the things you’re talking about. Conceptually at least, there could be site-wise, sequence-wise, and site X sequence-wise masks you might want to use. There are indeed issues about indexing to a given alignment or a reference. Our VCF tries to capture everything we “know” about certain sites, and in that sense does have all the available info. So it seems to me that the missing link is general code to mask an alignment from a VCF, or else is code to convert a VCF to something else, and then further code to mask an alignment based on that. Personally, I’d need convincing that the latter approach was better, but there may well be use-cases I haven’t thought of.

Cheers,

@goldman

Thanks @goldman, and thanks again for the code @conorwalker! I ran into one very minor issue with the code, but was able to fix it. I submitted a pull request with my modification here. Other than that, I didn’t notice any issues.

I will likely end up writing some code for using this VCF-based mask for use in QIIME 2-based workflows. I’ll follow up when I have that in case it’s useful for others here.

Thanks for sharing your alignment mask and initiating this work on defining a format for these!

Code now updated as per your suggestions. Glad it’s useful. Please keep in touch about where you go with this!

Thanks @goldman, I’ll keep in touch about it.

Also, I realized that I haven’t really mentioned why I’m interested in this, beyond just trying the mask out on my data. A research interest of mine is ensuring retrospective reproducibility of computational steps in biological/genomics/microbiome/etc studies - in other words, building software that records its own notes on what you did, so that you or someone else could look back and reproduce a workflow even if you didn’t keep detailed notes about what you did or inadvertently left out a key piece. (This is what our retrospective data provenance tracking does in QIIME 2 - see an example here. The Provenance tab on that page shows in detail all of the steps that led to the creation of that interactive plot. Click on the boxes and circles in the Provenance tab to see details on the actions and the data, respectively.)

In my work on SARS-CoV-2, masking seems to be a step that is hard to track this information on as it’s often a very manual process (open an alignment in an alignment editor, remove alignment columns that don’t look right, maybe move some bases around, save as a fasta file, move on with your workflow). While it’d be great if we could just automate that whole process away (e.g., do it just by identifying very high entropy alignment columns and/or columns with a very high gap frequency), it seems like we’re not there yet and we should be looking at our alignments before moving on anyway. In lieu of that, having a standard format (like what you’re proposing here) for describing a mask is an essential part of being able to record information about what was done so that we can reproduce it. An alignment editor could create this (kind of like an undo history in a spreadsheet editor), and we can have tools that apply it like the script your team has provided here.

Anyway, if other folks are interested in this, definitely reach out! Maybe it’s something we could collaborate on to advance SARS-CoV-2 and other research.