Masking strategies for SARS-CoV-2 alignments

Thanks @conorwalker! I’ll try this out today and let you know if I run into any issues. Is your issue tracker the best way to follow up if so?

@goldman initiated a DM with me this morning, and I replied with some thoughts about the proposal to use VCF as a common format for alignment masks. I’ll move them over to this thread in case you or others have thoughts, and because the initial post in this topic mentions wanting to start a discussion.

I’ve been poking around a little bit this morning to see if there are any existing formats for defining masks, and I’m not seeing much, so I think you might be onto something. I did see some information about this in the Geneious manual here (I’ve never used Geneious myself). They talk about defining masks as “Nexus-style CharSets”, which it seems can be specified as ranges (e.g., 1-3 6 8-9) or as a binary vector (e.g., 1110010110); in both of those examples, alignment positions 4, 5, 7, and 10 would be retained post-masking. This is not nearly as informative as your proposed VCF format, but it has the benefit of being easier to parse than VCF (not that that’s a good reason to use it, just trying to think through pros and cons).
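
To make the two encodings concrete, here is a minimal Python sketch of how I read them. The function names are my own (not from Geneious or Nexus), and I’m assuming, based on the example above, that a listed position or a 1 means “masked”:

```python
def parse_charset_ranges(spec):
    """Parse a range string like '1-3 6 8-9' into a set of 1-based positions."""
    positions = set()
    for token in spec.split():
        if "-" in token:
            start, end = token.split("-")
            positions.update(range(int(start), int(end) + 1))
        else:
            positions.add(int(token))
    return positions


def parse_charset_vector(bits):
    """Parse a binary string like '1110010110' into the 1-based positions holding '1'."""
    return {i for i, bit in enumerate(bits, start=1) if bit == "1"}


def apply_mask(sequence, masked):
    """Drop the masked 1-based alignment columns from one aligned sequence."""
    return "".join(c for i, c in enumerate(sequence, start=1) if i not in masked)


masked = parse_charset_ranges("1-3 6 8-9")
# Both encodings describe the same set of masked positions.
assert masked == parse_charset_vector("1110010110")
# Only columns 4, 5, 7, and 10 of this toy sequence survive.
print(apply_mask("ACGTACGTAC", masked))
```

Both parsers are only a few lines, which is the “easier to parse” point; neither encoding has anywhere to record why a position is masked.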

The only mask format I’ve used before is the “Lanemask” for the Greengenes 16S reference database. This is simply a vector of 1s and 0s with length equal to the number of positions in the alignment. The Lanemask is represented as a single line in a file. So this seems like a variant of the Nexus CharSet concept (though the meaning of 1 versus 0 is reversed I think).

A big drawback to both of these is that the positions to mask are defined as alignment positions, which is extremely fragile. Your approach of specifying a reference sequence is much better.
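
A rough sketch of why reference coordinates are more robust: the same reference position can land in a different column in every alignment, but it can always be recovered by walking the aligned reference and counting non-gap characters. The function name and the toy sequences below are made up for illustration, and I’m assuming `-` is the gap character:

```python
def reference_to_alignment_columns(aligned_reference, gap="-"):
    """Map 1-based ungapped reference positions to 0-based alignment columns."""
    mapping = {}
    ref_pos = 0
    for col, char in enumerate(aligned_reference):
        if char != gap:
            ref_pos += 1
            mapping[ref_pos] = col
    return mapping


# The same toy reference (ACGT) as it appears in two different alignments.
alignment_a = reference_to_alignment_columns("ACGT")
alignment_b = reference_to_alignment_columns("A-CG-T")

# Reference position 3 sits in a different column in each alignment, so a
# mask written in alignment columns would only be valid for one of them.
print(alignment_a[3], alignment_b[3])
```

A mask keyed on reference position 3 applies to both alignments; a mask keyed on column 2 would silently hit the wrong site in the second one.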

I also like the idea of multiple levels proposed here (caution versus mask), and it’s useful to record information describing why a position is being masked. I can imagine a software interface where you define a level that you want to mask at, and that level and all “higher” levels are masked. For example, a parameter like --mask-level caution would strip all positions that are tagged as caution or mask, while --mask-level mask would strip only the positions tagged as mask. This could be really helpful for benchmarking purposes. I think it might make sense for the multiple levels idea to be generalized to numbers, so additional levels are easy to define.
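
Here is a minimal sketch of what I mean by numeric levels. Everything in it is hypothetical: the numeric ranks, the function name, and the toy positions are made up for illustration and are not part of the proposal:

```python
# Hypothetical numeric ranks: a higher number means a stronger recommendation to mask.
MASK_LEVELS = {"caution": 1, "mask": 2}


def positions_to_mask(tagged_sites, mask_level):
    """Return the sorted positions tagged at `mask_level` or any higher level."""
    threshold = MASK_LEVELS[mask_level]
    return sorted(
        pos for pos, tag in tagged_sites.items() if MASK_LEVELS[tag] >= threshold
    )


# Made-up reference positions and tags.
sites = {10: "mask", 25: "caution", 40: "mask"}

print(positions_to_mask(sites, "caution"))  # caution and mask sites
print(positions_to_mask(sites, "mask"))     # mask sites only
```

With numbers underneath, adding a new level is just another entry in the lookup table, and the thresholding logic doesn’t change.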

Overall I like the proposal here. Two real questions about it though:

  1. Is anyone aware of an existing format that should be used (avoiding defining yet another bioinformatics file format is always good if possible)?
  2. Is there a simpler format that could be used to facilitate parsing by different tools and creation of these files? It seems like some information isn’t necessarily needed here (e.g., ID and QUAL aren’t used, things like REF and ALT don’t seem essential for masking, and it seems like #CHROM will always have the same value). If different groups end up creating these files, I could see people starting to fill in some of the non-essential fields with placeholder values, which can get confusing for users. A simple tab-separated text file that just contains the essential information (which I realize would be a new file format) might be easier. I see the essential information as what you have in #CHROM, POS, FILTER, EXC, and maybe INFO.
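
To illustrate the second question, here is a sketch of what reading such a tab-separated file could look like in Python. The column names (`reference`, `position`, `level`, `reason`) and the rows are placeholders I made up for the example, not a proposed specification:

```python
import csv
import io

# Hypothetical column names and made-up rows, purely for illustration.
EXAMPLE = (
    "reference\tposition\tlevel\treason\n"
    "ref1\t10\tmask\texample_reason_a\n"
    "ref1\t25\tcaution\texample_reason_b\n"
)


def read_mask_table(handle):
    """Read the hypothetical four-column TSV into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {
            "reference": row["reference"],
            "position": int(row["position"]),  # 1-based reference coordinate
            "level": row["level"],
            "reason": row["reason"],
        }
        for row in reader
    ]


records = read_mask_table(io.StringIO(EXAMPLE))
for record in records:
    print(record["position"], record["level"], record["reason"])
```

The trade-off is the one noted above: this needs nothing beyond the standard library to read, but it is a new format rather than a reuse of VCF.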

If we had a good format for this I could see it being used in other places (e.g., I work on the QIIME 2 microbiome bioinformatics platform - a good format defining masks for marker sequence alignments such as 16S or ITS could be really helpful for the work that we do).