Masking strategies for SARS-CoV-2 alignments

Thanks @goldman, I’ll keep in touch about it.

Also, I realized that I haven’t really mentioned why I’m interested in this, beyond just trying the mask out on my data. A research interest of mine is ensuring retrospective reproducibility of computational steps in biological, genomics, microbiome, and similar studies. In other words, building software that records its own notes on what you did, so that you or someone else could look back and reproduce a workflow even if you didn’t keep detailed notes or inadvertently left out a key step. (This is what our retrospective data provenance tracking does in QIIME 2: see an example here. The Provenance tab on that page shows in detail all of the steps that led to the creation of that interactive plot. Click on the boxes and circles in the Provenance tab to see details on the actions and the data, respectively.)

In my work on SARS-CoV-2, masking seems to be a step where this information is hard to track, since it’s often a very manual process (open an alignment in an alignment editor, remove alignment columns that don’t look right, maybe move some bases around, save as a FASTA file, move on with your workflow). While it’d be great if we could just automate that whole process away (e.g., by identifying alignment columns with very high entropy and/or a very high gap frequency), it seems like we’re not there yet, and we should be looking at our alignments before moving on anyway. Given that, having a standard format for describing a mask (like what you’re proposing here) is an essential part of recording what was done so that we can reproduce it. An alignment editor could create this record (kind of like an undo history in a spreadsheet editor), and we can have tools that apply it, like the script your team has provided here.
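
To make the entropy / gap-frequency idea above a bit more concrete, here is a rough Python sketch of what I have in mind. To be clear, this is just an illustration and not the mask format or script discussed in this thread: the function names (`column_stats`, `propose_mask`, `apply_mask`) and the thresholds are made up for the example, and the thresholds in particular would need tuning against real alignments.

```python
import math
from collections import Counter


def column_stats(seqs, gap_chars="-"):
    """Yield (gap_frequency, entropy_in_bits) for each alignment column.

    Assumes `seqs` is a list of equal-length aligned sequences (strings).
    Entropy is computed over non-gap characters only.
    """
    n = len(seqs)
    for column in zip(*seqs):
        gaps = sum(1 for c in column if c in gap_chars)
        counts = Counter(c.upper() for c in column if c not in gap_chars)
        total = sum(counts.values())
        entropy = 0.0
        for k in counts.values():
            p = k / total
            entropy -= p * math.log2(p)
        yield gaps / n, entropy


def propose_mask(seqs, max_gap_freq=0.5, max_entropy=1.0):
    """Return 1-based positions of columns exceeding either threshold.

    The default thresholds are arbitrary placeholders, not recommendations.
    """
    return [
        pos
        for pos, (gap_freq, entropy) in enumerate(column_stats(seqs), start=1)
        if gap_freq > max_gap_freq or entropy > max_entropy
    ]


def apply_mask(seqs, positions, mask_char="N"):
    """Replace the given 1-based columns with `mask_char` in every sequence."""
    masked_cols = {p - 1 for p in positions}
    return [
        "".join(mask_char if i in masked_cols else c for i, c in enumerate(seq))
        for seq in seqs
    ]
```

The list of positions returned by `propose_mask` is the piece I would want recorded in provenance, whether it comes from an automated pass like this or from someone editing the alignment by hand, so that `apply_mask` (or an equivalent tool) can replay it later.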

Anyway, if other folks are interested in this, definitely reach out! Maybe it’s something we could collaborate on to advance SARS-CoV-2 and other research.