Masking strategies for SARS-CoV-2 alignments

Hi @gregcaporaso, the following Python 3 script should do this for you:
https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/src/mask_alignment_using_vcf.py

It requires an input alignment in FASTA format (specified with -i), an input VCF used for masking (specified with -v), and an output filename (specified with -o). By default, only “mask” recommendation sites are masked with “N”, but sites we tag as “caution” can also be included with either --both or --caution. For example:

python mask_alignment_using_vcf.py -i input_msa.fasta -o output_masked_msa.fasta -v problematic_sites_sarsCov2.vcf

Optionally, you can also specify an alternate mask character (with -n, default “N”), or have the mask sites removed from the alignment instead of masked (with -d).
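In case it helps to see the idea, here is a minimal sketch of the site selection and masking logic. This is not the code from the script linked above. The function names `read_sites` and `apply_mask` are made up for illustration, and I am assuming the VCF stores the “mask” / “caution” recommendation in the FILTER column (7th tab-separated field).

```python
def read_sites(vcf_lines, wanted=("mask",)):
    """Return 1-based genome positions whose FILTER value is in `wanted`.

    Assumption: the recommendation ("mask" or "caution") is stored in the
    FILTER column, i.e. the 7th tab-separated field of each record.
    """
    positions = []
    for line in vcf_lines:
        # Skip blank lines and VCF header lines ("##..." and "#CHROM...").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        pos, recommendation = int(fields[1]), fields[6]
        if recommendation in wanted:
            positions.append(pos)
    return positions


def apply_mask(seq, columns, mask_char="N", delete=False):
    """Mask (or delete) the given 0-based alignment columns of one sequence.

    Note: this sketch replaces every character in a masked column,
    including gap characters.
    """
    cols = set(columns)
    if delete:
        return "".join(c for i, c in enumerate(seq) if i not in cols)
    return "".join(mask_char if i in cols else c for i, c in enumerate(seq))
```

Passing `wanted=("mask", "caution")` to `read_sites` would select both kinds of sites, and `delete=True` in `apply_mask` drops the columns rather than overwriting them.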

Please note that the reference SARS-CoV-2 genome sequence needs to be included in the alignment, and will be used to assign masking annotations to the correct alignment columns. The reference sequence record is identified from the FASTA headers. By default we assume that “MN908947” is contained within one of the headers and we use this string to identify the reference sequence record. If this string is not in the header for the reference sequence, you can specify an alternate identifying string with --reference_id.
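To illustrate why the reference is needed: positions in the VCF are given in reference genome coordinates, but an alignment may contain gap columns, so genome position and alignment column can differ. A rough sketch of the mapping is below. Again, this is only an illustration and not the script's actual code; `find_reference` and `position_to_column` are hypothetical names, and I am assuming “-” is the gap character.

```python
def find_reference(records, reference_id="MN908947"):
    """Return the header of the single record whose header contains reference_id.

    `records` is assumed to be a dict mapping FASTA header -> aligned sequence.
    """
    matches = [header for header in records if reference_id in header]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one header containing {reference_id!r}, "
            f"found {len(matches)}"
        )
    return matches[0]


def position_to_column(aligned_ref, gap_chars="-"):
    """Map 1-based reference positions to 0-based alignment columns."""
    mapping = {}
    pos = 0
    for col, char in enumerate(aligned_ref):
        # Only non-gap characters advance the reference coordinate.
        if char not in gap_chars:
            pos += 1
            mapping[pos] = col
    return mapping
```

With a mapping like this, each site position from the VCF can be translated to the alignment column that should be masked in every sequence.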

All options are listed with -h and should be self-explanatory. Hope that helps!
