Masking strategies for SARS-CoV-2 alignments

Masking strategies for SARS-CoV-2 alignments

Nicola De Maio*, Conor Walker, Rui Borges, Lukas Weilguny, Greg Slodkowicz, Nick Goldman
*[email protected]

In recent weeks, we have seen many analyses of SARS-CoV-2 genome alignments being published or posted as preprints. Some authors, including us (Issues with SARS-CoV-2 sequencing data), have raised questions regarding the trustworthiness of some of the columns of these alignments.

We would like to propose this space for open discussion of new and possible strategies for masking SARS-CoV-2 genome alignment columns. As far as we can tell, this is still an open question, and we welcome suggestions and comments.

We also want to propose a common format for sharing and more easily using and combining such filtering strategies. We include in this post our proposed masking sites (and those so far recommended by @matthew.parker in our post) in VCF format. This format (see below) succinctly summarizes which positions of the genome are of relevance, their associated variants, and the reasons why such entries are in the file.

Suggestions and further additions of purported problematic sites are very gratefully received. We will update our VCF with contributions from other groups and from our own further analyses. We also invite other groups to propose their own masking files/strategy, where they significantly differ from ours, for others to use in their own downstream analyses and to ease replication of results/analyses from third parties interested in testing different masking strategies.

Our most recent VCF file can be accessed here. (Archive versions are kept here.)
Additionally, a human-friendly (markdown) version is available here.

The content of the VCF file is described below:

Comment lines within the VCF file begin with “##”.

Non-comment lines have the following columns:

  • #CHROM: chromosome (in this case MN908947.3, the name of the reference genome used).
  • POS: position within the reference (in our case MN908947.3) to which the line refers to.
  • ID: column is included to comply with vcf standards but is unused for now (all entries are “.”).
  • REF: reference allele at this position.
  • ALT: alternative alleles at this position, separated by commas (IUPAC ambiguity code).
  • QUAL: for now unused.
  • FILTER: masking recommendation.
    • “mask”: sites that we recommend masking
    • “caution”: sites that we do not recommend masking, but for which we advise caution due to preliminary analyses showing high homoplasy or other characteristics that may mislead phylogenetic/phylodynamic analyses.
  • INFO: semicolon-separated key=value pairs of metadata for each site, including:
    • SUB: the person/lab who proposed the filtering. Currently these are:
    • EXC: list of reasons for including the entry. Current reasons for recommending mask/caution are:
      • “seq_end”: alignment ends are affected by low coverage and high error rates.
      • “ambiguous”: sites which show an excess of ambiguous basecalls relative to the number of alternative alleles, often emerging from a single country or sequencing laboratory.
      • “amended”: previous sequencing errors which now appear to have been fixed in the latest versions of the GISAID sequences, at least in sequences from some of the sequencing laboratories.
      • “highly_ambiguous”: sites with a very high proportion of ambiguous characters, relative to the number of alternative alleles.
      • “highly_homoplasic”: positions which are extremely homoplasic - it is sometimes not necessarily clear if these are hypermutable sites or sequencing artefacts.
      • “homoplasic”: homoplasic sites, with many mutation events needed to explain a relatively small alternative allele count.
      • “interspecific_contamination”: cases (so far only one instance) in which the known sequencing issue is due to contamination from genetic material that does not have SARS-CoV-2 origin.
      • “nanopore_adapter”: cases in which the known sequencing issue is due to the adapter sequences in nanopore reads.
      • “narrow_src”: variants which are found in sequences from only a few sequencing labs (usually two or three), possibly as a consequence of the same artefact reproduced independently.
      • “neighbour_linked”: proximal variants displaying near perfect linkage.
      • “single_src”: only observed in samples from a single laboratory.
    • SRC_COUNTRY: source country/countries of samples with the variant.
    • SRC_LAB: source laboratory/laboratories of samples with the variant, ordered to match the respective values in SRC_COUNTRY.
    • GENE: gene within which the variant occurs (if any).
    • AA_POS: amino acid position within the gene of the considered variant.
    • AA_REF: reference amino acid.
    • AA_ALT: alternative amino acid alleles separated by commas (one entry for each value in column “ALT”; IUPAC ambiguity code).
2 Likes

UPDATE: Post above mildly edited on 27/5/20, to reflect updated results from new analyses and more data. The most recent VCF will always be here (full VCF) and here (human-friendly), and the archive will remain here.

@goldman, @NicolaDeMaio thanks for sharing this! I’m interested in apply this mask in an analysis that I’m currently working on. Do you have recommendations for software that can be used to apply a VCF mask like this to an alignment? I’m imagining providing your VCF and my aligned fasta file as input, and getting a fasta file as output with the masked positions removed.

Hi @gregcaporaso, the following Python 3 script should do this for you:
https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/src/mask_alignment_using_vcf.py

It requires an input alignment in FASTA format (specified with -i), an input VCF used for masking (specified with -v), and an output filename (specified with -o). By default, only “mask” recommendation sites are masked with “N”, but sites we tag as “caution” can also be included with either --both or --caution. For example:

python mask_alignment_using_vcf.py -i input_msa.fasta -o output_masked_msa.fasta -v problematic_sites_sarsCov2.vcf

Optionally, you can also specify alternate mask characters (with -n, default “N”), or that mask sites should be removed (with -d).

Please note that the reference SARS-CoV-2 genome sequence needs to be included in the alignment, and will be used to assign masking annotations to the correct alignment columns. The reference sequence record is identified from the FASTA headers. By default we assume that “MN908947” is contained within one of the headers and we use this string to identify the reference sequence record. If this string is not in the header for the reference sequence, you can specify an alternate identifying string with --reference_id.

All options should be self explanatory (with -h), hope that helps!

1 Like

Thanks @conorwalker! I’ll try this out today and let you know if I run into any issues. Is your issue tracker the best way to follow up if so?

@goldman initiated a DM with me this morning and I replied with some thoughts about the proposal to use VCF as a common format for alignment masks. I’ll move them over to this thread though in case you or others have thoughts and since the initial post in this topic mentions wanting to initiate a discussion.

I’ve been poking around a little bit this morning to see if there are any existing formats for defining masks and I’m not seeing much so I think you might be onto something. I’m see some information about this in the Geneious manual here (I’ve never used Geneious myself). They’re talking about defining masks as “Nexus-style CharSets”, which seems like it can be be specified as ranges (e.g., 1-3 6 8-9) or a binary vector (e.g., 1110010110) where in both of those examples, alignment positions 4, 5, 7, and 10 would be retained post-masking. This is not nearly as informative as your proposed VCF format but it has the benefit of being easier to parse than VCF (not that that’s a good reason to use it, just trying to think through pros and cons).

The only mask format I’ve used before is the “Lanemask” for the Greengenes 16S reference database. This is simply a vector of 1s and 0s with length equal to the number of positions in the alignment. The Lanemask is represented as a single line in a file. So this seems like a variant of the Nexus CharSet concept (though the meaning of 1 versus 0 is reversed I think).

A big drawback to both of these is that the positions to mask are defined as alignment positions, which is extremely fragile. Your approach of specifying a reference sequence is much better.

I also like the idea of multiple levels proposed here (caution versus mask), and it’s useful to record information describing why a position is being masked. I can imagine a software interface where you define a level that you want to mask at, and that level and all “higher” levels are masked. For example, a parameter like --mask-level caution would strip all positions that are tagged as caution or mask, while --mask-level mask would strip only the positions tagged as mask. This could be really helpful for benchmarking purposes. I think it might make sense for the multiple levels idea to be generalized to numbers, so additional levels are easy to define.

Overall I like the proposal here. Two real questions about it though:

  1. Is anyone aware of an existing format that should be used (avoiding defining yet another bioinformatics file format is always good if possible)?
  2. Is there a simpler format that could be used to facilitate parsing by different tools and creation of these files? It seems like some information isn’t necessarily needed here (e.g., ID and QUAL aren’t used, things like REF and ALT don’t seem essential for masking, and it seems like #CHROME will always have the same value). If different groups end up creating these files I could see people starting to fill in some of the non-essential fields with placeholder values, which can get confusing for users. A simple tab-separated text file that just contains the essential information (which I realize would be a new file format) might be easier. I see the essential information as what you have in the #CHROME, POS, FILTER, EXC, and maybe INFO.

If we had a good format for this I could see it being used in other places (e.g., I work on the QIIME 2 microbiome bioinformatics platform - a good format defining masks for marker sequence alignments such as 16S or ITS could be really helpful for the work that we do).

Hi @gregcaporaso.

First off, any issues with the code that @conorwalker posted can indeed be addressed through the github issue tracker.

Regarding formats for masking: we don’t know of any formats that cover the things you’re talking about. Conceptually at least, there could be site-wise, sequence-wise, and site X sequence-wise masks you might want to use. There are indeed issues about indexing to a given alignment or a reference. Our VCF tries to capture everything we “know” about certain sites, and in that sense does have all the available info. So it seems to me that the missing link is general code to mask an alignment from a VCF, or else is code to convert a VCF to something else, and then further code to mask an alignment based on that. Personally, I’d need convincing that the latter approach was better, but there may well be use-cases I haven’t thought of.

Cheers,

@goldman

Thanks @goldman, and thanks again for the code @conorwalker! I ran into one very minor issue with the code, but was able to fix it. I submitted a pull request with my modification here. Other than that, I didn’t notice any issues.

I will likely end up writing some code for using this VCF-based mask for use in QIIME 2-based workflows. I’ll follow up when I have that in case it’s useful for others here.

Thanks for sharing your alignment mask and initiating this work on defining a format for these!

Code now updated as per your suggestions. Glad it’s useful. Please keep in touch about where you go with this!

Thanks @goldman, I’ll keep in touch about it.

Also, I realized that I haven’t really mentioned why I’m interested in this, beyond just trying the mask out on my data. A research interest of mine is ensuring retrospective reproducibility of computational steps in biological/genomics/microbiome/etc studies - in other words, building software that records its own notes on what you did, so that you or someone else could look back and reproduce a workflow even if you didn’t keep detailed notes about what you did or inadvertently left out a key piece. (This is what our retrospective data provenance tracking does in QIIME 2 - see an example here. The Provenance tab on that page shows in detail all of the steps that led to the creation of that interactive plot. Click on the boxes and circles in the Provenance tab to see details on the actions and the data, respectively.)

In my work on SARS-CoV-2, masking seems to be a step that is hard to track this information on as it’s often a very manual process (open an alignment in an alignment editor, remove alignment columns that don’t look right, maybe move some bases around, save as a fasta file, move on with your workflow). While it’d be great if we could just automate that whole process away (e.g., do it just by identifying very high entropy alignment columns and/or columns with a very high gap frequency), it seems like we’re not there yet and we should be looking at our alignments before moving on anyway. In lieu of that, having a standard format (like what you’re proposing here) for describing a mask is an essential part of being able to record information about what was done so that we can reproduce it. An alignment editor could create this (kind of like an undo history in a spreadsheet editor), and we can have tools that apply it like the script your team has provided here.

Anyway, if other folks are interested in this, definitely reach out! Maybe it’s something we could collaborate on to advance SARS-CoV-2 and other research.

Updated analysis with data from 12th June 2020

We have now updated our masking recommendations as a consequence of more data being available and using improved methodology. See the corresponding post here: Issues with SARS-CoV-2 sequencing data.

In addition to more positions being included, we now also include some additional tags for the column “EXC”. The tag amended represents the case in which sequencing errors now appear to have been fixed in the latest versions of the GISAID sequences. narrow_src marks the scenario in which a variant is found in sequences from a few sequencing labs (usually two or three) possibly as a consequence of the same artefact reproduced independently. ambiguous refers to positions that have a moderately high number of ambiguity characters (fewer than positions marked as highly_ambiguous). interspecific_contamination refers to the case (so far only one instance) in which the known sequencing issue is due to contamination from genetic material that does not have SARS-CoV-2 origin. nanopore_adapter refers to the case in which the known sequencing issue is due to the adapter sequences in nanopore reads.

The original post from 14/5/20 was mildly edited again, to reflect changes to the VCF content because of these updates. The most recent VCF will always be here (full VCF) and here (human-friendly) , and the archive will remain here .

Update: we have introduced some changes to the format of our VCF, and it should now be in compliance with the VCF v4.3 specification. This update allows standard tools to parse our files without producing warnings/errors (tested using PyVCF and BCFtools), so it should now be easier to incorporate our recommendations into analysis pipelines.

This post has been updated to reflect these changes.

As a further update, we now include results from analyses using data from the 22nd of July 2020.
As usual, the current file with our masking recommendations is available here.
The archive of past versions is here.

New updated masking recommendations, based on GISAID data from the 13th of November 2020, are now available.
As before, the current file with the masking recommendations is available here and the archive of past versions is here.

We have now updated our masking recommendations, based on GISAID data from the 4th of March 2021.
The current file with the masking recommendations is still available here and the archive of past versions here.

So, to confirm, the input file is an aligned fasta (unmasked) file and not an unaligned fasta file, correct?
Thanks.

why not mask the entire 5’ and 3’ ends? they are often not used for phylogenetics/phylodynamics.

why “N” and not lower case?

Hi @mscotch ,

We do require an alignment in input, and if you want to use our VCF file you need sequences aligned to the Wuhan reference, for example using MAFFT.

You can change the masking character from “N” to any other using the -n option (see previous replies above).

Regarding 5’ and 3’ ends, we do recommend masking of the first 55 and last 100 positions of the genome, see our masking recommendations: https://github.com/W-L/ProblematicSites_SARS-CoV2/blob/master/problematic_sites_sarsCov2.vcf .

Best wishes,
Nicola