Updated analysis with data from 12th June 2020
We have updated our masking recommendations because more data are now available and we are using an improved methodology. See the corresponding post here: Issues with SARS-CoV-2 sequencing data.
In addition to including more positions, we have added some tags for the “EXC” column. The tag “amended” marks cases in which the sequencing errors now appear to have been fixed in the latest versions of the GISAID sequences. “narrow_src” marks cases in which a variant is found in sequences from only a few sequencing labs (usually two or three), possibly because the same artefact was reproduced independently. “ambiguous” refers to positions with a moderately high number of ambiguity characters (fewer than at positions marked “highly_ambiguous”). “interspecific_contamination” refers to the case (so far only one instance) in which the known sequencing issue is due to contamination by genetic material that is not of SARS-CoV-2 origin. “nanopore_adapter” refers to cases in which the known sequencing issue is due to adapter sequences in nanopore reads.
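For readers who want to group positions by these tags, here is a minimal Python sketch. It assumes the tags are stored as a comma-separated list under an EXC key in the INFO field of a standard eight-column VCF. That layout is an assumption here, so check it against the actual file. The function names and the example records (contig name, positions, bases, and the NOTE key) are hypothetical and made up purely for illustration.

```python
from collections import defaultdict


def exc_tags(info):
    """Return the set of EXC tags found in a VCF INFO string.

    Assumes INFO looks like 'KEY=v1,v2;OTHER=...' (an assumption about the file).
    """
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "EXC":
            return {tag for tag in value.split(",") if tag}
    return set()


def positions_by_tag(vcf_lines):
    """Map each EXC tag to the sorted list of positions that carry it."""
    by_tag = defaultdict(list)
    for line in vcf_lines:
        # Skip blank lines and VCF header/meta lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        pos, info = int(cols[1]), cols[7]  # standard VCF: POS is column 2, INFO is column 8
        for tag in exc_tags(info):
            by_tag[tag].append(pos)
    return {tag: sorted(positions) for tag, positions in by_tag.items()}


# Hypothetical records, not taken from the real VCF.
EXAMPLE = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ref\t100\t.\tA\t.\t.\t.\tEXC=ambiguous,narrow_src",
    "ref\t200\t.\tC\t.\t.\t.\tEXC=nanopore_adapter;NOTE=x",
    "ref\t300\t.\tG\t.\t.\t.\tEXC=amended",
]

if __name__ == "__main__":
    for tag, positions in sorted(positions_by_tag(EXAMPLE).items()):
        print(tag, positions)
```

To use it on a real file, pass an open file handle (or a list of lines) to positions_by_tag and then pick out the tags you want to treat differently, for example skipping positions tagged “amended” if you are working only with the latest sequences.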
The original post from 14/5/20 has again been lightly edited to reflect changes to the VCF content resulting from these updates. The most recent VCF will always be here (full VCF) and here (human-friendly), and the archive will remain here.