Issues with SARS-CoV-2 sequencing data

liam.shaw · May 8, 2020, 3:45pm

Thanks for the detailed report Nicola et al. and the great discussion from others.

We recently finished a similar analysis of recurrent mutations (see [1], in press at doi: 10.1016/j.meegid.2020.104351). The approach was similar: we used MPBoot to get a maximum parsimony (MP) tree and then ran HomoplasyFinder, before doing some filtering, and also looked at available SRA data.

The filtering approach we came up with attempts to remove potentially suspect sites from GISAID assemblies using a combination of metadata and a RAxML tree. The basic principle is the kind of ‘high’ / ‘low’ sort @sergei.pond suggests: we provide a list of all recurrent mutations after excluding pre-determined suspect sites, and then a smaller ‘filtered’ list (n=198) of those that satisfy a set of thresholds based on the following parameters:

Number of isolates with homoplasy
Proportion of isolates with homoplasy which have a nearest neighbour in the RAxML tree with the homoplasy: ranges between 0 (singleton isolates with homoplasy throughout tree) and 1 (clusters of isolates with homoplasy)
Proportion of isolates with the homoplasy which have at least one ‘N’ in the local region around the homoplasy
Number of submitting and originating labs
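
As a rough illustration of the third parameter, here is a minimal sketch of how the proportion of isolates with an ‘N’ near the homoplasy could be computed. It assumes the isolates carrying the homoplasy are available as aligned sequence strings and that positions are 1-based; the function name and the window half-width are hypothetical and are not taken from our pipeline.

```python
def proportion_with_n_nearby(sequences, position, window=5):
    """Fraction of sequences with at least one 'N' near a site.

    sequences: aligned sequences (strings) of isolates carrying the homoplasy
    position:  1-based alignment coordinate of the homoplasic site
    window:    half-width of the local region in bases (hypothetical default)
    """
    if not sequences:
        return 0.0
    # Convert the 1-based position to a 0-based slice covering position +/- window.
    start = max(0, position - 1 - window)
    end = position + window  # slice end is exclusive
    hits = sum(1 for seq in sequences if "N" in seq[start:end].upper())
    return hits / len(sequences)
```

A value near 1 would mean that most isolates carrying the homoplasy also have ambiguous bases close by, which is one reason to be suspicious of the site.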

See the paper and https://github.com/liampshaw/CoV-homoplasy-filtering/ for more detail and the actual values we chose.
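
To make the idea concrete, here is a minimal sketch of what applying such thresholds could look like. Everything in it is an assumption for illustration: the class and field names are hypothetical, the threshold values are placeholders rather than the ones we chose, and the direction of each comparison is my guess at a sensible default. The repository above is the place to look for the real thing.

```python
from dataclasses import dataclass


@dataclass
class HomoplasySite:
    position: int             # nucleotide position of the recurrent mutation
    n_isolates: int           # number of isolates with the homoplasy
    prop_neighbour: float     # proportion whose nearest neighbour also has it (0 to 1)
    prop_with_n: float        # proportion with an 'N' in the local region (0 to 1)
    n_submitting_labs: int    # number of distinct submitting labs
    n_originating_labs: int   # number of distinct originating labs


# Placeholder thresholds, NOT the published values.
MIN_ISOLATES = 5
MIN_PROP_NEIGHBOUR = 0.1
MAX_PROP_WITH_N = 0.2
MIN_LABS = 2


def passes_filter(site):
    """Return True if a site satisfies all of the placeholder thresholds."""
    return (
        site.n_isolates >= MIN_ISOLATES
        and site.prop_neighbour >= MIN_PROP_NEIGHBOUR
        and site.prop_with_n <= MAX_PROP_WITH_N
        and site.n_submitting_labs >= MIN_LABS
        and site.n_originating_labs >= MIN_LABS
    )


def filter_sites(sites):
    """Keep only the sites that pass every threshold."""
    return [site for site in sites if passes_filter(site)]
```

Sites that fail any single check are dropped from the ‘filtered’ list but would still appear in the full list of recurrent mutations.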

I note that the ‘filtered’ list we gave at the time of that publication included the nucleotide positions 24389-24390 i.e. S943P (that is an incredibly good adapter spot @matthew.parker). So such filtering thresholds definitely do not magically remove all sequencing errors!

Hopefully as a community effort we can keep adding to the list of these suspect positions.

References
[1] L. van Dorp et al. “Emergence of genomic diversity and recurrent mutations in SARS-CoV-2”, Infection, Genetics and Evolution (2020) 104351.