Issue with pipelines using bcftools to call consensus in low-coverage regions

Note: Adding this here as a reference since this was an issue with earlier versions of the Illumina pipeline and can lead to spurious reversions to reference bases.

bcftools consensus builds a consensus sequence by “applying” called variants to a reference sequence. However, the alignment file might have regions of low coverage, for example due to amplicon dropout, where the depth is not sufficient to call variants reliably. If such regions are not properly masked (typically with N), the generated consensus will contain reference bases there in place of any real variants that might be present in the “true” consensus sequence. To avoid this issue, mask the low-coverage regions using tools like bedtools genomecov + bedtools maskfasta, and supply the masked reference sequence to bcftools consensus so that it calls a reliable consensus sequence.
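To make the failure mode concrete, here is a minimal pure-Python sketch of the masking idea. It is only an illustration, not what bedtools or bcftools do internally: the helper names (`low_coverage_intervals`, `mask_sequence`, `apply_snps`), the toy reference, the per-base depths, and the depth threshold of 10 are all assumptions made up for this example, and only simple SNPs are handled.

```python
from typing import Dict, List, Tuple


def low_coverage_intervals(depths: List[int], min_depth: int) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) intervals where depth < min_depth."""
    intervals = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos  # open a new low-coverage run
        elif start is not None:
            intervals.append((start, pos))  # close the current run
            start = None
    if start is not None:
        intervals.append((start, len(depths)))  # run extends to the end
    return intervals


def mask_sequence(seq: str, intervals: List[Tuple[int, int]]) -> str:
    """Replace every base inside the given intervals with N."""
    bases = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


def apply_snps(seq: str, snps: Dict[int, str]) -> str:
    """Apply single-base substitutions (0-based position -> alt base)."""
    bases = list(seq)
    for pos, alt in snps.items():
        bases[pos] = alt
    return "".join(bases)


# Toy data (hypothetical): positions 3-6 have dropped out.
reference = "ACGTACGTAC"
depths = [30, 28, 25, 3, 2, 0, 1, 22, 27, 31]
MIN_DEPTH = 10  # assumed threshold; choose one appropriate for your data

# Only the well-covered variant at position 1 was called. Any real variant
# inside the dropout region could not be called and is absent from the calls.
called_snps = {1: "T"}

low_cov = low_coverage_intervals(depths, MIN_DEPTH)
unmasked_consensus = apply_snps(reference, called_snps)
masked_consensus = apply_snps(mask_sequence(reference, low_cov), called_snps)

print(low_cov)
print(unmasked_consensus)  # dropout region silently shows reference bases
print(masked_consensus)    # dropout region is reported as N
```

In the unmasked result the dropout region looks like confident reference sequence, while the masked result makes the lack of evidence explicit with N.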


Thanks Karthik for writing this up. One clarification: since this behavior is inherent to bcftools consensus, the issue likely plagued (and may still plague) other consensus-calling pipelines, not just earlier versions of Illumina’s offerings.