Real-time sequencing and data release for Ebolavirus genomic surveillance in Guinea

This is a draft post and is subject to updates.

Josh Quick, Nicholas Loman, Sophie Duraffour, Lauren Cowley, Andrew Rambaut, Jared Simpson, Simon Weller, Phil Rachwal, Jamie Taylor, Daniel Turner, Stephan Gunther, Miles Carroll and the European Mobile Laboratory EVIDENT-EU project.
(Author list subject to change).

We have established a portable genomic surveillance laboratory in Coyah, Guinea employing the Oxford Nanopore MinION sequencing technology. This sequencer is “USB stick” format and the entire set of reagents to port the laboratory fits into a hard-shell flightcase and a soft bag, a huge boon for field work.

This Virological thread is intended to support rapid data release and description of Makona Ebola virus sequences from Guinea during the ongoing 2014-2015 West African outbreak. We anticipate it will be continually updated with new data and information. We aim for this thread to be a citable “pre-publication” on the sequencing data and to help stimulate community discussion and review of the datasets in terms of the ongoing outbreak in Guinea.

Methods

Viral nucleic acid is enriched and converted to cDNA by RT-PCR amplification from inactivated RNA extracts from blood samples (whole blood or serum) supplied by a network of referring diagnostic laboratories . Typically we use an 11 amplicon scheme, with most amplicons being ~1.3 - 2.4kb, in order to tile across the genome. This scheme achieves 98.4% theoretical coverage of the genome, with around 400 bases at the 3’ end of the viral genome not covered by the current amplicon scheme (we plan to address this in a future update).

After pooling them in approximately equimolar amounts, amplicon pool is cleaned with SPRI and prepared for MinION sequencing. We employ the standard gDNA sequencing kit (SQK-MAP-005.1) according to manufacturer’s protocol (Oxford Nanopore Technologies, UK) and run each pool on an individual nanopore flowcell (FLO-MAP-003). The nanopore instrument can read the full-length amplicons without need for fragmentation. Within 30-60 minutes, sufficient sequencing data has typically been produced for high (>100x) coverage of the genome. For some flowcells this time may be longer.

Nanopore data currently requires an online Internet connection to produce basecalled data via a cloud service called Metrichor. Internet access is limited in Coyah to mobile 3G Wifi hotspot connections which can be unreliable, so reduced datasets of around 200-400Mb of compressed ‘raw’ or ‘squiggle’ data are transferred to University of Birmingham for analysis. This represents a trade-off between communications speed and genome coverage. We have explored many possible avenues to improve connectivity including external 3G antennae, satellite phones and alternative mobile networks, with limited success.

The bioinformatics approach currently utilises the marginAlign and marginVariant software (Jain et al.) to detect single nucleotide variants when aligned to a reference strain from early in the outbreak (EM_079517). Due to the nature of the sequencing data, which is dominated by insertion and deletion errors, we do not attempt to call insertion or deletions, however these mutations are rarely detected in this Ebola outbreak. Variants with a posterior probability value of >=0.5 and coverage depth of >15x are accepted and a consensus is generated. Variants detected by marginAlign with a posterior probability of <= 0.5 are further checked using a signal-level algorithm, nanopolish, to ensure they have read-level support of >75%. Regions of uncertainty (for example in hard to sequence homopolymeric regions, primer binding sites) or with coverage <15x are masked with an N character. This approach gives a high true positive variant calling rate (assessed by artificially mutating the reference genome and assessing TP/FN rates), assuming sufficient genomic coverage of each amplicon. We assess the true positive rate for each sample individually and assign each sample a quality score. Those with a true positive rate (TPR, i.e. sensitivity) of >= 95% (meaning a maximum of one missed polymorphism out of 30) are assigned as ‘good’. Those with a TPR of >= 80% and less than 95% are assigned as ‘fair’. Those with TPR <=80% are assigned as poor and not released, although may be inspected manually for canonical lineage-defining mutations. TPR drops when individual amplicons fail to amplify during RT-PCR, or where coverage levels are low for that amplicon.

Data utility

The data are currently used in the following ways:

  1. To generate a phylogenetic tree of recent isolates, which are shared with the WHO national coordination team to assist in the outbreak response, by identifying putative transmission clusters, particularly new introductions into prefectures. A maximum-likelihood tree of consensusis constructed from consensus alignments using RaxML using a GTR model with 100 bootstraps, and visualised with the Microreact website. This provides a privately shareable URL that can be sent to the diagnostic laboratories, epidemiologists and others involved in the outbreak response.

  2. With a one week delay to give the epidemiologists prior knowledge of the results, these consensus sequences are shared with Trevor Bedford and Richard Nehrer’s Nextflu website, to help provide an overall view of the international outbreak.

  3. Consensus sequences are available via Github.

  4. Raw FAST5 files will soon be available via the European Nucleotide Archive in real-time.

Results

As of 5th August 2015, a total of 114 samples have been sequenced. 92 have been assigned QC status of ‘good’, 11 have been assigned QC status of ‘fair’, with the remaining 11 ‘poor’.

Appendices

The PCR scheme is as follows (coordinates relative to gi|674810549|gb|KJ660346.2|)

11 reaction scheme

#Region    Primer_ID  Sequence_(5-3')              Coords  Amplicon_size
region_1   1_F        TTTAGGATCTTTTGTGTGCGAAT      27      1911
region_1   4_R        TGGTGTCCTCGTCGTCCT           1938    1911
region_2   5_F        CGATCTAGACGAGGACGACGA        1927    1901
region_2   8_R        TGGAAAGCAGTTCCAAAACC         3828    1901
region_3   9_F        TGCCTGGTTTTGGAACTGC          3823    1895
region_3   12_R       TGCAATGAGAAAGATTGACATTTG     5718    1895
region_4   13_F       TCCTCAAATTGCCTACATGCTT       5759    1874
region_4   16_R       GCTGGCCCGAAATATGGT           7633    1874
region_5   17_F       GATGAAGGTGCTGCAATCG          7601    2406
region_5   20_R       GGGCAACTGGTATACAGCTAAAAG     10007   2406
region_6   20_F       AACCCAAACATTGACCAAAGAA       9550    1371
region_6   22_R       CCACCAGAAAACCCATGTTAGT       10921   1371
region_7   23_F       GCTCCAAGAACCCGACAAA          10944   1410
region_7   26_F       GGTTGAGGACCCAGTTTGC          12354   1410
region_8   26_F       GGTTGAGGACCCAGTTTGC          12354   1898
region_8   29_R       CCGAAATCCAGAGGTTTGC          14252   1898
region_9   30_F       CAAACCTCTGGATTTCGGAAC        14253   1427
region_9   32_R       CTCGGTATCTTGTTAAATCTAAATCCA  15680   1427
region_10  33_F       TTAACAAGATACCGAGAAAATGAATTG  15691   1396
region_10  35_R       AAGGCACCAGCACCTTCTC          17087   1396
region_11  35_F       TGATGGCACTGAACGGAGT          16632   1921
region_11  38_R       GTGTTATCAACCAAAGCACTATTCCA   18553   1921

19 reaction scheme

For degraded or hard to amplify samples, a 19 reaction scheme is used:

#Region    Primer_ID  Sequence_(5-3')              Coords  Amplicon_size
region_1   1_F        TTTAGGATCTTTTGTGTGCGAAT      27      1426
region_1   3_R        ACTCCTGCGAGGGTGCTC           1453    1426
region_2   3_F        AAGGCTTGCCTTGAGAAGGT         965     973
region_2   4_R        TGGTGTCCTCGTCGTCCT           1938    973
region_3   5_F        CGATCTAGACGAGGACGACGA        1927    952
region_3   6_R        TGTGGCTTAACGCTTATTTGC        2879    952
region_4   7_F        GCGTTAAGCCACAGTTATAGCC       2887    941
region_4   8_R        TGGAAAGCAGTTCCAAAACC         3828    941
region_5   9_F        TGCCTGGTTTTGGAACTGC          3823    940
region_5   10_R       CAGCGACACCTAGAGGAAGC         4763    940
region_6   11_F       TTGGCTTCCTCTAGGTGTCG         4760    958
region_6   12_R       TGCAATGAGAAAGATTGACATTTG     5718    958
region_7   13_F       TCCTCAAATTGCCTACATGCTT       5759    906
region_7   14_R       TGTGGTAGAATAATAGCCACTCGAC    6665    906
region_8   15_F       GGACCCGTCGAGTGGCTAT          6659    974
region_8   16_R       GCTGGCCCGAAATATGGT           7633    974
region_9   17_F       GATGAAGGTGCTGCAATCG          7601    969
region_9   18_R       GCTCGAACATGGTGGTCGT          8570    969
region_10  19_F       GGATGGACACGACCACCA           8562    1445
region_10  20_R       GGGCAACTGGTATACAGCTAAAAG     10007   1445
region_11  21_F       TGTATACCAGTTGCCCCTGAG        10015   906
region_11  22_R       CCACCAGAAAACCCATGTTAGT       10921   906
region_12  23_F       GCTCCAAGAACCCGACAAA          10944   958
region_12  24_R       TCAGGAAGAGAGCATCTTGCAT       11902   958
region_13  25_F       TGCAAGATGCTCTCTTCCTGA        11903   947
region_13  26_R       CTGAGGTAACACTGTACCAAGATCC    12850   947
region_14  26_F       GGTTGAGGACCCAGTTTGC          12354   1898
region_14  29_R       CCGAAATCCAGAGGTTTGC          14252   1898
region_15  29_F       TTGCGCTCAGCTGTGATG           13783   946
region_15  30_R       TAATGTGCGTGTTCCTTCCA         14729   946
region_16  31_F       GAGACGCCGGTTTTGGAC           14779   901
region_16  32_R       CTCGGTATCTTGTTAAATCTAAATCCA  15680   901
region_17  33_F       TTAACAAGATACCGAGAAAATGAATTG  15691   963
region_17  34_R       CATGGCTCATTTGCAGGAC          16654   963
region_18  35_F       TGATGGCACTGAACGGAGT          16632   977
region_18  36_R       TGGTGTGGCATCTTACGTGTAG       17609   977
region_19  37_F       TGGTATCTTTGTCTGACGAACTTCT    17578   975
region_19  38_R       GTGTTATCAACCAAAGCACTATTCCA   18553   975

Very nice setup, thanks for sharing Nick. It’s really great to see the MinION performing so well in the field producing realtime data! Just one quick comment on this part:

I totally understand you’re limited by the technology, but we actually see quite a few - and potentially interesting - indels cropping up in the EBOV genome during the outbreak. We know about the indels in the GP causing framshifts, but we observe quite a few similar indel ‘hotspots’ scattered around the genome in poly-A tracts. Not sure what the biological significance is, but could be interesting and important to monitor. We also see some a few classical indels causing framshifts - presumably these are ‘passenger’ viruses hitching a ride with in-frame viruses. Do you know if there would be any way you could correct your data to capture these? E.g. strand-bias filters, etc?

Hi Kristian! Thanks for taking the time to comment here, nice to hear from you!

Right now I am not very optimistic that we can do any useful calling in homopolymer regions. The nanopore instrument detects electrical signals from k-mer transitions through the pore. Currently these transitions are modelled as 5-mers. It is likely that actually more than 5 bases are in the pore at a time, but the 5-mer model is computationally tractable for base calling. Because the basecalling software does not use any information about the time between transitions, there really isn’t much of a chance of calling a homopolymer >5 successfully. Having said that, alternative k-mer models may be available soon that could help, and if we could model that ‘dwell’ time, perhaps there is a chance of accessing the information in the future.

These mutations do sound biologically potentially rather interesting. My only worry about using this information for epidemiology is that I wonder whether these are sites that are particularly prone to homoplasies. Also could they result from some kind of polymerase slippage and therefore could they also be introduced during RT-PCR?

Nick

Hi Nick. I see - yes, it sounds like calling homopolymers is going to be exceptionally difficult in this situation - I wasn’t too familiar with base calling from nanopore sequencing, so thanks for the primer ;-).

I should have mentioned that these indels are rarely (ever?) fixed in the viral population, but rather exist as minor intra-host populations (typically around 1-5%). I too was worried about RT-PCR errors, but that doesn’t appear to be the case here - the results are consistent within single individuals but not necessarily shared across individuals although the sequence context is identical or nearly identical. If these mutations were caused by RT-PCR errors then we would expect to see them either systematically across individuals, or find them not to replicate within the same individual.

All that said, I haven’t actually looked closely at these indels, but have been thinking about doing a more systematic study. I’m wondering if @dpark has some more recent analyses from the Broad team?

No problem :slight_smile:

It is interesting. I always wondered if single-molecule unique barcode tagging would be a way of determining what are RT-PCR induced errors versus genuine biological variation. Perhaps soon we can sequence the RNA directly on nanopore, I believe some progress has been made with directly sequencing RNA:cDNA duplex molecules. A whole new world!?

I have updated this page today to provide links to the Github repository for consensus sequences, and also further details on bioinformatics methods.

An updated dataset is available via Github with 98 sequences marked as ‘good’ and 11 sequences marked as ‘fair’. The most recent sequence is from 3rd August 2015.

Please access via http://github.com/nickloman/ebov/

An updated dataset is available via Github that takes real-time surveillance data up to 24th August in Guinea. Just two subclusters (distance cut-off 0.0003) are currenty detectable in August.

Please access via http://github.com/nickloman/ebov/