This is a draft post and is subject to updates.
Josh Quick, Nicholas Loman, Sophie Duraffour, Lauren Cowley, Andrew Rambaut, Jared Simpson, Simon Weller, Phil Rachwal, Jamie Taylor, Daniel Turner, Stephan Gunther, Miles Carroll and the European Mobile Laboratory EVIDENT-EU project.
(Author list subject to change).
We have established a portable genomic surveillance laboratory in Coyah, Guinea, employing the Oxford Nanopore MinION sequencing technology. The sequencer has a “USB stick” form factor, and the entire set of reagents needed to move the laboratory fits into a hard-shell flight case and a soft bag, a huge boon for field work.
This Virological thread is intended to support rapid data release and description of Makona Ebola virus sequences from Guinea during the ongoing 2014-2015 West African outbreak. We anticipate it will be continually updated with new data and information. We aim for this thread to be a citable “pre-publication” on the sequencing data and to help stimulate community discussion and review of the datasets in terms of the ongoing outbreak in Guinea.
Viral nucleic acid is enriched and converted to cDNA by RT-PCR amplification of inactivated RNA extracts from blood samples (whole blood or serum) supplied by a network of referring diagnostic laboratories. We typically use an 11 amplicon scheme, with most amplicons ~1.3-2.4 kb, to tile across the genome. This scheme achieves 98.4% theoretical coverage of the genome; around 400 bases at the 3’ end of the viral genome are not covered by the current amplicon scheme (we plan to address this in a future update).
The amplicons are pooled in approximately equimolar amounts, and the pool is cleaned with SPRI and prepared for MinION sequencing. We employ the standard gDNA sequencing kit (SQK-MAP-005.1) according to the manufacturer’s protocol (Oxford Nanopore Technologies, UK) and run each pool on an individual nanopore flowcell (FLO-MAP-003). The nanopore instrument can read the full-length amplicons without the need for fragmentation. Within 30-60 minutes, sufficient sequencing data has typically been produced for high (>100x) coverage of the genome. For some flowcells this may take longer.
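As an illustration of what approximately equimolar pooling involves, the sketch below converts a measured mass concentration into a molar concentration and derives the volume of each amplicon to pool. The post does not describe how amplicons are quantified, so the concentrations, the target amount and the function names are hypothetical, and the figure of 660 g/mol per base pair is the usual approximation for double-stranded DNA.

```python
# Minimal sketch of equimolar pooling arithmetic (not the authors' protocol).
# Assumption: ~660 g/mol per base pair of double-stranded DNA.
DSDNA_G_PER_MOL_PER_BP = 660.0


def fmol_per_ul(conc_ng_per_ul: float, length_bp: int) -> float:
    """Convert a mass concentration (ng/ul) into a molar one (fmol/ul)."""
    if conc_ng_per_ul <= 0 or length_bp <= 0:
        raise ValueError("concentration and length must be positive")
    # ng -> g is 1e-9 and mol -> fmol is 1e15, giving a net factor of 1e6.
    return conc_ng_per_ul * 1e6 / (length_bp * DSDNA_G_PER_MOL_PER_BP)


def equimolar_volumes(amplicons: dict, target_fmol: float) -> dict:
    """Volume (ul) of each amplicon that contains target_fmol of product.

    amplicons maps a name to (concentration in ng/ul, length in bp).
    """
    return {
        name: target_fmol / fmol_per_ul(conc, length)
        for name, (conc, length) in amplicons.items()
    }


if __name__ == "__main__":
    # Lengths come from the 11 reaction scheme; concentrations are made up.
    example = {
        "region_1": (12.0, 1911),
        "region_5": (8.5, 2406),
        "region_6": (20.0, 1371),
    }
    for name, volume in equimolar_volumes(example, target_fmol=50.0).items():
        print(f"{name}: {volume:.2f} ul")
```

At equal mass concentration a longer amplicon contains fewer molecules per microlitre, so proportionally more volume is needed to contribute the same number of molecules to the pool.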
Basecalling nanopore data currently requires an Internet connection to a cloud service called Metrichor. Internet access in Coyah is limited to mobile 3G Wi-Fi hotspot connections, which can be unreliable, so reduced datasets of around 200-400Mb of compressed ‘raw’ or ‘squiggle’ data are transferred to the University of Birmingham for analysis. This represents a trade-off between communications speed and genome coverage. We have explored many possible avenues to improve connectivity, including external 3G antennae, satellite phones and alternative mobile networks, with limited success.
The bioinformatics approach currently uses the marginAlign and marginVariant software (Jain et al.) to detect single nucleotide variants in reads aligned to a reference strain from early in the outbreak (EM_079517). Because errors in the sequencing data are dominated by insertions and deletions, we do not attempt to call insertions or deletions; however, such mutations are rarely detected in this Ebola outbreak. Variants with a posterior probability of >=0.5 and coverage depth of >15x are accepted and a consensus is generated. Variants detected by marginAlign with a posterior probability of <0.5 are further checked using a signal-level algorithm, nanopolish, to ensure they have read-level support of >75%. Regions of uncertainty (for example hard-to-sequence homopolymeric regions and primer binding sites) or with coverage <15x are masked with an N character.
This approach gives a high true positive variant calling rate (assessed by artificially mutating the reference genome and measuring TP/FN rates), assuming sufficient coverage of each amplicon. We assess the true positive rate (TPR, i.e. sensitivity) for each sample individually and assign each sample a quality score. Samples with a TPR of >=95% (meaning a maximum of one missed polymorphism out of 30) are rated ‘good’, those with a TPR of >=80% and <95% are rated ‘fair’, and those with a TPR of <80% are rated ‘poor’ and not released, although they may be inspected manually for canonical lineage-defining mutations. TPR drops when individual amplicons fail to amplify during RT-PCR, or when coverage is low for an amplicon.
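The filtering and grading rules above can be summarised in a short sketch. This is an interpretation of the thresholds stated in the text, not the pipeline's actual code: the function names are hypothetical, the treatment of a depth of exactly 15x and of variants that fail the nanopolish check is an assumption, and the inputs are assumed to have been extracted already from marginVariant and nanopolish output.

```python
from typing import Optional, Set

# Thresholds quoted in the text.
MIN_POSTERIOR = 0.5      # variants at or above this are accepted
MIN_DEPTH = 15           # coverage must exceed this; lower coverage is masked
MIN_READ_SUPPORT = 0.75  # read-level support needed for low-posterior variants


def classify_site(posterior: float, depth: int,
                  read_support: Optional[float] = None) -> str:
    """Return 'accept', 'mask' or 'uncertain' for a candidate variant."""
    # Assumption: a depth of exactly 15x is masked (the text accepts >15x
    # and masks <15x, leaving the boundary unspecified).
    if depth <= MIN_DEPTH:
        return "mask"
    if posterior >= MIN_POSTERIOR:
        return "accept"
    # Low-posterior variants need signal-level (read-level) support.
    if read_support is not None and read_support > MIN_READ_SUPPORT:
        return "accept"
    # Assumption: the text does not say how failed variants are handled.
    return "uncertain"


def true_positive_rate(introduced: Set[int], recovered: Set[int]) -> float:
    """Sensitivity: the fraction of introduced mutations that were called."""
    if not introduced:
        raise ValueError("no mutations were introduced")
    true_positives = len(introduced & recovered)
    false_negatives = len(introduced - recovered)
    return true_positives / (true_positives + false_negatives)


def qc_grade(tpr: float) -> str:
    """Map a per-sample TPR to the QC labels used in the text."""
    if tpr >= 0.95:
        return "good"
    if tpr >= 0.80:
        return "fair"
    return "poor"


if __name__ == "__main__":
    introduced = set(range(30))  # 30 artificial mutation positions
    recovered = set(range(29))   # one of them was missed
    print(qc_grade(true_positive_rate(introduced, recovered)))
```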
The data are currently used in the following ways:
To generate a phylogenetic tree of recent isolates, which is shared with the WHO national coordination team to assist in the outbreak response by identifying putative transmission clusters, particularly new introductions into prefectures. A maximum-likelihood tree is constructed from consensus alignments using RAxML with a GTR model and 100 bootstraps, and visualised with the Microreact website. This provides a privately shareable URL that can be sent to the diagnostic laboratories, epidemiologists and others involved in the outbreak response.
With a one-week delay to give the epidemiologists prior knowledge of the results, these consensus sequences are shared with Trevor Bedford and Richard Neher’s Nextflu website, to help provide an overall view of the international outbreak.
Consensus sequences are available via Github.
Raw FAST5 files will soon be available via the European Nucleotide Archive in real time.
As of 5th August 2015, a total of 114 samples have been sequenced. 92 have been assigned QC status of ‘good’, 11 have been assigned QC status of ‘fair’, with the remaining 11 ‘poor’.
The PCR scheme is as follows (coordinates relative to gi|674810549|gb|KJ660346.2|)
11 reaction scheme
#Region    Primer_ID  Sequence_(5-3')              Coords  Amplicon_size
region_1   1_F        TTTAGGATCTTTTGTGTGCGAAT      27      1911
region_1   4_R        TGGTGTCCTCGTCGTCCT           1938    1911
region_2   5_F        CGATCTAGACGAGGACGACGA        1927    1901
region_2   8_R        TGGAAAGCAGTTCCAAAACC         3828    1901
region_3   9_F        TGCCTGGTTTTGGAACTGC          3823    1895
region_3   12_R       TGCAATGAGAAAGATTGACATTTG     5718    1895
region_4   13_F       TCCTCAAATTGCCTACATGCTT       5759    1874
region_4   16_R       GCTGGCCCGAAATATGGT           7633    1874
region_5   17_F       GATGAAGGTGCTGCAATCG          7601    2406
region_5   20_R       GGGCAACTGGTATACAGCTAAAAG     10007   2406
region_6   20_F       AACCCAAACATTGACCAAAGAA       9550    1371
region_6   22_R       CCACCAGAAAACCCATGTTAGT       10921   1371
region_7   23_F       GCTCCAAGAACCCGACAAA          10944   1410
region_7   26_F       GGTTGAGGACCCAGTTTGC          12354   1410
region_8   26_F       GGTTGAGGACCCAGTTTGC          12354   1898
region_8   29_R       CCGAAATCCAGAGGTTTGC          14252   1898
region_9   30_F       CAAACCTCTGGATTTCGGAAC        14253   1427
region_9   32_R       CTCGGTATCTTGTTAAATCTAAATCCA  15680   1427
region_10  33_F       TTAACAAGATACCGAGAAAATGAATTG  15691   1396
region_10  35_R       AAGGCACCAGCACCTTCTC          17087   1396
region_11  35_F       TGATGGCACTGAACGGAGT          16632   1921
region_11  38_R       GTGTTATCAACCAAAGCACTATTCCA   18553   1921
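The coordinates of the 11 reaction scheme can be used to check how the amplicons tile the genome. The sketch below merges the amplicon intervals and reports the covered span and any gaps between neighbouring amplicons. It assumes each amplicon runs from its forward-primer coordinate to its reverse-primer coordinate, which is consistent with the listed amplicon sizes but is not stated explicitly in the post. The function names are illustrative, and the genome length is deliberately left as a parameter to be supplied from the reference record.

```python
# (region, forward coordinate, reverse coordinate) copied from the
# 11 reaction scheme table; coordinates are relative to KJ660346.2.
AMPLICONS_11 = [
    ("region_1", 27, 1938),
    ("region_2", 1927, 3828),
    ("region_3", 3823, 5718),
    ("region_4", 5759, 7633),
    ("region_5", 7601, 10007),
    ("region_6", 9550, 10921),
    ("region_7", 10944, 12354),
    ("region_8", 12354, 14252),
    ("region_9", 14253, 15680),
    ("region_10", 15691, 17087),
    ("region_11", 16632, 18553),
]


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


def covered_bases(merged):
    """Total length of the merged intervals."""
    return sum(end - start for start, end in merged)


def gaps(merged):
    """Uncovered (start, end) stretches between consecutive merged intervals."""
    return [(left[1], right[0]) for left, right in zip(merged, merged[1:])]


def coverage_fraction(merged, genome_length):
    """Fraction of the genome covered; genome_length must be supplied."""
    return covered_bases(merged) / genome_length


if __name__ == "__main__":
    merged = merge_intervals([(fwd, rev) for _, fwd, rev in AMPLICONS_11])
    print("merged intervals:", merged)
    print("covered bases:", covered_bases(merged))
    print("internal gaps:", gaps(merged))
```

The same functions can be applied to the 19 reaction scheme by substituting its coordinates.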
19 reaction scheme
For degraded or hard to amplify samples, a 19 reaction scheme is used:
#Region    Primer_ID  Sequence_(5-3')              Coords  Amplicon_size
region_1   1_F        TTTAGGATCTTTTGTGTGCGAAT      27      1426
region_1   3_R        ACTCCTGCGAGGGTGCTC           1453    1426
region_2   3_F        AAGGCTTGCCTTGAGAAGGT         965     973
region_2   4_R        TGGTGTCCTCGTCGTCCT           1938    973
region_3   5_F        CGATCTAGACGAGGACGACGA        1927    952
region_3   6_R        TGTGGCTTAACGCTTATTTGC        2879    952
region_4   7_F        GCGTTAAGCCACAGTTATAGCC       2887    941
region_4   8_R        TGGAAAGCAGTTCCAAAACC         3828    941
region_5   9_F        TGCCTGGTTTTGGAACTGC          3823    940
region_5   10_R       CAGCGACACCTAGAGGAAGC         4763    940
region_6   11_F       TTGGCTTCCTCTAGGTGTCG         4760    958
region_6   12_R       TGCAATGAGAAAGATTGACATTTG     5718    958
region_7   13_F       TCCTCAAATTGCCTACATGCTT       5759    906
region_7   14_R       TGTGGTAGAATAATAGCCACTCGAC    6665    906
region_8   15_F       GGACCCGTCGAGTGGCTAT          6659    974
region_8   16_R       GCTGGCCCGAAATATGGT           7633    974
region_9   17_F       GATGAAGGTGCTGCAATCG          7601    969
region_9   18_R       GCTCGAACATGGTGGTCGT          8570    969
region_10  19_F       GGATGGACACGACCACCA           8562    1445
region_10  20_R       GGGCAACTGGTATACAGCTAAAAG     10007   1445
region_11  21_F       TGTATACCAGTTGCCCCTGAG        10015   906
region_11  22_R       CCACCAGAAAACCCATGTTAGT       10921   906
region_12  23_F       GCTCCAAGAACCCGACAAA          10944   958
region_12  24_R       TCAGGAAGAGAGCATCTTGCAT       11902   958
region_13  25_F       TGCAAGATGCTCTCTTCCTGA        11903   947
region_13  26_R       CTGAGGTAACACTGTACCAAGATCC    12850   947
region_14  26_F       GGTTGAGGACCCAGTTTGC          12354   1898
region_14  29_R       CCGAAATCCAGAGGTTTGC          14252   1898
region_15  29_F       TTGCGCTCAGCTGTGATG           13783   946
region_15  30_R       TAATGTGCGTGTTCCTTCCA         14729   946
region_16  31_F       GAGACGCCGGTTTTGGAC           14779   901
region_16  32_R       CTCGGTATCTTGTTAAATCTAAATCCA  15680   901
region_17  33_F       TTAACAAGATACCGAGAAAATGAATTG  15691   963
region_17  34_R       CATGGCTCATTTGCAGGAC          16654   963
region_18  35_F       TGATGGCACTGAACGGAGT          16632   977
region_18  36_R       TGGTGTGGCATCTTACGTGTAG       17609   977
region_19  37_F       TGGTATCTTTGTCTGACGAACTTCT    17578   975
region_19  38_R       GTGTTATCAACCAAAGCACTATTCCA   18553   975