This is a draft post and is subject to updates.
Josh Quick, Nicholas Loman, Sophie Duraffour, Lauren Cowley, Andrew Rambaut, Jared Simpson, Simon Weller, Phil Rachwal, Jamie Taylor, Daniel Turner, Stephan Gunther, Miles Carroll and the European Mobile Laboratory EVIDENT-EU project.
(Author list subject to change).
We have established a portable genomic surveillance laboratory in Coyah, Guinea employing the Oxford Nanopore MinION sequencing technology. This sequencer is "USB stick" format and the entire set of reagents to port the laboratory fits into a hard-shell flightcase and a soft bag, a huge boon for field work.
This Virological thread is intended to support rapid data release and description of Makona Ebola virus sequences from Guinea during the ongoing 2014-2015 West African outbreak. We anticipate it will be continually updated with new data and information. We aim for this thread to be a citable "pre-publication" on the sequencing data and to help stimulate community discussion and review of the datasets in terms of the ongoing outbreak in Guinea.
Viral nucleic acid is enriched and converted to cDNA by RT-PCR amplification from inactivated RNA extracts from blood samples (whole blood or serum) supplied by a network of referring diagnostic laboratories . Typically we use an 11 amplicon scheme, with most amplicons being ~1.3 - 2.4kb, in order to tile across the genome. This scheme achieves 98.4% theoretical coverage of the genome, with around 400 bases at the 3' end of the viral genome not covered by the current amplicon scheme (we plan to address this in a future update).
After pooling them in approximately equimolar amounts, amplicon pool is cleaned with SPRI and prepared for MinION sequencing. We employ the standard gDNA sequencing kit (SQK-MAP-005.1) according to manufacturer's protocol (Oxford Nanopore Technologies, UK) and run each pool on an individual nanopore flowcell (FLO-MAP-003). The nanopore instrument can read the full-length amplicons without need for fragmentation. Within 30-60 minutes, sufficient sequencing data has typically been produced for high (>100x) coverage of the genome. For some flowcells this time may be longer.
Nanopore data currently requires an online Internet connection to produce basecalled data via a cloud service called Metrichor. Internet access is limited in Coyah to mobile 3G Wifi hotspot connections which can be unreliable, so reduced datasets of around 200-400Mb of compressed 'raw' or 'squiggle' data are transferred to University of Birmingham for analysis. This represents a trade-off between communications speed and genome coverage. We have explored many possible avenues to improve connectivity including external 3G antennae, satellite phones and alternative mobile networks, with limited success.
The bioinformatics approach currently utilises the marginAlign and marginVariant software (Jain et al.) to detect single nucleotide variants when aligned to a reference strain from early in the outbreak (EM_079517). Due to the nature of the sequencing data, which is dominated by insertion and deletion errors, we do not attempt to call insertion or deletions, however these mutations are rarely detected in this Ebola outbreak. Variants with a posterior probability value of >=0.5 and coverage depth of >15x are accepted and a consensus is generated. Variants detected by marginAlign with a posterior probability of <= 0.5 are further checked using a signal-level algorithm, nanopolish, to ensure they have read-level support of >75%. Regions of uncertainty (for example in hard to sequence homopolymeric regions, primer binding sites) or with coverage <15x are masked with an N character. This approach gives a high true positive variant calling rate (assessed by artificially mutating the reference genome and assessing TP/FN rates), assuming sufficient genomic coverage of each amplicon. We assess the true positive rate for each sample individually and assign each sample a quality score. Those with a true positive rate (TPR, i.e. sensitivity) of >= 95% (meaning a maximum of one missed polymorphism out of 30) are assigned as 'good'. Those with a TPR of >= 80% and less than 95% are assigned as 'fair'. Those with TPR <=80% are assigned as poor and not released, although may be inspected manually for canonical lineage-defining mutations. TPR drops when individual amplicons fail to amplify during RT-PCR, or where coverage levels are low for that amplicon.
The data are currently used in the following ways:
1) To generate a phylogenetic tree of recent isolates, which are shared with the WHO national coordination team to assist in the outbreak response, by identifying putative transmission clusters, particularly new introductions into prefectures. A maximum-likelihood tree of consensusis constructed from consensus alignments using RaxML using a GTR model with 100 bootstraps, and visualised with the Microreact website. This provides a privately shareable URL that can be sent to the diagnostic laboratories, epidemiologists and others involved in the outbreak response.
2) With a one week delay to give the epidemiologists prior knowledge of the results, these consensus sequences are shared with Trevor Bedford and Richard Nehrer's Nextflu website, to help provide an overall view of the international outbreak.
3) Consensus sequences are available via Github.
4) Raw FAST5 files will soon be available via the European Nucleotide Archive in real-time.
As of 5th August 2015, a total of 114 samples have been sequenced. 92 have been assigned QC status of 'good', 11 have been assigned QC status of 'fair', with the remaining 11 'poor'.
The PCR scheme is as follows (coordinates relative to gi|674810549|gb|KJ660346.2|)
11 reaction scheme
#Region Primer_ID Sequence_(5-3') Coords Amplicon_size
region_1 1_F TTTAGGATCTTTTGTGTGCGAAT 27 1911
region_1 4_R TGGTGTCCTCGTCGTCCT 1938 1911
region_2 5_F CGATCTAGACGAGGACGACGA 1927 1901
region_2 8_R TGGAAAGCAGTTCCAAAACC 3828 1901
region_3 9_F TGCCTGGTTTTGGAACTGC 3823 1895
region_3 12_R TGCAATGAGAAAGATTGACATTTG 5718 1895
region_4 13_F TCCTCAAATTGCCTACATGCTT 5759 1874
region_4 16_R GCTGGCCCGAAATATGGT 7633 1874
region_5 17_F GATGAAGGTGCTGCAATCG 7601 2406
region_5 20_R GGGCAACTGGTATACAGCTAAAAG 10007 2406
region_6 20_F AACCCAAACATTGACCAAAGAA 9550 1371
region_6 22_R CCACCAGAAAACCCATGTTAGT 10921 1371
region_7 23_F GCTCCAAGAACCCGACAAA 10944 1410
region_7 26_F GGTTGAGGACCCAGTTTGC 12354 1410
region_8 26_F GGTTGAGGACCCAGTTTGC 12354 1898
region_8 29_R CCGAAATCCAGAGGTTTGC 14252 1898
region_9 30_F CAAACCTCTGGATTTCGGAAC 14253 1427
region_9 32_R CTCGGTATCTTGTTAAATCTAAATCCA 15680 1427
region_10 33_F TTAACAAGATACCGAGAAAATGAATTG 15691 1396
region_10 35_R AAGGCACCAGCACCTTCTC 17087 1396
region_11 35_F TGATGGCACTGAACGGAGT 16632 1921
region_11 38_R GTGTTATCAACCAAAGCACTATTCCA 18553 1921
19 reaction scheme
For degraded or hard to amplify samples, a 19 reaction scheme is used:
#Region Primer_ID Sequence_(5-3') Coords Amplicon_size
region_1 1_F TTTAGGATCTTTTGTGTGCGAAT 27 1426
region_1 3_R ACTCCTGCGAGGGTGCTC 1453 1426
region_2 3_F AAGGCTTGCCTTGAGAAGGT 965 973
region_2 4_R TGGTGTCCTCGTCGTCCT 1938 973
region_3 5_F CGATCTAGACGAGGACGACGA 1927 952
region_3 6_R TGTGGCTTAACGCTTATTTGC 2879 952
region_4 7_F GCGTTAAGCCACAGTTATAGCC 2887 941
region_4 8_R TGGAAAGCAGTTCCAAAACC 3828 941
region_5 9_F TGCCTGGTTTTGGAACTGC 3823 940
region_5 10_R CAGCGACACCTAGAGGAAGC 4763 940
region_6 11_F TTGGCTTCCTCTAGGTGTCG 4760 958
region_6 12_R TGCAATGAGAAAGATTGACATTTG 5718 958
region_7 13_F TCCTCAAATTGCCTACATGCTT 5759 906
region_7 14_R TGTGGTAGAATAATAGCCACTCGAC 6665 906
region_8 15_F GGACCCGTCGAGTGGCTAT 6659 974
region_8 16_R GCTGGCCCGAAATATGGT 7633 974
region_9 17_F GATGAAGGTGCTGCAATCG 7601 969
region_9 18_R GCTCGAACATGGTGGTCGT 8570 969
region_10 19_F GGATGGACACGACCACCA 8562 1445
region_10 20_R GGGCAACTGGTATACAGCTAAAAG 10007 1445
region_11 21_F TGTATACCAGTTGCCCCTGAG 10015 906
region_11 22_R CCACCAGAAAACCCATGTTAGT 10921 906
region_12 23_F GCTCCAAGAACCCGACAAA 10944 958
region_12 24_R TCAGGAAGAGAGCATCTTGCAT 11902 958
region_13 25_F TGCAAGATGCTCTCTTCCTGA 11903 947
region_13 26_R CTGAGGTAACACTGTACCAAGATCC 12850 947
region_14 26_F GGTTGAGGACCCAGTTTGC 12354 1898
region_14 29_R CCGAAATCCAGAGGTTTGC 14252 1898
region_15 29_F TTGCGCTCAGCTGTGATG 13783 946
region_15 30_R TAATGTGCGTGTTCCTTCCA 14729 946
region_16 31_F GAGACGCCGGTTTTGGAC 14779 901
region_16 32_R CTCGGTATCTTGTTAAATCTAAATCCA 15680 901
region_17 33_F TTAACAAGATACCGAGAAAATGAATTG 15691 963
region_17 34_R CATGGCTCATTTGCAGGAC 16654 963
region_18 35_F TGATGGCACTGAACGGAGT 16632 977
region_18 36_R TGGTGTGGCATCTTACGTGTAG 17609 977
region_19 37_F TGGTATCTTTGTCTGACGAACTTCT 17578 975
region_19 38_R GTGTTATCAACCAAAGCACTATTCCA 18553 975