Mid/Early release - 96 EBOV genomes from Sierra Leone

Kristian_Andersen · January 30, 2015, 12:50am

Update:
Please use the BioProject link instead, since the sequences have all been finalized:
http://www.ncbi.nlm.nih.gov/bioproject/PRJNA257197/

As part of our continued collaboration with the Sierra Leone Ministry of Health and Sanitation, Kenema Government Hospital and the VHFC, we’re continuing our early release of EBOV genomes sequenced at the Broad Institute.

This release contains all the EBOV genomes sequenced in December 2014 and January 2015:

51 new early-release genomes
45 previously released genomes (on this site) that have been refined, including additional sequencing.

The sequences can be downloaded here:

http://cl.ly/0Y2p3w0w2H2R

The sequences were generated by 101bp PE Illumina sequencing using the protocols described in the Gire et al. and Matranga et al. papers. The genomes were assembled using Trinity, followed by an alignment refinement step with NovoAlign.

We are in the process of gathering metadata, but at this moment we don’t have any exact dates or other metadata for the individual samples.

Please note that this is an early release, so accuracy can’t be guaranteed at this stage.

Disclaimer:
Please feel free to download, share, use, and analyze this data. We are currently in the process of preparing a publication and will post progress on this forum. If you intend to use these sequences for publication prior to the release of our paper, please contact us directly.

arambaut · January 30, 2015, 12:29pm

A couple of things to check:

G4415 has a sequence of 8 T->C changes starting at site 5512 (5453 in the original sequence).

There are eight sequences with a G->A at 18914, which is possibly an artefact (they otherwise don’t group together): G6091, G4907, G4190, G5844, G5997, G5617, G4415, G5647
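For anyone wanting to script this kind of check, here is a minimal Python sketch. It assumes the alignment is already loaded as a dict mapping sample name to aligned sequence string, and that site numbers are 1-based alignment columns. The function name `carriers` and the toy data are hypothetical, not part of the released files.

```python
def carriers(alignment, site, allele):
    """Return sorted names of sequences carrying `allele` at 1-based `site`."""
    index = site - 1  # convert the 1-based site to a 0-based string index
    return sorted(
        name
        for name, seq in alignment.items()
        if index < len(seq) and seq[index].upper() == allele.upper()
    )


# Toy alignment (made-up data, not real EBOV sequence)
toy = {
    "sampleA": "ACGTG",
    "sampleB": "ACGTA",
    "sampleC": "ACGTA",
}

print(carriers(toy, 5, "A"))
```

Comparing the returned names against the clades in a tree is then a quick way to see whether the carriers group together.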

dpark · January 30, 2015, 10:11pm

Hi Andrew,

Regarding position 18914, we’ve noticed this before (and you’ve mentioned it before) in our earlier sequencing outputs from December. I’ve spent some time looking at the assemblies and reads in this part of the genome and have managed to tune the parameters in a way that corrects out a lot of the other spurious SNPs we saw earlier, but the ones at 18914 are not ones that I can remove by any systematic filter. They have decent read support and the alignments look fine. But because of where it is positioned about 50bp from the end, I am highly suspicious of this SNP and I don’t trust it on a personal level. But I can’t find an automated way to take it out. Since the alleles here make no sense phylogenetically, I think we should simply manually omit this position based on its extremely close proximity to the end… that, combined with the phylogenetic argument should be defensible in a methods section of a paper.
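A proximity-to-the-end rule like this could be written down explicitly so that it is reproducible. The following is only a sketch under stated assumptions: the genome length of 18959 and the 100 bp margin are illustrative values chosen for this example (use the length of the reference or alignment actually in hand), and `near_end` and `mask_site` are hypothetical helper names.

```python
GENOME_LENGTH = 18959  # assumed value for illustration only
MARGIN = 100           # hypothetical cutoff, not a value from this thread


def near_end(site, genome_length, margin):
    """True if 1-based `site` lies within `margin` bases of either end."""
    if not 1 <= site <= genome_length:
        raise ValueError("site is outside the genome")
    return site <= margin or site > genome_length - margin


def mask_site(alignment, site, mask_char="N"):
    """Return a copy of the alignment with column `site` (1-based) masked."""
    i = site - 1
    masked = {}
    for name, seq in alignment.items():
        if i >= len(seq):
            raise ValueError(f"{name} is shorter than site {site}")
        masked[name] = seq[:i] + mask_char + seq[i + 1:]
    return masked


print(near_end(18914, GENOME_LENGTH, MARGIN))
print(near_end(5512, GENOME_LENGTH, MARGIN))
```

The rule only encodes the positional argument; the phylogenetic argument still has to be made separately.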

The stretch of T->C changes in G4415 is interesting: we haven’t seen that in other samples before, and there are reasons to think it could be a weird artifact. The alignments look normal in that area, and although the T allele is often present at a pretty minor frequency, the C’s are certainly the majority in those areas. That patch of genome doesn’t look overtly repetitive, but I haven’t looked systematically yet. Anyway, my thought: it’s good to be suspicious of it, but I can’t yet rule it out. We’re going to queue that up for an independent sequencing run (along with many others) and see if it replicates.
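To make "minor frequency" concrete, per-site allele fractions can be computed from the bases that the reads show at that position. This is a minimal sketch, assuming the read bases at one site have already been extracted into a string; the function name and the example pileup are made up for illustration.

```python
from collections import Counter

VALID = {"A", "C", "G", "T"}


def allele_frequencies(bases):
    """Return {base: fraction} for the A/C/G/T bases observed at one site."""
    counts = Counter(b.upper() for b in bases if b.upper() in VALID)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


pileup = "CCCCCCCCTC"  # made-up example: 9 reads with C, 1 read with T
freqs = allele_frequencies(pileup)
major = max(freqs, key=freqs.get)
print(major, freqs)
```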

dpark · January 30, 2015, 10:23pm

Dear all: let me give a little status update on this sequencing effort. As you can see, this batch of samples has encountered more technical difficulty than the earlier batch we did last summer. There are various reasons for that, but we’re cranking through and trying to just hit it with more coverage where we can.

This represents the completion of the first pass of sequencing on all 573 new samples from Sierra Leone collected between June and September 2014. 96 of these samples produced high-quality assemblies. Many more produced EBOV reads and intermediate-quality assemblies that have not been released here yet. And a number of these samples are too degraded to be salvaged.

We are now cherry-picking another round of samples that produced intermediate-quality results and sequencing them deeper and with independent libraries. Included in that resequencing effort will be a few miscellaneous samples (like G4415 and G5119) that showed unusual SNPs that are worth validating independently. Let me know if you have any particular favorites; otherwise the bulk of the effort will be focused on recovering these intermediate-quality samples and getting them to a high-quality assembly. Similarly, intrahost variant calling will follow as the deeper coverage from independent runs gets added.

arambaut · January 31, 2015, 12:22pm

Will mask out the entire 18914 site. Although it is interesting that both the PHE UK cases (from Sierra Leone) have an ‘R’ ambiguity at this site (and a ‘W’ 2 bases upstream).
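For reference, in the IUPAC nucleotide codes ‘R’ stands for A or G and ‘W’ for A or T, so an ‘R’ at this site is compatible with both the G and the A allele. A small Python sketch of that lookup follows; the `compatible` helper is a hypothetical name used only for this example.

```python
# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def compatible(code, *bases):
    """True if every base in `bases` is allowed by the IUPAC `code`."""
    allowed = IUPAC[code.upper()]
    return all(b.upper() in allowed for b in bases)


print(compatible("R", "G", "A"))  # both alleles fit under R
print(compatible("W", "G"))       # G is not covered by W
```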

For G4415 - it seems really implausible that 8 T->Cs in a short region would be an evolutionary change and there are 4 other unique T->Cs in this sequence elsewhere. If it is real then perhaps some host editing process? For phylogenetics I think I will replace these with Ns or just omit this sequence.
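One way to quantify how clustered such changes are is to list the T->C positions relative to a reference and find the window holding the most of them. The sketch below assumes the two sequences are already aligned to the same length; the function names, the window size, and the toy sequences are hypothetical choices for illustration.

```python
def substitutions(ref, seq, from_base="T", to_base="C"):
    """1-based positions where `ref` has from_base and `seq` has to_base."""
    return [
        i + 1
        for i, (r, s) in enumerate(zip(ref.upper(), seq.upper()))
        if r == from_base and s == to_base
    ]


def densest_window(positions, window):
    """Return (count, start) for the `window`-base span with most positions."""
    positions = sorted(positions)
    best = (0, None)
    left = 0
    for right, pos in enumerate(positions):
        # Shrink from the left until the span fits inside the window
        while pos - positions[left] >= window:
            left += 1
        count = right - left + 1
        if count > best[0]:
            best = (count, positions[left])
    return best


# Toy sequences (made-up data, not real EBOV sequence)
ref = "ATTTATTTAAAAAAAAAAAT"
seq = "ACCCACCTAAAAAAAAAAAC"

sites = substitutions(ref, seq)
print(sites)
print(densest_window(sites, 10))
```

Positions flagged this way could then be replaced with Ns in the one affected sequence rather than dropping it from the alignment.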

Interestingly there is another phylogenetic cluster (G5016, G5012, G4382, G5640, G5647) showing a shared insertion in an intergenic region (T at site 18649). Like the previous one, it looks real as there are other shared segregating sites.