Early release - 16 new EBOV genomes from Sierra Leone

Update:
Please use the BioProject link instead, since the sequences have all been finalized:
http://www.ncbi.nlm.nih.gov/bioproject/PRJNA257197/

As part of our continued collaboration with the Sierra Leone Ministry of Health and Sanitation, Kenema Government Hospital and the VHFC, we recently received a new batch of inactivated EBOV samples from Sierra Leone for sequencing at the Broad Institute. We have now completed another set of 16 EBOV samples and the assembled consensus sequences can be downloaded here:

New link (45 sequences):
http://cl.ly/281I071K3c1V

The sequences were generated by 101bp PE Illumina sequencing using the protocols described in the Gire et al. and Matranga et al. papers. The genomes were assembled using Trinity, followed by an alignment refinement step with NovoAlign. The average bp coverage for this dataset was 362x [17x - 827x].

We are in the process of gathering metadata, but at this moment we don’t have any exact dates or other metadata for the individual samples.

We are currently in the process of preparing the data for GenBank and SRA submissions. Please note that this is an early release, so accuracy can’t be guaranteed at this stage.

Disclaimer:
Please feel free to download, share, use, and analyze this data. We are currently in the process of preparing a publication and will post progress on this forum. If you intend to use these sequences for publication prior to the release of our paper, please contact us directly.

Sorry, just to clarify - these 16 sequences are entirely new samples - so together with the previous 21 sequences, we now have 37 additional sequences.

A group of them from both the previous set and this new one have some stuff going on at the end:

One of this new set (G3724) was in the original set from the summer (accession KM233053).

@arambaut - Oh yeah, sorry about that. With each batch, we’ve been running an old sample from this summer as a control. Last week when I passed the assemblies to Kristian it was clearly labeled, but this week I forgot to label it. That’s the exact same sample from this summer, so ignore it.

@dpark. Any thoughts on the ends? Must be artifacts - I have actually seen this happen before (adapters?), so the reads should probably just be trimmed back. It might be good to check the bams.

Here is a tree (with all available genomes) and a root-to-tip divergence plot for those with known dates. The new Sierra Leone genomes are in dark blue. The end bits mentioned above were trimmed out of the alignment.


$$\text{slope} = 1.24 \times 10^{-3}$$

As an estimate of evolutionary rate, this is slower than the estimate in the Gire et al. paper but still within the credible intervals of that estimate. This may be because the elevating effect of transient deleterious mutations is being diluted out as the data span a greater time interval. This is now much closer to the long-term rate of evolution estimated over 38 years of EBOV outbreaks in Dudas and Rambaut, 2014 and Carroll et al., 2012.
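For anyone who wants to reproduce this kind of slope from their own tree, here is a minimal sketch of a root-to-tip regression: an ordinary least-squares fit of each tip's divergence from the root against its sampling date. The function name `root_to_tip_fit` and the input values are assumptions for illustration only (the numbers are made up, not the data plotted in this thread), and this is not necessarily how the plot above was produced.

```python
def root_to_tip_fit(dates, divergences):
    """Least-squares fit of root-to-tip divergence on sampling date.

    dates       -- sampling dates as decimal years
    divergences -- root-to-tip distances (substitutions per site)
    Returns (slope, intercept, root_date), where slope is the rate
    estimate and root_date is the x-intercept (date of zero divergence).
    """
    n = len(dates)
    if n < 2 or n != len(divergences):
        raise ValueError("need two or more dates and one divergence per date")
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    root_date = -intercept / slope if slope != 0 else None
    return slope, intercept, root_date


# Made-up illustrative inputs (NOT the genomes from this thread):
# tips lying exactly on a line with rate 1.24e-3 and a root at 2014.0.
example_dates = [2014.20, 2014.40, 2014.45, 2014.75, 2014.80]
example_divergences = [1.24e-3 * (t - 2014.0) for t in example_dates]

slope, intercept, root_date = root_to_tip_fit(example_dates, example_divergences)
print(f"slope = {slope:.3e} subs/site/year, root date = {root_date:.2f}")
```

Keep in mind that tips in a tree are not independent observations, so a regression like this is a quick check on clock-like signal rather than a substitute for a proper BEAST analysis.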

@arambaut - so there are two explanations for a slower substitution rate: one is that we’re seeing more purifying selection bringing the rate further down from the baseline mutation rate. The other is that we’re just getting better estimates now because our denominator (the time spanned by our data) is larger. Thoughts? Maybe both?

Also, regarding the funny SNPs on the end, they’re definitely technical artifacts and I’m tuning the knobs a bit on our assembly refinement parameters to clean that up before submitting to GenBank. This is what we mean by preliminary! Turns out that those assemblies happen to end in the middle of a semi-palindromic section of sequence, and the “SNPs” are actually coming from the other side of the palindrome (so they end up having great read support, but they’re still not real).
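To make the palindrome point concrete, here is a rough sketch of one way to check whether the tail of an assembly resembles the reverse complement of a nearby stretch, which is the situation where reads from the other arm can pile up on the end with a few mismatches. The helper names and the toy sequence are assumptions for illustration; this is not the refinement pipeline described above.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_revcomp_hit(seq, tail_len):
    """Find where the reverse complement of the last tail_len bases fits best.

    Slides the reverse-complemented tail along seq and returns
    (start, mismatches) for the window with the fewest mismatches
    (the earliest such window on ties).
    """
    if not 0 < tail_len <= len(seq):
        raise ValueError("tail_len must be between 1 and len(seq)")
    probe = reverse_complement(seq[-tail_len:])
    best = None
    for start in range(len(seq) - tail_len + 1):
        window = seq[start:start + tail_len]
        mismatches = sum(a != b for a, b in zip(probe, window))
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


# Toy sequence (made up): left arm GACCATGA at position 2, a spacer,
# then an imperfect reverse-complement copy of that arm at the very end.
toy = "TTGACCATGACCCCCTCATGGTA"
start, mismatches = best_revcomp_hit(toy, 8)
print(f"best reverse-complement match at {start} with {mismatches} mismatch(es)")
```

A near-perfect hit close to the end of the contig would be a hint that apparent variants there deserve a look in the BAMs before being trusted.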

I suspect a bit of both. We did see elevated dN/dS rates in the June data and if it were the former then we would predict this would be dropping as the rate drops. Here is a BEAST density for the new rate compared with the long-term and the June data:


The ZEBOV and the October analyses are done with a relaxed clock, the June 2014 with a strict clock.