Early release - 21 new EBOV genomes from Sierra Leone

Update:
Please use the BioProject link instead, since the sequences have all been finalized:
http://www.ncbi.nlm.nih.gov/bioproject/PRJNA257197/

As part of our continued collaboration with the Sierra Leone Ministry of Health and Sanitation, Kenema Government Hospital and VHFC, a couple of weeks ago we received a new batch of inactivated EBOV samples from Sierra Leone for sequencing at the Broad Institute. Thirty-six hours ago our first run completed and we have now assembled 21 full-length genomes covering dates in June and October/early November. The assembled consensus sequences can be downloaded here:

New link (45 sequences):
http://cl.ly/281I071K3c1V

The sequences were generated by 101 bp paired-end Illumina sequencing using the protocols described in the Gire et al. and Matranga et al. papers. The genomes were assembled using Trinity, followed by an alignment refinement step with NovoAlign. The average per-base coverage for this dataset was 1,056x (range: 29x to 9,555x).

We have more samples and expect to be releasing data over the coming weeks and months - this was our initial run on this batch of samples. We are in the process of gathering metadata, but at this moment we don’t have any exact dates or other metadata for the individual samples.

We are currently in the process of preparing the data for GenBank and SRA submissions. Please note that this is an early release, so accuracy can’t be guaranteed at this stage.

Disclaimer:
Please feel free to download, share, use, and analyze this data. We are currently in the process of preparing a publication and will post progress on this forum. If you intend to use these sequences for publication prior to the release of our paper, please contact us directly.

Here is a quick tree of this initial data:


Green labels denote the new Sierra Leone viruses. The clades defined in Gire et al are labelled SL1-3.

Several analyses to follow, but some quick insights:

  1. These sequences follow the same evolutionary patterns as observed in previous samples from Sierra Leone.

  2. The sequenced samples derive from ‘clade 3’ as described in our Gire et al. publication. No evidence of genetic clades 1 and 2.

  3. No evidence for further ‘spill over’ events from the animal reservoir - the data suggest that this outbreak continues to be fueled by human-to-human transmission.

  4. As expected, we continue to observe genetic variation and viral mutation - however, we do not have any evidence to suggest that any of these changes are linked to functional differences or adaptation.

Please feel free to contact us if you’re interested in posting your analyses to this forum - we strongly encourage collaboration and data sharing.

To me, what’s most interesting is the lack of further importations into Sierra Leone from elsewhere. Not only do these all appear to descend from the initial entry in May, but they carry the derived allele (at 10,218) whose appearance we observed in our earlier Gire et al. data set, meaning that these are very likely to have descended from those initial 14 travelers.

@dpark. Yes I agree - although, I realized that I don’t think we can actually say this for certain? We don’t have any later sequences from Guinea, so it’s possible that the current lineages circulating in Guinea are all from the same clades? Especially the ones in the border regions?

More data from Guinea would be very helpful.

If it’s true that the Liberian strain is indeed clade 2, then it’s possible that all the currently circulating lineages share clade 2/3 ancestry.

@Kristian_Andersen I would agree with you on clade 2 viruses (like the Liberian sample) that perhaps it was just prevalent in Guinea and could have come from there. But for the clade 3 viruses, we observed this novel mutation arise in Sierra Leone. I’d be pretty confident in saying that all viruses with the 10,218 derived allele were of Sierra Leonean origin. Thoughts?

@dpark. Yup, I think you’re right - clade 3 definitely originated in Sierra Leone (but clade 2 - which is only one mutation away, originated in Guinea). However, because we’re so close to the border region of Sierra Leone and Guinea, there could likely have been ‘cross-talk’ between Guinea and Sierra Leone. I certainly wouldn’t be surprised to find clade 3 in Guinea today.

The meat of the question is correct though - clade 3 originated in Sierra Leone, and the lineages we now observe in Sierra Leone all came from that.

I’m very surprised by clade 2 in Liberia - (if the sequence is correct) that suggests a reintroduction into Liberia from Guinea or Sierra Leone (as opposed to a ‘slow fuse’ scenario from the introduction of EBOV into Liberia in early April).

There are 4 SNPs that define the SL2 clade. We suggest this lineage came in at the funeral, so the lineage must have been in Guinea. So the jump to Liberia could have been from Guinea, but it would have been later than the first cases in Liberia. Alternatively, could the SL2 lineage come from Liberia? We are talking about a 3-way border here.

Ah good point, so it’d be too strong a statement to say that there were no Ebola border crossings into Sierra Leone since May, because a transmission chain may have exited the country and later returned. But we can say that there’s a good likelihood that all Ebola in SL continues to have ultimately originated from the original May event.

How about clocking and rates? I know that sample dates and metadata are currently an issue, but what about a simple root-to-tip vs. G# plot?

Dates are now available for most of the new sequences. Here is a root-to-tip regression with the new genomes (light blue), Gire et al genomes (dark blue) and the Guinea genomes (green).

The slope of the regression (an estimate of the rate of evolution) is 1.6x10^-3.
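For readers who want to try the same kind of plot, here is a minimal sketch of a root-to-tip regression: an ordinary least-squares fit of each tip's genetic distance from the root against its sampling date. It is not the code used for the figure above. The function name and the data points are hypothetical; the toy values are synthetic and were constructed to lie exactly on a line with slope 1.6e-3, so they only demonstrate the mechanics. Real distances would come from a rooted tree.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    dates     -- sampling dates as decimal years
    distances -- root-to-tip distances (substitutions per site)
    Returns (slope, intercept, root_date), where root_date is the
    x-intercept of the fitted line (a rough estimate of the root's date).
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    root_date = -intercept / slope
    return slope, intercept, root_date


# Synthetic, hypothetical data: NOT the values from this thread.
dates = [2014.25, 2014.40, 2014.45, 2014.80, 2014.85]
distances = [0.00040, 0.00064, 0.00072, 0.00128, 0.00136]

slope, intercept, root_date = root_to_tip_regression(dates, distances)
print(f"slope = {slope:.2e}, estimated root date = {root_date:.2f}")
```

The slope is in the units of the inputs, so with distances in substitutions per site and dates in decimal years it reads as substitutions per site per year.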

Some odd SNPs I’ve noticed:
301, 4200, 5849, 17849 and 18707 - EBOV_2014_G5112.1+EBOV_2014_G4861.1 and EBOV_2014_G5119.1 have independent T>C, A>G, T>C, G>A and T>C mutations in intergenic regions, respectively, except for 17849, which is non-synonymous and changes alanine into threonine in the L gene. Could be sequencing errors, mis-inferred tree or genuine homoplasies (though very unlikely).
6726 - EBOV_2014_G4999.1, EBOV_2014_G4994.1 and EBOV_2014_G4955.1 have a non-synonymous A>G mutation, which changes a GP residue shared by all isoforms from a threonine into alanine.
9971 - EBOV_2014_G5012.3 and EBOV_2014_G5016.1, apparent A>G homoplasy, intergenic.
14020 - EBOV_2014_G5016.1, EBOV_2014_G4886.1 and EBOV_2014_G6069.1 have a shared reversion C>T (synonymous) from their small clade in the L gene.

I strongly suspect that the 5 apparently homoplasious sites are from the same clade but are split in the tree by unique sequencing errors - EBOV_2014_G5119.1 has 4 unique mutations between sites 6675 and 6710, whereas EBOV_2014_G4861.1 has unique mutations at sites 8536 and 18910. Definitely have to double-check those. Also we should keep an eye on the shared non-synonymous mutations in future sequences - if selection is acting on those sites we expect to see those alleles rising in frequency.
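A quick way to flag candidates for this kind of double-check is to list mutations that are private to a single sequence in the alignment. The sketch below assumes the sequences are already aligned and of equal length; the function name and the toy alignment are hypothetical and are not the EBOV data. It compares each sequence to the majority base per column, which is a crude stand-in for a proper tree-based reconstruction and works best with three or more sequences.

```python
from collections import Counter


def private_mutations(alignment):
    """Return, per sequence, the mutations seen in that sequence only.

    alignment -- dict mapping sequence name to an aligned sequence string
    Each mutation is a tuple (1-based position, majority base, private base).
    Ambiguous bases and gaps (anything outside ACGT) are ignored.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    result = {name: [] for name in names}
    for i in range(length):
        column = [alignment[name][i] for name in names]
        counts = Counter(b for b in column if b in "ACGT")
        if len(counts) < 2:
            continue  # invariant column, nothing to report
        majority, _ = counts.most_common(1)[0]
        for name, base in zip(names, column):
            if base in "ACGT" and base != majority and counts[base] == 1:
                result[name].append((i + 1, majority, base))
    return result


# Hypothetical toy alignment, for illustration only.
toy = {
    "seq_a": "ACGTACGTAC",
    "seq_b": "ACGTACGTAC",
    "seq_c": "ACGAACGTAC",
    "seq_d": "ACGTACGTGC",
}
for name, muts in private_mutations(toy).items():
    print(name, muts)
```

Clusters of private mutations within a short window, or at the very ends of the genome, would be the first places to check against the reads.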

Thanks @evogytis - we’ll definitely double check those positions in the BAM files. We’re still ironing out a few kinks with the data so it’s possible that these could be errors (we were pretty careful with the basecalls though). @dpark, do you think you could please take a look at these?

You beat me to the reply, yeah that’s exactly what I was going to say. I’m staring at some of the reads at certain spots anyway, and I’ll also look at those positions in those individuals to see if there might be anything questionable.

A few updates. Regarding the GP mutations in G5119.1 that @swohl, @alin and @evogytis are discussing, I haven’t yet produced iSNV calls systematically since we don’t yet have replicate sequencing runs from independent library constructions. But since I’ve been staring at the reads for G5119.1 between 6675 and 6710, I can say that the typical Makona-2014 sequence in this region is present at about 5% in the reads, and the four SNPs of interest are at about 95%, and they do all appear to be linked (at least the three that are all really close are definitely linked), so basically, just two intrahost haplotypes here. The read support looks quite solid; there’s no reason to suspect sequencing errors in this region.

@arambaut - I’ve fixed the assembly errors in the palindromic region at the very end of the genome by increasing some of our edge trimming parameters. Those SNPs all go away with the exception of 18910, which continues to be well supported by reads. Still have reason to be suspicious of 18910 because of where it is, but I can’t come up with a good reason to exclude it, so in my latest assemblies, we keep that one in there.

As for the other potential homoplasies pointed out by @evogytis, the ones I’ve managed to spot check look well supported. My guess is that the tree topology might shift around a bit as we add new sequences each week? Looking at the sequences, is it impossible to come up with a tree that eliminates these recurrent mutations? Is it possible that at this point we’ve seen enough evolutionary time pass (within just 2014 itself) that a recurrent mutation is believable?

I’m not sure I have a good sense of scale for that. @arambaut, if you say 1e-3 / site / year, then that means a single genome would experience 38 mutations a year? Right now, the WHO case totals for 2014 are about the same as the number of base pairs in the genome, so does that mean every position in the genome has had 38 opportunities to turn over this year? I must be wrong about the scale of this, that doesn’t seem right.
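For scale, the per-genome figure in the question above is just the per-site rate multiplied by the genome length. The sketch below assumes a genome length of 18,959 nt, which is not stated in this thread, and the function name is hypothetical. The 1e-3 and 1.6e-3 rates are the ones quoted in the thread; 2e-3 is included only because, under the assumed length, it is the rate that gives roughly 38.

```python
# Assumed EBOV genome length in nucleotides; not stated in this thread.
GENOME_LENGTH = 18_959


def substitutions_per_genome_per_year(rate_per_site_per_year,
                                      genome_length=GENOME_LENGTH):
    """Expected substitutions per year along a single lineage."""
    return rate_per_site_per_year * genome_length


# 1e-3 and 1.6e-3 appear in the thread; 2e-3 is an illustrative extra.
for rate in (1.0e-3, 1.6e-3, 2.0e-3):
    per_genome = substitutions_per_genome_per_year(rate)
    print(f"{rate:.1e} per site per year -> {per_genome:.1f} per genome per year")
```

Under these assumptions the three rates give about 19, 30 and 38 substitutions per genome per year. This is a rate along one lineage, so it says nothing by itself about how often a given site has changed across all cases.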

I moved 2 posts to a new topic: Interesting SNPs in EBOV GP