Mid/Early release - 45 new EBOV genomes from Sierra Leone

Update:
Please use the BioProject link instead, since the sequences have all been finalized:
http://www.ncbi.nlm.nih.gov/bioproject/PRJNA257197/

As part of our continued collaboration with the Sierra Leone Ministry of Health and Sanitation, Kenema Government Hospital and the VHFC, we’re continuing our early release of EBOV genomes sequenced at the Broad Institute.

This release contains all the genomes sequenced in December of this year:

  1. 10 new early-release genomes
  2. 35 previously released genomes (on this site) that have been refined, including additional sequencing

We have turned a couple of knobs on our assembly pipeline and believe most of these genomes to be accurate. A couple of them will still need additional sequencing before final release to NCBI. The sequences can be downloaded here:

http://cl.ly/281I071K3c1V

The sequences were generated by 101bp PE Illumina sequencing using the protocols described in the Gire et al. and Matranga et al. papers. The genomes were assembled using Trinity, followed by an alignment refinement step with NovoAlign.

We are in the process of gathering metadata, but at this moment we don’t have any exact dates or other metadata for the individual samples.

We are currently in the process of preparing the data for GenBank and SRA submissions. Please note that this is an early release, so accuracy can’t be guaranteed at this stage. We have run into a couple of speed bumps releasing the raw data - please contact us directly if you need the raw reads before final release to NCBI (should be completed shortly).

Disclaimer:
Please feel free to download, share, use, and analyze this data. We are currently in the process of preparing a publication and will post progress on this forum. If you intend to use these sequences for publication prior to the release of our paper, please contact us directly.

A couple of quick things to look at:

G4955, G4994, G4999, G5617 have an extra A in a run of 5 As at position 11,536.
G5016, G5012, G5640 have an extra T in a run of 4 Ts at position 18,596.
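For anyone who wants to check these runs in their own copy of the sequences, here is a minimal sketch that reports the homopolymer run covering a given 1-based position. The function name `homopolymer_run` and the toy sequences are made up for illustration; they are not real EBOV data, and coordinates will depend on which reference or alignment you use.

```python
from typing import Tuple


def homopolymer_run(seq: str, pos: int) -> Tuple[str, int, int]:
    """Return (base, 1-based start, length) of the run covering 1-based `pos`."""
    i = pos - 1
    base = seq[i].upper()
    # Walk left while the previous base matches.
    start = i
    while start > 0 and seq[start - 1].upper() == base:
        start -= 1
    # Walk right while the next base matches.
    end = i
    while end + 1 < len(seq) and seq[end + 1].upper() == base:
        end += 1
    return base, start + 1, end - start + 1


# Toy sequences (hypothetical): a run of 5 As versus the same run with an extra A.
toy_reference = "GCTAAAAAGT"
toy_sample = "GCTAAAAAAGT"
print(homopolymer_run(toy_reference, 5))
print(homopolymer_run(toy_sample, 5))
```

Comparing the reported run length between a sample and the reference at the same anchor position is one simple way to confirm a 1bp insertion in a homopolymer.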

Hi @arambaut, yeah that’s one of the first things we noticed and have been scrutinizing. I’ll note that just yesterday I received two more lanes of sequencing for some of the samples, which include a good number of the ones you had listed, so we’ll see if those line up. Unfortunately, most of these do not have replicate libraries made yet, just replicate sequencing of the same pools.

This data set in general seems to have a higher prevalence of 1bp indels in homopolymer runs than earlier over the summer. After a number of tweaks to the assembly refinement steps, many of the more questionable ones have been filtered out (they probably really exist, but at a minority fraction within the patient). But the two positions mentioned here are well supported by reads (along the lines of 100 reads for and 5 reads against) and are also found in multiple patients. They also don’t have coding / frameshifting effects on the genome. So we’re thinking these two indels in particular may be real.

@Kristian_Andersen, is it possible these indels might actually be a technically accurate picture of our sample, but an inaccurate picture of what was in the patient, due to sample degradation over time? My understanding is that indel errors in these homopolymer runs would require replication of some kind to introduce the error, and that’s not the kind of error mode we would expect from a degrading RNA sample? If we manage to do another library of these samples and can confirm them in both (and can convince ourselves that our data accurately reflects what we had in the tube), then how sure are we that the patients had these indels as well?

I’m willing to believe that a longer period of evolutionary time for this outbreak has allowed these 1bp indels to increase in prevalence. But the thing that makes me hesitate is that we tend not to see such things happen between outbreaks, which is an even longer period of time, so I’m not sure what to think about that. These 1bp indels are not particularly common in EBOV (that we’ve seen so far) or in LASV (which tends to be 3bp indels), but other viruses see this more often (I think WNV and other previous Broad viral projects, Amr would know, this is what Pilon/Bellini were optimized for… the small indels).

Looking at the tree and alignment these two sets of sequences group together irrespective of the 1bp indels.

G4955, G4994, G4999, G5617 have the following additional shared unique SNPs:
A->G at 6681, C->T at 15,757, T->C at 18,037

G5016, G5012, G5640 have the following additional shared unique SNPs:
A->G at 9892, T->C at 17,895
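In case it helps others reproduce this kind of check, below is a rough sketch of how one might list sites where a group of sequences shares a base that no other sequence in the alignment carries. It is not the method used above; `shared_unique_sites` and the toy alignment are hypothetical, and the sketch treats gaps and ambiguity codes like any other character.

```python
def shared_unique_sites(alignment, group):
    """Find columns where all `group` members share one base absent elsewhere.

    alignment: dict of name -> aligned sequence (all the same length).
    group: set of sequence names to test.
    Returns a list of (1-based position, bases seen outside group, group base).
    """
    others = [name for name in alignment if name not in group]
    length = len(next(iter(alignment.values())))
    sites = []
    for i in range(length):
        in_bases = {alignment[name][i] for name in group}
        out_bases = {alignment[name][i] for name in others}
        # Shared: one base within the group. Unique: never seen outside it.
        if len(in_bases) == 1 and not (in_bases & out_bases):
            sites.append((i + 1, "/".join(sorted(out_bases)), next(iter(in_bases))))
    return sites


# Toy alignment (hypothetical names and sequences).
toy_alignment = {
    "s1": "AGGTTC",
    "s2": "AGGTTC",
    "s3": "AAGTCC",
    "s4": "AAGTCC",
}
for position, outside, inside in shared_unique_sites(toy_alignment, {"s1", "s2"}):
    print(f"{outside}->{inside} at {position}")
```

A real analysis would read the alignment from a FASTA file and decide explicitly how to handle `N` and `-` columns.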

Seems plausible that the indels are being transmitted as well.

That’s reassuring to know, that they’re concordant with SNP calls phylogenetically. @evogytis, let me know if any of the SNPs still look out of place in newly constructed trees.

It just occurred to me that there may be an explanation for my earlier concern about the lack of evidence of 1bp indels between EBOV outbreaks. Maybe they actually do happen, but since all genomes from previous outbreaks are reference-assisted assemblies from ABI reads, maybe they don’t show up in GenBank. Alternatively, the biological explanation (which is a bit more of a reach) is that selective constraint in the reservoir somehow prevents these indels from happening over such long timescales, but such constraint doesn’t exist in human hosts.

I think our sequencing probably gives a pretty accurate picture of the EBOV RNA present in our sample, so what we’re seeing is probably biologically accurate. One thing to note, though, is that we don’t know whether we’re looking at transcripts (although these positions are outside the CDRs) or genomic copies. I do not believe what we’re seeing could be due to sample degradation. Importantly, if we had some sort of systematic bias, I’d expect these indels to crop up at significant frequencies across all samples, since the genomes are so similar. In that regard, it’s reassuring that Andrew sees them clustering on the tree (@arambaut, I assume they have other differences than the indels, so it’s not just the indel causing them to cluster?).

Glancing at the data, it’s pretty clear that we see more evidence of high(er) frequency indels in the later dataset than the one from this summer. Presumably most of these indels come with a low fitness cost, so I guess that’s not totally unexpected. @dpark, we should probably look at this more closely - it’d be interesting to see if there’s a correlation between time and something like % editing sites, % indels, or the frequencies of individual indels. Maybe Shirlee could take a look at this? Do we already have the iSNV calls or are we waiting for the replicates? As long as we stick to >5% (and possibly >2%) iSNVs, we should be okay with the individual libraries.
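To make the threshold idea concrete, here is a small sketch of filtering a site by minor allele fraction at the >5% and >2% cutoffs mentioned above. The function names and the read counts are hypothetical, and this is not our actual iSNV caller, which would also need to consider things like strand bias and base quality.

```python
def minor_allele_fraction(counts):
    """Fraction of reads supporting the second most common allele at a site."""
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    ordered = sorted(counts.values(), reverse=True)
    return ordered[1] / total


def passes_isnv_threshold(counts, min_fraction=0.05):
    """True if the minor allele fraction is strictly above `min_fraction`."""
    return minor_allele_fraction(counts) > min_fraction


# Hypothetical per-site read counts, keyed by allele.
toy_sites = {
    "site_a": {"A": 90, "G": 10},
    "site_b": {"A": 97, "G": 3},
    "site_c": {"A": 99, "G": 1},
}
for name, counts in toy_sites.items():
    print(
        name,
        round(minor_allele_fraction(counts), 3),
        passes_isnv_threshold(counts, 0.05),
        passes_isnv_threshold(counts, 0.02),
    )
```

In this toy data, `site_b` is the kind of call that would only survive at the looser 2% cutoff, which is where replicate libraries would matter most.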

The sequences are looking better now. The only SNP I would double check is at position 18914: there are 7 independent A>G mutations, 2 on internal branches and then 5 in 6012, 6020, 5898, 5617 and 4683. Resolving this site should sort out the other odd one I found (at position 5276).

Overall the sequences look fine. GP seems to have an excess of amino acid substitutions per site, but I guess that’s expected.

@Kristian_Andersen - yes they are supported by 3 and 2 additional shared SNPs, respectively, as well as the 1bp insertions (see my reply, above, for details).