Continuing the discussion from Mid/Early release - 45 new EBOV genomes from Sierra Leone:
As part of our continued collaboration with the Sierra Leone Ministry of Health and Sanitation, Kenema Government Hospital and the VHFC, we’re continuing our early release of EBOV genomes sequenced at the Broad Institute.
This release contains all the genomes sequenced in December of this year:
10 new early-release genomes
35 previously released genomes (on this site) that have been refined, including additional sequencing
We have turned a couple of knobs on our assembly pipeline and believe most of these genomes to be accurate. A couple of them will still need additional sequencing before final release to NCBI. The sequences can be downloaded here:
http://cl.ly/281I071K3c1V
The sequences were generated by 101bp PE Illumina sequencing using the protocols described in the Gire et al. and Matranga et al. papers. The genomes were assembled using Trinity, followed by an alignment refinement step with NovoAlign.
We are in the process of gathering metadata, but at this moment we don’t have any exact dates or other metadata for the individual samples.
We are currently in the process of preparing the data for GenBank and SRA submissions. Please note that this is an early release, so accuracy can’t be guaranteed at this stage. We have run into a couple of speed bumps releasing the raw data - please contact us directly if you need the raw reads before final release to NCBI (should be completed shortly).
Disclaimer:
Please feel free to download, share, use, and analyze this data. We are currently in the process of preparing a publication and will post progress on this forum. If you intend to use these sequences for publication prior to the release of our paper, please contact us directly.
We have dates of reporting for many of these sequences, but some are missing from the WHO line list. We can impute these dates from other samples with adjacent patient IDs, so here is some documentation of the logic used for these imputations. The dates here (in dd/mm/yy form) are the dates of reporting of the cases, but these are almost always the same as the date of initial sample collection where this is known.
G4955 is likely from 2014-08-13:
G4942 12/08/14
G4946 13/08/14
G4950 13/08/14
G4956 13/08/14
G4960 14/08/14
G5119 likely from 2014-08-19 or 2014-08-20:
G5117 19/08/14
G5118 19/08/14
G5134 20/08/14
G5212 22/08/14
G5640 is likely from 2014-09-10 to 2014-09-12:
G5621 09/09/14
G5643 10/09/14
G5661 12/09/14
G5684 13/09/14
G5982, G5983, G5997, G6012 & G6020 are likely from 2014-09-23 to 2014-09-25:
G5948 23/09/14
G5950 23/09/14
G6050 25/09/14
G6060 25/09/14
The remaining 3 - G6089, G6091, G6104 - are likely to be on or after 2014-09-25 but probably not by much:
G6069 25/09/14
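For anyone who wants to reproduce this kind of bracketing, here is a minimal Python sketch. It assumes that G-numbers were assigned roughly in order of reporting, so an undated sample falls between the dates of its nearest dated neighbours by ID. The names `KNOWN`, `sample_number` and `impute_range` are made up for illustration and are not part of any pipeline used here. It is only a rough guide: the ranges quoted above reflect judgement and can differ from the raw bracket (for example, for G5640).

```python
from datetime import date, datetime

# Reporting dates (dd/mm/yy) copied from the lists above.
KNOWN = {
    "G4942": "12/08/14", "G4946": "13/08/14", "G4950": "13/08/14",
    "G4956": "13/08/14", "G4960": "14/08/14",
    "G5117": "19/08/14", "G5118": "19/08/14", "G5134": "20/08/14",
    "G5212": "22/08/14",
    "G5621": "09/09/14", "G5643": "10/09/14", "G5661": "12/09/14",
    "G5684": "13/09/14",
    "G5948": "23/09/14", "G5950": "23/09/14", "G6050": "25/09/14",
    "G6060": "25/09/14", "G6069": "25/09/14",
}


def sample_number(sample_id: str) -> int:
    """Numeric part of an ID such as 'G4955' (assumed format: 'G' + digits)."""
    return int(sample_id.lstrip("G"))


def parse_ddmmyy(text: str) -> date:
    """Parse a dd/mm/yy string into a date object."""
    return datetime.strptime(text, "%d/%m/%y").date()


def impute_range(sample_id: str, known: dict) -> tuple:
    """Bracket an undated sample by its nearest dated neighbours.

    Returns (earliest, latest); either side is None when there is no
    dated sample on that side.
    """
    target = sample_number(sample_id)
    dated = sorted((sample_number(k), parse_ddmmyy(v)) for k, v in known.items())
    before = [d for n, d in dated if n < target]
    after = [d for n, d in dated if n > target]
    return (before[-1] if before else None, after[0] if after else None)


# Print the raw bracket for a few of the undated samples discussed above.
for sid in ["G4955", "G5119", "G5982", "G6089"]:
    earliest, latest = impute_range(sid, KNOWN)
    print(sid, earliest, latest)
```

An open upper bound (`None`) corresponds to the "on or after" cases, where no later dated sample is available.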
Great, thanks @arambaut. Do you have a .csv file with all the various dates (imputed and otherwise) that you could please share?
Here is a .csv file for all 45 sequences: