Inferring date of collection from sample ID

Many of the new samples are lacking metadata, including the rather critical information on date of sampling. This metadata should be forthcoming, but in an effort to allow some immediate analysis I’ve attempted to infer dates of sampling from sample ID.

578 of the new samples have IDs of the form GXXXX-X, e.g. G3884-1, where the four digits after the G identify the patient and the final digit distinguishes multiple samples from the same patient. Where a patient has multiple samples, I look only at the first.
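As a rough illustration of this ID handling, here is a minimal Python sketch. The function name `first_sample_per_patient` is hypothetical, and the sketch assumes that the "first" sample is the one with the lowest trailing digit, which the text above does not state explicitly.

```python
import re

# Assumed pattern: "G", four patient digits, a hyphen, one sample digit.
SAMPLE_ID = re.compile(r"^G(\d{4})-(\d)$")

def first_sample_per_patient(sample_ids):
    """Map patient code -> sample ID with the lowest suffix for that patient."""
    first = {}
    for sid in sample_ids:
        m = SAMPLE_ID.match(sid)
        if m is None:
            continue  # skip IDs that do not follow the GXXXX-X form
        patient, suffix = int(m.group(1)), int(m.group(2))
        # Keep only the lowest-numbered sample seen so far for this patient.
        if patient not in first or suffix < first[patient][0]:
            first[patient] = (suffix, sid)
    return {patient: sid for patient, (suffix, sid) in first.items()}
```

Called on `["G3884-2", "G3884-1", "G3900-1"]`, this keeps `G3884-1` for patient 3884 and `G3900-1` for patient 3900.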

The patient code correlates strongly with date of sampling for those samples with known collection dates:

This gives an R² of 0.917 and narrow prediction bounds. The average 50% prediction interval is 7.3 days and the average 95% prediction interval is 21.3 days.
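The exact model behind these figures is not described here. As one possible reading, the sketch below fits an ordinary least-squares line of collection day against patient code and derives R² and prediction intervals from it. The function names are hypothetical, and the intervals use a normal approximation to the t quantile, which is only reasonable when there are many samples with known dates.

```python
from math import sqrt
from statistics import NormalDist, mean

def fit_line(x, y):
    """Ordinary least-squares fit of y on x, with R^2 and residual scale."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return {
        "n": n, "xbar": xbar, "sxx": sxx,
        "slope": slope, "intercept": intercept,
        "r2": 1 - ss_res / ss_tot,
        "s": sqrt(ss_res / (n - 2)),  # residual standard error
    }

def prediction_interval(fit, x0, level=0.95):
    """Approximate prediction interval (lower, upper) for a new point at x0."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation to t
    se = fit["s"] * sqrt(1 + 1 / fit["n"] + (x0 - fit["xbar"]) ** 2 / fit["sxx"])
    centre = fit["intercept"] + fit["slope"] * x0
    return centre - z * se, centre + z * se
```

Here `x` would hold patient codes and `y` collection dates expressed as day numbers. Calling `prediction_interval(fit, code, level=0.5)` and `prediction_interval(fit, code, level=0.95)` gives the two kinds of interval mentioned above.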

This relationship is used to extrapolate the collection dates of the remaining samples:
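To make the extrapolation step concrete, here is a small sketch that turns a fitted line into a calendar date. It assumes a straight-line relationship between patient code and days since some reference date. The function name, the slope and intercept in the usage note, and the reference date are all hypothetical placeholders, not the fitted values from this analysis.

```python
from datetime import date, timedelta

def predict_date(patient_code, slope, intercept, origin):
    """Predict a collection date from a patient code.

    slope and intercept describe an assumed linear fit of
    (days since origin) against patient code.
    """
    days = round(intercept + slope * patient_code)
    return origin + timedelta(days=days)
```

For example, with the placeholder values `slope=0.5`, `intercept=-1800.0` and `origin=date(2014, 1, 1)`, patient code 3884 maps to 142 days after the origin, i.e. 2014-05-23.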

Additional information is available in the ebola-dates repo on GitHub (pull requests welcome).

There is also a .tsv file of predicted date of sampling.