Genomic epidemiology of the early stages of SARS-CoV-2 outbreak in Russia

g.bazykin · July 21, 2020, 4:30pm

Russia ranks fourth in the number of confirmed COVID-19 cases globally. In our new preprint, we perform the first (to my knowledge) study of the genomic epidemiology of SARS-CoV-2 in Russia in March-April. It is based on 211 genomes from 25 of Russia’s 85 regions.

Key findings:

67 independent introductions (probably more), mostly from Europe;
9 Russian transmission lineages;
phylogeographic positions of samples match direct travel data: in 9 out of 13 cases, the position is consistent with the country of origin, including 3 cases in which the country of origin is uniquely and correctly identified (France, Switzerland and Saudi Arabia);
no trace of export outside Russia.

Additionally, we study a large nosocomial cluster at the Vreden hospital in Saint Petersburg. Over 700 patients and medical staff were locked down there for over a month; over 400 got infected. We find that the virus was introduced into the hospital up to 4 times; each introduction gave rise to an outbreak of its own, with an initial Rt of ~4, later reduced to ~1.

https://www.medrxiv.org/content/10.1101/2020.07.14.20150979v1

g.bazykin · July 21, 2020, 5:46pm

To count introductions, we use the (limited) direct data on travel that we have. We split our Russian samples into five distinct groups, depending on their phylogenetic position relative to other Russian and non-Russian samples (see figure):

For Russian singletons and Russian transmission lineages, we used maximum parsimony, assuming that they each result from a distinct introduction.

For stem clusters, stem-derived singletons and stem-derived transmission lineages, it’s more complex. For example, the pattern in the left panel in the figure above could result from anywhere between 1 and 8 distinct introductions, depending on which of the transmissions corresponding to the ancestral node occurred prior to introduction, and which in Russia.

Facing a similar problem (on a much larger UK dataset), Pybus et al. (https://pando.tools/t/preliminary-analysis-of-sars-cov-2-importation-establishment-of-uk-transmission-lineages/507) assumed that the ancestral state was non-UK, so that each transmission lineage resulted from a distinct introduction. It would be tempting to use a similar simple rule to estimate the number of introductions for stem clusters and stem-derived singletons.

However, the travel data show that no simple rule would work. For example, for some of the stem-derived transmission lineages, we know that most individuals haven’t travelled (in the figure, a Russian flag means no travel), suggesting that the lineage could have resulted from transmission within Russia.

In other lineages, however, we see multiple individuals who have travelled:

To address this as well as we can, we use a mixed approach. We assume that the number of introductions for each of the categories above is proportional to the fraction of individuals who have travelled, among all individuals with travel history. This gives us ~0.33 imports per stem-derived transmission lineage; ~0.14 imports per stem-derived singleton; and ~0.36 imports per sequence in a stem cluster. For details, see here: https://www.medrxiv.org/content/medrxiv/suppl/2020/07/17/2020.07.14.20150979.DC1/2020.07.14.20150979-1.pdf

This yields our estimate of 67 introductions overall giving rise to the sampled diversity.
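
The mixed approach amounts to a weighted count over the five categories. Here is a minimal Python sketch of that arithmetic. Only the approximate per-unit rates (~0.33, ~0.14, ~0.36), the one-introduction-per-unit rule for Russian singletons and Russian transmission lineages, and the figure of 9 Russian transmission lineages come from the posts above. All other counts are hypothetical placeholders, so the result is not meant to reproduce our estimate of 67; the real numbers are in the supplement.

```python
# Expected imports per unit in each phylogenetic category.
# 1.0 reflects the parsimony assumption (one introduction per unit);
# the fractional values are the approximate rates quoted in the post.
IMPORTS_PER_UNIT = {
    "russian_singleton": 1.0,
    "russian_transmission_lineage": 1.0,
    "stem_derived_transmission_lineage": 0.33,
    "stem_derived_singleton": 0.14,
    "stem_cluster_sequence": 0.36,
}


def estimate_introductions(counts):
    """Return the sum of (number of units * imports per unit) over categories."""
    unknown = set(counts) - set(IMPORTS_PER_UNIT)
    if unknown:
        raise KeyError(f"unknown categories: {sorted(unknown)}")
    return sum(IMPORTS_PER_UNIT[cat] * n for cat, n in counts.items())


# HYPOTHETICAL counts, for illustration only (except the 9 Russian
# transmission lineages, which is the number given in the key findings).
example_counts = {
    "russian_singleton": 10,
    "russian_transmission_lineage": 9,
    "stem_derived_transmission_lineage": 6,
    "stem_derived_singleton": 50,
    "stem_cluster_sequence": 25,
}

total = estimate_introductions(example_counts)
print(f"Estimated introductions (toy counts): {total:.2f}")
```

Because the fractional rates are themselves estimated from the small subset of individuals with known travel history, the total inherits that uncertainty.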

trilisser · July 28, 2020, 9:53am

Good afternoon!
I would like to clarify a few points regarding the use of BD Skyline to estimate epidemiological parameters.
Unfortunately, the MCC tree does not show the posterior probabilities of the nodes. In practice, SARS-CoV-2 sequences are highly homogeneous, and if the nodes on the tree lack sufficient support, the branching can be considered random, and so can the epidemiological parameters estimated with this model.

Also, your paper contains no data on the assessment of temporal signal in the sequences that would justify applying a molecular clock. 52 SARS-CoV-2 sequences may be too few to estimate divergence times.

Best regards,
Artem B.

g.bazykin · August 1, 2020, 11:18pm

Hello Artem,

On the MCC tree, we are primarily interested in the deep branchings, which are the ones discussed in the text. These internal nodes are supported by manual analysis of mutations and by the ML tree. In addition, we specifically analyzed the whole dataset and two of its subsets in order to date them. Nevertheless, we will consider displaying posterior probabilities on the nodes of the MCC tree; thank you for the suggestion.

As for the molecular clock, we used a strong prior on the clock rate. For this, we used the posterior estimate of this parameter from the analysis of the UK epidemic, which was obtained from a large dataset, as described in the Methods.

Best regards,
Georgii

trilisser · August 17, 2020, 9:20am

Apologies for the late reply.

As for the molecular clock, we used a strong prior on the clock rate. For this, we used the posterior estimate of this parameter from the analysis of the UK epidemic, which was obtained from a large dataset, as described in the Methods.

Thank you, I read the text inattentively and missed that.

On the MCC tree, we are primarily interested in the deep branchings, which are the ones discussed in the text. These internal nodes are supported by manual analysis of mutations and by the ML tree.

I was interested in the support for the later nodes in connection with the estimate of Re after the hospital was closed (after April 8). On the chronogram, this time interval falls precisely on the younger nodes.

Best regards,
Artem

g.bazykin · August 21, 2020, 6:42pm

Node support and the Re estimate are not related: Re is computed not on the MCC tree but across all possible trees.

trilisser · December 8, 2020, 1:51am

Node support on the MCC tree is likewise computed across all possible trees, and if the topology of the MCC tree does not converge to a single solution (that is, the whole tree space does not converge to a single solution, which in this case is indicated by low node support), then the Re estimate in the BDSKY model does not converge either and is driven largely by the priors.

In practice, even when all available genomes from Russia are used, clade support mostly hovers around zero, so there is some doubt as to whether ~50 sequences from a single institution can provide enough heterogeneity, and hence signal, to apply a complex multi-parameter model such as BDSKY.
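
To make the notion of node support concrete: the posterior support of a clade is the fraction of sampled trees that contain it. Below is a minimal Python sketch, assuming each posterior tree has already been reduced to a set of clades, with each clade a frozenset of tip names. The tip names and the three toy trees are hypothetical, and this is not how BEAST or TreeAnnotator implement the calculation.

```python
from collections import Counter


def clade_support(trees):
    """Return {clade: fraction of sampled trees that contain the clade}."""
    if not trees:
        raise ValueError("need at least one sampled tree")
    counts = Counter()
    for clades in trees:
        counts.update(set(clades))  # count each clade at most once per tree
    return {clade: n / len(trees) for clade, n in counts.items()}


# HYPOTHETICAL posterior sample of three rooted trees on tips A, B, C, D,
# each written as its set of non-trivial clades.
trees = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},  # (((A,B),C),D)
    {frozenset({"A", "B"}), frozenset({"C", "D"})},       # ((A,B),(C,D))
    {frozenset({"A", "C"}), frozenset({"A", "B", "C"})},  # (((A,C),B),D)
]

support = clade_support(trees)
for clade, value in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(clade), round(value, 2))
```

Support near zero for most clades means the sampled topologies largely disagree with one another; whether that also leaves Re poorly constrained is the question raised above.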