Zika Virus isolate nomenclature and annotation

jones · January 28, 2016, 5:35pm

A large number of Zika virus genome sequences are going to start appearing in GenBank, etc. Four new genomes of Brazilian isolates just appeared on GenBank, adding to the one Brazilian isolate from Dec 2015.

An ongoing challenge to anyone analyzing viral genomes is the limited annotation that often accompanies GenBank records, in terms of Collection Date, Collection Location, etc.

For example the four new Brazil isolates are annotated at the level of:

             /strain="BeH818995"
             /host="Homo sapiens"
             /country="Brazil"
             /collection_date="2015"

For many purposes this is just fine, but the more information that is provided the better.

I would ask the community to consider expanding the annotation as follows:

Collection date - provide the full date, or at least the month and year
Location - provide the geographic region within the country - state, province, nearest city
Patient - sex, age
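
A minimal sketch of how a script might flag records that fall short of these suggestions. It assumes the qualifiers are available as plain text lines like the ones shown above, that dates follow the usual GenBank patterns (YYYY, Mon-YYYY, DD-Mon-YYYY or their ISO equivalents), and that a region, when present, is written after a colon in /country. The function names are hypothetical and patient details are not checked.

```python
import re

# Matches one qualifier line such as:   /country="Brazil"
QUALIFIER_RE = re.compile(r'^\s*/(\w+)="([^"]*)"\s*$')

def parse_source_qualifiers(text):
    """Return a dict of qualifier name -> value from 'source' feature lines."""
    qualifiers = {}
    for line in text.splitlines():
        match = QUALIFIER_RE.match(line)
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return qualifiers

def date_precision(collection_date):
    """Classify a collection date as 'day', 'month', 'year' or 'unknown'."""
    value = collection_date.strip()
    if re.fullmatch(r"\d{2}-[A-Za-z]{3}-\d{4}|\d{4}-\d{2}-\d{2}", value):
        return "day"
    if re.fullmatch(r"[A-Za-z]{3}-\d{4}|\d{4}-\d{2}", value):
        return "month"
    if re.fullmatch(r"\d{4}", value):
        return "year"
    return "unknown"

def annotation_gaps(qualifiers):
    """List which of the suggested details are missing from a record."""
    gaps = []
    if date_precision(qualifiers.get("collection_date", "")) in ("year", "unknown"):
        gaps.append("collection_date lacks month")
    # Assumption: a region is given as "Country: region" in /country.
    if ":" not in qualifiers.get("country", ""):
        gaps.append("country lacks region")
    return gaps

record = '''
             /strain="BeH818995"
             /host="Homo sapiens"
             /country="Brazil"
             /collection_date="2015"
'''
print(annotation_gaps(parse_source_qualifiers(record)))
# ['collection_date lacks month', 'country lacks region']
```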

And adopting a nomenclature standard along the lines of those used with Flu and Ebola, among others, would be helpful.
For example: Ebola virus/H.sapiens-wt/SLE/2014/Makona-0106_C2_KT2315

This scheme is imperfect, but it is more informative than a lab’s internal sample ID.

One more thing: providing data prior to publication is invaluable in outbreaks like Ebola and now Zika, but there have been cases where the pre-publication data uses one naming scheme for samples and the published GenBank records use a different one. That means that an end-user has to match up records by sequence and then figure out which records refer to the same isolate. It would be great if we can avoid that with Zika data.
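
To give a sense of the matching chore, here is a rough sketch that pairs up records from two sets by exact sequence identity. The record names and sequences are made up for illustration, and the function names are hypothetical. It only catches identical sequences; if the published version was trimmed or corrected, an alignment-based comparison would be needed instead.

```python
import hashlib

def sequence_digest(sequence):
    """Hash a sequence after removing whitespace and normalising case."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()

def match_by_sequence(prepublication, published):
    """Pair record names whose sequences are identical.

    Both arguments are dicts mapping a record name to its sequence.
    Returns a list of (prepublication_name, published_name) tuples.
    """
    by_digest = {}
    for name, sequence in published.items():
        by_digest.setdefault(sequence_digest(sequence), []).append(name)

    pairs = []
    for name, sequence in prepublication.items():
        for published_name in by_digest.get(sequence_digest(sequence), []):
            pairs.append((name, published_name))
    return pairs

# Toy data: hypothetical names and sequences, not real records.
prepublication = {"lab_sample_17": "ACGTACGTAA", "lab_sample_18": "TTGACCA"}
published = {"RECORD_A": "acgt acgt aa", "RECORD_B": "GGGCCC"}

print(match_by_sequence(prepublication, published))
# [('lab_sample_17', 'RECORD_A')]
```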

thanks

–Rob Jones

PS Can the administrators set up a Zika virus category on the site ? I think we’ll be needing it…

system · January 31, 2016, 5:52pm

Hello Rob,

I think a standardized isolate naming scheme is a good idea. I was part of the collaborations that put together the Rotavirus, Human adenovirus, and Filovirus isolate naming schemes and would be happy to contribute.

It would also be great to set metadata standards along the lines of your suggestions. While this topic can be difficult and fraught with political and/or privacy issues, a grassroots effort to establish community-accepted standards would be a very good start and could possibly address some of the historical impediments.

If you or others would like to contact me directly please email me.

Thanks,
Rodney

J. Rodney Brister
[email protected]

Kristian_Andersen · January 31, 2016, 11:24pm

We need @kuhnjens on this…

system · February 1, 2016, 3:52pm

I spoke to Jens about this over the weekend. He is traveling, but I expect him to chime in at some point. Jens and I worked together on the filovirus isolate nomenclature, and as folks who have spent a lot of time thinking about the subject, we would suggest that the filovirus nomenclature represents the most current approach and would be a good jumping-off point for Zika virus.

Another consideration here: I do not think there are any standardized flavivirus isolate naming schemes out there, so it may make sense to look towards a scheme that would fit more than just Zika virus.

Thanks,
Rodney

system · February 5, 2016, 10:50pm

Sorry for getting into this discussion so late… as Rodney mentioned, I was travelling. I agree that the filovirus naming scheme would make the most sense. It is by now the most elaborate scheme out there and even has provisions for laboratory-adapted strains or cDNA clone-derived variants. The Ebola sequencing community has adopted it rapidly and almost all new 2013-present Ebola GenBank entries have adopted it. Furthermore, we are also using this scheme now for the bornaviruses, nyamiviruses, and all new, non-standard mononegaviruses. It’s overall easy - and we could think about a one-page article simply outlining it for Zika and then publishing it relatively high-profile (Lancet Infect Dis?) if you guys are interested.

Best,
Jens

jones · February 5, 2016, 11:19pm

I think a one page published specification would be a great reference for labs - but that can take a little time to come through, so I would combine that with direct outreach to the labs that are publishing data right now, and encourage them to update any pre-existing GenBank records.

I suspect that most submitters are not familiar with how to update existing records, so a short how-to might be helpful.

I do think we should also propose guidelines for annotating the GenBank/EMBL ‘source’ features (/collection_date etc). You can be more descriptive in that text than in the concise nomenclature string.

Do you want to write up the nomenclature string specification ? I’ll have a go at the ‘source’ annotation part.

–Rob

system · February 5, 2016, 11:35pm

Yeah, I can do that, probably on Tuesday. We should probably have a string of good author names on that note to give it the necessary “umph”…

jones · February 11, 2016, 12:18am

I have written up a draft spec for a naming scheme and annotation for Zika virus genomes in GenBank, ENA, etc. It’s not actually Zika specific.
Basically it’s spelling out what a lot of labs are doing right now, but sometimes it helps to have things spelled out…

It’s four pages long - so more of a reference spec than a one page proposal that we might get published, but it could serve as a starting point for that.

Please let me know what you think - any and all feedback is welcome

Unfortunately the web site does not allow uploads of Word documents so this is a PDF. If you would like Word format so you can mark it up then please email me at [email protected]

–Rob Jones

ZikaNomenclatureProposal_20160210.pdf (67.1 KB)

system · February 11, 2016, 3:28am

I was sent this by someone at NIH. You might want to get in touch.

Cheers,

Eddie

Hello,

I am a Scientist at the National Center for Biotechnology Information (NCBI). Our group is trying to provide support to research in response to the current Zika virus outbreak. We noticed that most of the submitted Zika virus sequences do not include annotation of the mature peptides, and we wanted to annotate these and make them available through our Virus Variation Resource (https://www.ncbi.nlm.nih.gov/genome/viruses/variation/), similar to what we have done for West Nile virus (https://www.ncbi.nlm.nih.gov/genomes/VirusVariation/Database/nph-select.cgi?cmd=database&taxid=11082)

To annotate Zika virus sequences, we first need to create the best reference genome possible - one that includes all mature peptides accurately annotated - so it can be used as a template. Once we develop a Zika virus reference genome, we will make it publicly available where it can be used as a template for the annotation of future Zika virus sequences. This should help standardize Zika virus sequence data and make it more accessible to users. Please let me know if you consider yourself familiar with Flavivirus genomics, or if you can let us know about other researchers who may be interested in helping us.

We also thought that it may be helpful to put together a working group to help establish isolate naming standards like we have done in the past for Human adenovirus, Rotavirus, and Ebolavirus. This group could also work on establishing metadata standards. Would any of you be interested in participating in such a group? Could you recommend others? It may make sense to extend these discussions to include all Flaviviruses.

Thank you for your help and please let us know if you have any questions.

Best regards,
Eneida

Eneida Hatcher, Ph.D.
Scientist (contractor), Viral Genomes Group
Information Engineering Branch, National Center for Biotechnology Information
National Library of Medicine, National Institutes of Health
[email protected]
(301) 435-7938

jones · February 11, 2016, 5:49pm

I’ve put a MS Word version on Google Drive which should be editable

https://drive.google.com/a/craic.com/file/d/0B6DooTPDLDFXUW5IZ2NqdloxS1U/view

jones · February 12, 2016, 6:50pm

I contacted Eneida Hatcher at NCBI after Eddie’s post.

She just sent this email about their Zika efforts :

Hello,

In expectation of several large sequencing efforts, we are currently trying to update/develop NCBI Zika virus resources.

To that end, we have updated the Zika virus reference sequence, and it now includes predicted mature peptide positions and standardized protein/peptide names (http://www.ncbi.nlm.nih.gov/nuccore/NC_012532.1). This should provide an annotation standard for these features and be useful for annotation pipelines.

We are also building a specialized Zika virus module in our Virus Variation Resource, similar to what we have done for West Nile virus (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/Database/nph-select.cgi?cmd=database&taxid=11082) and Dengue virus (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/Database/nph-select.cgi?cmd=database&taxid=12637). These resources include specialized databases that store metadata that has been parsed from sequence submissions and mapped to standardized terms. This allows users to more easily find sequences of interest based on metadata terms. Typically this metadata includes country of isolation and host, and we were wondering if other types of metadata are important to the field. We are also interested in any comments you have regarding the functionality of these resources.

Finally, we are interested in helping to develop standards for GenBank, BioSample, and SRA submission. The goal is to improve the usability of Zika virus sequence data, and we would like to better understand what metadata is important to the community.

I appreciate your time and any ideas you want to offer, and I would be grateful if you want to pass this along to anyone who might be interested.

Sincerely,
Eneida

Eneida Hatcher, Ph.D.
Scientist (contractor), Viral Genomes Group
Information Engineering Branch, National Center for Biotechnology Information
National Library of Medicine, National Institutes of Health
[email protected]
(301) 435-7938

jones · March 9, 2016, 6:31pm

The latest issue of Nature has a letter from Richard Scheuermann (Venter Inst) on behalf of the Viral Genome Annotation Standards Working Group - Rodney Brister (NCBI), Elliot Lefkowitz (U Alabama) and Philippe Le Mercier (SIB Switzerland):

Zika virus: designate standardized names

Unfortunately it is behind the paywall so I can’t tell you what they propose…

–Rob

n_j_loman · March 10, 2016, 12:06pm

It’s a very short letter but the relevant part is:

“Building on conventions in other viral fields, we urge the Zika community to adopt a standard nomenclature for isolate names, specifying the virus type (ZIKV), host species abbreviation, geographical location of isolation, unique identification string and year of isolation. The preferred isolate name for BeH818995, for example, would then be ZIKV/H. sapiens/Brazil/BeH818995/2015.”
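
A small sketch of assembling a name in the field order the letter describes (virus type, host abbreviation, location, unique ID, year). The function names are hypothetical, and the spacing in the host abbreviation simply reproduces the example as quoted above.

```python
def abbreviate_host(species):
    """'Homo sapiens' -> 'H. sapiens' (genus initial plus species epithet)."""
    genus, epithet = species.split()
    return f"{genus[0]}. {epithet}"

def isolate_name(virus, host_species, location, isolate_id, year):
    """Join the five fields with '/' in the order given in the letter."""
    fields = [virus, abbreviate_host(host_species), location, isolate_id, str(year)]
    return "/".join(fields)

print(isolate_name("ZIKV", "Homo sapiens", "Brazil", "BeH818995", 2015))
# ZIKV/H. sapiens/Brazil/BeH818995/2015
```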

arambaut · March 10, 2016, 3:50pm

Looks good to me. Would be better without spaces: Hsapiens or H_sapiens. Also, the strain/sample ID should be the second field after the virus name (because it is the most canonical bit of data). Fields should be mandatory, with ‘?’ or UNK to denote missing data. Country code should be ‘laboratory’ or some such when not isolated in a host.

This of course should be the standard for all viruses of any type.

n_j_loman · March 11, 2016, 2:32pm

Yeah agree no spaces in FASTA header, breaks lots of scripts.
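
To illustrate the breakage: many tools take the sequence ID to be everything up to the first whitespace in the header line. The sketch below mimics that behaviour (fasta_id and remove_spaces are hypothetical helpers, not taken from any particular tool) and shows one way to keep the name intact, simply dropping the whitespace.

```python
def fasta_id(header_line):
    """Return the ID as many scripts see it: text up to the first whitespace."""
    text = header_line[1:] if header_line.startswith(">") else header_line
    return text.split(None, 1)[0]

def remove_spaces(name):
    """Drop all whitespace so the whole name survives as the ID."""
    return "".join(name.split())

name = "ZIKV/H. sapiens/Brazil/BeH818995/2015"

print(fasta_id(">" + name))
# ZIKV/H.
print(fasta_id(">" + remove_spaces(name)))
# ZIKV/H.sapiens/Brazil/BeH818995/2015
```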

system · March 11, 2016, 6:22pm

I am using the nomenclature Nick posted in my manuscript: ZIKV/H. sapiens/Brazil/BeH818995/2015

I replaced the period with an underscore and I use UNK when I don’t know, yep.

arambaut · March 11, 2016, 6:30pm

Great. But think about putting the sample ID as the first field after the ZIKV. The reason is that people tend to mess up the other fields (leaving them out etc). Having the canonical ID as the second field means you can always parse it.
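
A rough sketch of why that helps a parser. With the ID in second position it can still be recovered when later fields are dropped. The order of the fields after the ID (host, location, year) is an assumption here, as is padding missing fields with UNK; the function name is hypothetical.

```python
# Assumed field order: the sample ID sits directly after the virus name.
FIELDS = ("virus", "sample_id", "host", "location", "year")

def parse_isolate_name(name):
    """Split a '/'-delimited name, filling absent or empty fields with 'UNK'."""
    parts = name.strip().split("/")[:len(FIELDS)]
    parts += ["UNK"] * (len(FIELDS) - len(parts))
    cleaned = [part if part and part != "?" else "UNK" for part in parts]
    return dict(zip(FIELDS, cleaned))

full = parse_isolate_name("ZIKV/BeH818995/H_sapiens/Brazil/2015")
truncated = parse_isolate_name("ZIKV/BeH818995")

print(full["sample_id"], full["year"])
# BeH818995 2015
print(truncated["sample_id"], truncated["year"])
# BeH818995 UNK
```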

system · March 11, 2016, 6:32pm

My Bad - yes I did do that - example: EU545988_Zikavirus/H_sapiens/MICRONESIA/2007-06-21/ECMN2007

system · March 11, 2016, 6:39pm

There was a reason I did that - we were collaborating with USAMRIID and that was their request per our previous AFRIMS/WRAIR/USAMRIID sequencing submissions of Zika. Accessions KU681081 and KU681082

system · March 12, 2016, 12:05am

Hello all,

Some of you may have been contacted by Eneida Hatcher from my group already, but I thought it would be worth posting that our Zika virus resource is now available for testing:

http://preview.ncbi.nlm.nih.gov/genomes/VirusVariation/Database/nph-select.cgi

This is a beta release, and there are a couple of improvements that will be made in the next 10 days. That said, we wanted to get this out to the community as quickly as possible. We only ask that you please document any suggestions to improve the resource.

There are several improvements that should be released within the next couple of weeks. These include the option to download only a selected sequence region from a larger GenBank sequence, the naming of partial proteins by our de novo annotation pipeline, and the ability to search by authors.

On another note, having participated in several isolate naming efforts over the past few years, I can say there is always a give and take between machine-parsable and human-readable formats. IMHO, if you are parsing metadata from deflines, you are better off using a resource like ours, where you can download sequences with customized FASTA deflines that incorporate standardized metadata or simply download a table of standardized metadata that accompanies a sequence.

BTW, there is not supposed to be a space in H.sapiens or other species names. It looks like something got lost in the edits. The host field was conceived as it was for the filovirus isolate names.

Also, the rationale for the placement of sample ID was that the other fields in front of this are absolutely required for sequence submission to GenBank and should be available for use in the naming construct. This is the same approach that groups I worked with took towards a number of organisms - Human adenovirus, Rotavirus, Filovirus - and will be part of a basis for a universal scheme.

BTW, you can search for these IDs in our resource under “Additional filters” if they are part of the defline.

Reading through the comments, it looks like we can do some things on the submission side to enable easier access to lab IDs and other unique IDs. One of the problems is there is no specific field in GenBank records to accommodate this, but there is in BioSample…

Thanks,
Rodney