Hello Andrew, thanks for your insights and use cases. It is always good to hear from people using data, and my group has been reaching out to folks to better understand their operations. Indeed, that was one reason behind my participation on this forum.
We use the data ourselves quite a bit, and sometimes we are concerned about horizons that others are not - at least not yet. For example, last fall we started creating pairwise comparisons between all “validated” viral genomes in our db - 80,000+ at this point. The goal is to better understand the sequence space and derive functional and representational models. At this scale, the concept of standardized isolate names (or standardized anything else;-) goes out the window.
BTW, we are happy to share this data if anyone is interested. Please just ask. (Two caveats: Flu is not included, and we only have tree files available on our public site, as the actual pairwise comparison tables are huge.)
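For a rough sense of why the pairwise comparison tables get so large, here is a back-of-the-envelope count. It assumes every genome is compared against every other exactly once and treats 80,000 as a floor; the 4 bytes per score is purely an illustrative assumption, not a description of how the tables are actually stored.

```python
import math

n_genomes = 80_000  # floor taken from the "80,000+" figure above

# Unordered pairs among n genomes: n * (n - 1) / 2
n_pairs = math.comb(n_genomes, 2)
print(f"{n_pairs:,} unordered pairs")  # 3,199,960,000 unordered pairs

# Assumption for illustration only: one 32-bit score per pair, nothing else.
bytes_per_score = 4
print(f"~{n_pairs * bytes_per_score / 1e9:.1f} GB for scores alone")
```

The count grows quadratically, so a tenfold increase in genomes means roughly a hundredfold increase in pairs.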
Anyway, our scope right now is only 2.5 million GenBank sequences, but ultimately we are designing systems for 25 million GenBank sequences and perhaps almost as many short read datasets (both raw, unaligned data and aligned data). So from our perspective there really needs to be a fundamental change in the way data is stored and retrieved - one that is more relational and scalable.
Once we started thinking about all this, it became clear that we needed approaches that were not necessarily dependent on standardized formats, but rather approaches that were flexible enough to handle diverse formats and produce customized user experiences. This view calls for separating each “GenBank record” (or similar) into three elements - sequence, annotation, and metadata - and storing these elements in a way that allows a user to “build” an object that fits their specific requirements on demand.
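To make the three-element idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the in-memory dictionaries stand in for whatever storage is really used, the accession and field names are placeholders, and `build_record` is an invented helper, not part of any existing resource.

```python
# Hypothetical stores, one per element, all keyed by accession.
SEQUENCES = {"ACC0001": "ATGGCTAAAGGT"}
ANNOTATIONS = {
    "ACC0001": [
        {"type": "CDS", "start": 1, "end": 12, "product": "example protein"},
    ]
}
METADATA = {
    "ACC0001": {
        "host": "Homo sapiens",
        "country": "Brazil",
        "collection_date": "2016-01-15",
    }
}


def build_record(accession, want_sequence=True, feature_types=None,
                 metadata_fields=None):
    """Assemble only the pieces of a record that the caller asks for."""
    record = {"accession": accession}
    if want_sequence:
        record["sequence"] = SEQUENCES[accession]
    if feature_types is not None:
        # Keep only annotation features of the requested types.
        record["annotation"] = [
            f for f in ANNOTATIONS.get(accession, [])
            if f["type"] in feature_types
        ]
    if metadata_fields is not None:
        # Keep only the requested metadata fields that actually exist.
        meta = METADATA.get(accession, {})
        record["metadata"] = {k: meta[k] for k in metadata_fields if k in meta}
    return record


# One user wants only where and from what host the isolate came.
print(build_record("ACC0001", want_sequence=False,
                   metadata_fields=["host", "country"]))
```

Because the three elements are stored independently, a second user could ask for the sequence plus CDS features from the same accession without either request forcing a shared record format.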
So back to isolate names: One of the primary reasons we got involved in isolate naming schemes was to help bring attention to the importance of metadata. Getting communities to think about naming schemes is not just an exercise in “how should this look,” but also an exploration of what metadata is important to sequence data analysis in a given field.
While a “universal” isolate naming format is an admirable goal with obvious benefits, our interactions with user communities have underscored two things: there is often important metadata that cannot fit in an isolate name of reasonable length, and ultimately every user and every community will have slightly different preferences for what metadata is included in a “name” and how. Given this, the critical elements become getting all the important isolate metadata submitted with the sequence (as part of a GenBank record or, better, as a BioSample record) and providing tools that allow people to use this metadata in ways that best serve their needs.
That gets us back to the Virus Variation Resource and the stuff we are trying to do there. Ultimately, we want to process, store, and make available data in such a way that people can easily find the data they need and interact with it in the way they want. This includes deflines, and users can currently download all Zika or Dengue or Ebola virus data with customized deflines derived from standardized metadata. I am sure there are ways to improve this function, and we are hoping to get input from this and other communities towards this goal.
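As an illustration of deflines derived from metadata, here is a small Python sketch. It is not the Virus Variation Resource's actual implementation or output format: `make_defline`, the `|` separator, the `NA` placeholder, and the field names are all assumptions chosen for the example.

```python
def make_defline(accession, metadata, fields, sep="|", missing="NA"):
    """Build a FASTA defline from an accession plus user-chosen metadata fields."""
    parts = [accession]
    for name in fields:
        value = metadata.get(name) or missing
        # Replace spaces so the defline stays a single whitespace-free token.
        parts.append(str(value).replace(" ", "_"))
    return ">" + sep.join(parts)


# Placeholder metadata; the collection date is deliberately absent.
meta = {"host": "Homo sapiens", "country": "Brazil"}

print(make_defline("ACC0001", meta, ["country", "host", "collection_date"]))
# >ACC0001|Brazil|Homo_sapiens|NA
```

The point of the design is that the field list and order belong to the user, so two communities can produce different “names” from the same standardized metadata.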
I should emphasize that these are early days, and much work remains, but we are pushing forward and hope to have all viruses loaded into our Virus Variation construct over the next year and to improve features within the resource. The input we receive now will likely help shape a number of things to come as we continue to scale.
Sorry for the long response. Hope it is useful.