Our team at the Pathogen and Microbiome Institute at Northern Arizona University published a software announcement this week on genome-sampler, a tool that we developed to support our analysis of SARS-CoV-2 genomes isolated from patients in Arizona in the context of SARS-CoV-2 genomes from GISAID. Our paper is currently in open peer review at F1000Research here, and the software is open source and free for all use (see the GitHub repo for source code and the website for installation and usage instructions).
Our goal with genome-sampler is to provide an easy-to-use, easy-to-install utility that samples context genomes (e.g., GISAID) across time of genome isolation, geographic location of genome isolation, and genome diversity, while ensuring that the context sequences that are the nearest neighbors of the focal sequences (e.g., those from your local region) are represented in the resulting sample. Given the large number of SARS-CoV-2 genomes that are available, and the rate at which new genomes are becoming available, this type of sampling will be essential for supporting reliable and efficient downstream analysis.
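To make the idea concrete, here is a minimal Python sketch of one of these steps: sampling context genomes evenly across time, then combining that sample with the nearest neighbors of the focal sequences. This is not genome-sampler's implementation or API. The function names, the `per_bin` and `bin_days` parameters, and the example IDs and dates are all hypothetical, and the nearest-neighbor set is assumed to have been computed elsewhere.

```python
import random
from collections import defaultdict
from datetime import date


def sample_across_time(context, per_bin=1, bin_days=7, seed=0):
    """Pick up to `per_bin` context genomes from each time window.

    `context` maps a sequence ID to its isolation date. Windows are
    `bin_days` long and start at the earliest date. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    start = min(context.values())
    bins = defaultdict(list)
    for seq_id, isolated in sorted(context.items()):
        bins[(isolated - start).days // bin_days].append(seq_id)
    selected = set()
    for _, ids in sorted(bins.items()):
        selected.update(rng.sample(ids, min(per_bin, len(ids))))
    return selected


def combine_samples(*samples):
    """Union of the ID sets produced by independent sampling steps."""
    return set().union(*samples)


# Made-up context metadata for illustration only.
context = {
    "c1": date(2020, 1, 1),
    "c2": date(2020, 1, 2),
    "c3": date(2020, 1, 3),
    "c4": date(2020, 1, 20),
    "c5": date(2020, 2, 15),
}
temporal = sample_across_time(context, per_bin=1, bin_days=7)
neighbors = {"c2"}  # assumed: nearest neighbors of the focal sequences
final_sample = combine_samples(temporal, neighbors)
```

Because each step just returns a set of IDs, a step can be dropped or swapped for a random draw without touching the others, which is the kind of comparison the modular design is meant to support.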
genome-sampler is modular in design to support custom workflows and benchmarking. For example, any of the sampling steps mentioned above could be skipped or replaced with a random sampling approach to explore its impact on downstream results.
genome-sampler also helps with filtering noisy sequence data when loading FASTA files, so if you’re struggling with non-IUPAC characters or other issues when obtaining data from public sources, it should make analysis easier. We’re now starting work on some benchmarks of this, and plan to post an updated version of our paper when those are ready.
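For anyone unsure what a non-IUPAC problem looks like in practice, the following Python sketch reads FASTA records and separates sequences that contain characters outside the IUPAC nucleotide alphabet. It is only an illustration, not how genome-sampler does it: the helper names are hypothetical, and treating `-` and `.` as acceptable gap characters and rejecting whole sequences (rather than repairing them) are assumptions made here.

```python
# IUPAC nucleotide codes; gap characters are an assumption of this sketch.
IUPAC_DNA = set("ACGTRYSWKMBDHVN")
GAP_CHARS = set("-.")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_iupac(records, allowed=IUPAC_DNA | GAP_CHARS):
    """Split records into those with only allowed characters and the rest.

    Rejected entries carry the sorted list of offending characters so the
    problem can be reported rather than silently dropped.
    """
    kept, rejected = [], []
    for header, seq in records:
        bad = set(seq.upper()) - allowed
        if bad:
            rejected.append((header, sorted(bad)))
        else:
            kept.append((header, seq))
    return kept, rejected


# Made-up records for illustration only.
fasta_lines = [">a", "ACGT", "NNRY", ">b", "ACG?X", ">c", "acgt-"]
kept, rejected = filter_iupac(read_fasta(fasta_lines))
```

With a real file, `read_fasta` can be given an open file handle directly, since it only iterates over lines.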
We’d love to get feedback on what we can do to make this more useful to the community. Comments can be added to the paper at F1000, and we’re also happy to discuss on the project’s issue tracker or here. If you have tech support questions, please post those to the QIIME 2 Forum.