SNVPhyl: a single nucleotide variant phylogenomics pipeline for microbial genomic epidemiology

Aaron Petkau; Philip Mabon; Cameron Sieffert; Natalie C. Knox; Jennifer Cabral; Mariam Iskander; Mark Iskander; Kelly Weedmark; Rahat Zaheer; Lee S. Katz; Celine Nadon; Aleisha Reimer; Eduardo Taboada; Robert G. Beiko; William Hsiao; Fiona Brinkman; Morag Graham; Gary Van Domselaar

doi:10.1099/mgen.0.000116

Volume 3, Issue 6

Open Access

Aaron Petkau¹, Philip Mabon¹, Cameron Sieffert¹, Natalie C. Knox¹, Jennifer Cabral¹, Mariam Iskander², Mark Iskander², Kelly Weedmark³, Rahat Zaheer⁴, Lee S. Katz⁵, Celine Nadon¹, Aleisha Reimer¹, Eduardo Taboada¹, Robert G. Beiko⁶, William Hsiao⁷, Fiona Brinkman⁸, Morag Graham¹ and Gary Van Domselaar¹

Affiliations:
¹ National Microbiology Laboratory, Public Health Agency of Canada, Winnipeg, MB R3E 3R2, Canada
² University of Manitoba, Winnipeg, MB R3T 2N2, Canada
³ Health Canada – Bureau of Microbial Hazards, Ottawa, ON K1A 0K9, Canada
⁴ Lethbridge Research and Development Centre, Lethbridge, AB T1J 4B1, Canada
⁵ Centers for Disease Control and Prevention, Atlanta, GA 30333, USA
⁶ Dalhousie University, Halifax, NS B3H 4R2, Canada
⁷ BC Public Health Microbiology and Reference Laboratory, Vancouver, BC V5Z 4R4, Canada
⁸ Simon Fraser University, Burnaby, BC V5A 1S6, Canada
*Correspondence: Gary Van Domselaar [email protected]
Published: 08 June 2017 https://doi.org/10.1099/mgen.0.000116

Abstract

The recent widespread application of whole-genome sequencing (WGS) for microbial disease investigations has spurred the development of new bioinformatics tools, including a notable proliferation of phylogenomics pipelines designed for infectious disease surveillance and outbreak investigation. Moving WGS data out of the research laboratory and into the front lines of surveillance and outbreak response requires user-friendly, reproducible and scalable pipelines that have been well validated. Single Nucleotide Variant Phylogenomics (SNVPhyl) is a bioinformatics pipeline for identifying high-quality single-nucleotide variants (SNVs) and constructing a whole-genome phylogeny from a collection of WGS reads and a reference genome. Individual pipeline components are integrated into the Galaxy bioinformatics framework, enabling data analysis in a user-friendly, reproducible and scalable environment. We show that SNVPhyl can detect SNVs with high sensitivity and specificity, and can identify and remove regions of high SNV density (indicative of recombination). SNVPhyl correctly distinguishes outbreak from non-outbreak isolates across a range of variant-calling settings and sequencing-coverage thresholds, and in the presence of contamination. SNVPhyl is available as a Galaxy workflow, as Docker and virtual machine images, and as a Unix-based command-line application. SNVPhyl is released under the Apache 2.0 license and is available at http://snvphyl.readthedocs.io/ or at https://github.com/phac-nml/snvphyl-galaxy.

Received: 06 February 2017
Accepted: 12 April 2017
Published Online: 08 June 2017

Keywords: bacterial genomics, bioinformatics, genomic epidemiology, infectious disease surveillance, phylogenomics, single nucleotide variation detection

This is an open access article under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.


M Gen 3, e000116 (2017); https://doi.org/10.1099/mgen.0.000116
