Exploring SNP filtering strategies: the influence of strict vs soft core.

Microbial Genomics (IF 4; Zone 2, Biology; JCR Q1, Genetics & Heredity). Pub Date: 2025-01-01. DOI: 10.1099/mgen.0.001346
Mona L Taouk, Leo A Featherstone, George Taiaroa, Torsten Seemann, Danielle J Ingle, Timothy P Stinear, Ryan R Wick
Microbial Genomics 11(1). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11734701/pdf/

Abstract

Phylogenetic analyses are crucial for understanding microbial evolution and infectious disease transmission. Bacterial phylogenies are often inferred from SNP alignments, with SNPs as the fundamental signal within these data. SNP alignments can be reduced to a 'strict core' by removing those sites that do not have data present in every sample. However, as sample size and genome diversity increase, a strict core can shrink markedly, discarding potentially informative data. Here, we propose and provide evidence to support the use of a 'soft core' that tolerates some missing data, preserving more information for phylogenetic analysis. Using large datasets of Neisseria gonorrhoeae and Salmonella enterica serovar Typhi, we assess different core thresholds. Our results show that strict cores can drastically reduce informative sites compared to soft cores. In a 10 000-genome alignment of Salmonella enterica serovar Typhi, a 95% soft core yielded ten times more informative sites than a 100% strict core. Similar patterns were observed in N. gonorrhoeae. We further evaluated the accuracy of phylogenies built from strict- and soft-core alignments using datasets with strong temporal signals. Soft-core alignments generally outperformed strict cores in producing trees displaying clock-like behaviour; for instance, the N. gonorrhoeae 95% soft-core phylogeny had a root-to-tip regression R² of 0.50 compared to 0.21 for the strict-core phylogeny. This study suggests that soft-core strategies are preferable for large, diverse microbial datasets. To facilitate this, we developed Core-SNP-filter (https://github.com/rrwick/Core-SNP-filter), an open-source software tool for generating soft-core alignments from whole-genome alignments based on user-defined thresholds.
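
The distinction between a strict core and a soft core can be illustrated with a small sketch. The Python below is not the Core-SNP-filter implementation; it is a minimal illustration that assumes a site "has data" in a sample when the base is an unambiguous A, C, G or T, and that a site is kept when the fraction of samples with data meets a user-chosen threshold (1.0 for a strict core, e.g. 0.95 for a 95% soft core). The function names `core_columns` and `filter_alignment` and the toy alignment are hypothetical.

```python
from typing import Dict, List

# Assumption: only unambiguous bases count as "data present";
# gaps, N and other ambiguity codes count as missing.
VALID_BASES = set("ACGT")


def core_columns(alignment: Dict[str, str], threshold: float) -> List[int]:
    """Return indices of alignment columns in which the fraction of
    samples with an unambiguous base is at least `threshold`."""
    seqs = list(alignment.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    n_samples = len(seqs)
    keep = []
    for i in range(length):
        present = sum(1 for s in seqs if s[i].upper() in VALID_BASES)
        if present / n_samples >= threshold:
            keep.append(i)
    return keep


def filter_alignment(alignment: Dict[str, str], threshold: float) -> Dict[str, str]:
    """Keep only the columns that pass the core threshold."""
    keep = core_columns(alignment, threshold)
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Toy alignment: four samples, five sites, three sites with one missing call.
toy = {
    "s1": "ACGTA",
    "s2": "ACG-A",
    "s3": "AC-TA",
    "s4": "ANGTA",
}

strict = filter_alignment(toy, 1.0)   # strict core: data in every sample
soft = filter_alignment(toy, 0.75)    # soft core: data in >= 75% of samples
print(len(strict["s1"]), len(soft["s1"]))
```

In this toy case the strict core retains only the two sites with data in all four samples, while the 75% soft core retains all five, which mirrors on a small scale how a strict core shrinks as missing calls accumulate across samples.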

Source journal: Microbial Genomics (Medicine-Epidemiology)
CiteScore: 6.60
Self-citation rate: 2.60%
Articles published: 153
Review time: 12 weeks
Journal description: Microbial Genomics (MGen) is a fully open access, mandatory open data and peer-reviewed journal publishing high-profile original research on archaea, bacteria, microbial eukaryotes and viruses.