Machine learning reveals the dynamic importance of accessory sequences for Salmonella outbreak clustering.

mBio (IF 4.7; CAS Zone 1, Biology; JCR Q1, Microbiology) Pub Date: 2025-03-12 Epub Date: 2025-01-28 DOI: 10.1128/mbio.02650-24
Chao Chun Liu, William W L Hsiao
Citations: 0

Machine learning reveals the dynamic importance of accessory sequences for Salmonella outbreak clustering.

Bacterial typing at whole-genome scales is now feasible owing to decreasing costs in high-throughput sequencing and the recent advances in computation. The unprecedented resolution of whole-genome typing is achieved by genotyping the variable segments of bacterial genomes that can fluctuate significantly in gene content. However, due to the transient and hypervariable nature of many accessory elements, the value of the added resolution in outbreak investigations remains disputed. To assess the analytical value of bacterial accessory genomes in clustering epidemiologically related cases, we trained classifiers on a set of genomes collected from 24 Salmonella enterica outbreaks of food, animal, or environmental origin. The models demonstrated high precision and recall on unseen test data with near-perfect accuracy in classifying clonal and short-term outbreaks. Annotating the genomic features important for cluster classification revealed functional enrichment of molecular fingerprints in genes involved in membrane transportation, trafficking, and carbohydrate metabolism. Importantly, we discovered polymorphisms in mobile genetic elements (MGEs) and gain/loss of MGEs to be informative in defining outbreak clusters. To quantify the ability of MGE variations to cluster outbreak clones, we devised a reference-free tree-building algorithm inspired by colored de Bruijn graphs, which enabled topological comparisons between MGE and standard typing methods. Systematic evaluation of clustering MGEs on an unseen dataset of 34 Salmonella outbreaks yielded mixed results that exemplified the power of accessory sequence variations when core genomes of unrelated cases are insufficiently discriminatory, as well as the distortion of outbreak signals by microevolution events or the incomplete assembly of MGEs.
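
The abstract names a reference-free comparison inspired by colored de Bruijn graphs but does not give the algorithm. As a rough illustration of the general idea only, and not the authors' method, the sketch below "colors" each k-mer with the set of isolates that contain it and derives pairwise Jaccard distances from those colors. The sequences, the isolate names, the choice of k = 4, and every function name are assumptions made up for this example.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of k-mers (length-k substrings) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def color_kmers(genomes, k):
    """Map each k-mer to its 'colors': the names of the genomes containing it."""
    colors = {}
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            colors.setdefault(kmer, set()).add(name)
    return colors


def jaccard_distances(genomes, k):
    """Pairwise Jaccard distance between genomes, computed from k-mer colors."""
    colors = color_kmers(genomes, k)
    names = sorted(genomes)
    shared = {pair: 0 for pair in combinations(names, 2)}
    total = {name: 0 for name in names}
    for members in colors.values():
        for name in members:
            total[name] += 1
        # Every pair of genomes sharing this k-mer gains one shared k-mer.
        for pair in combinations(sorted(members), 2):
            shared[pair] += 1
    dist = {}
    for (a, b), inter in shared.items():
        union = total[a] + total[b] - inter
        dist[(a, b)] = 1.0 - inter / union if union else 0.0
    return dist


# Hypothetical toy sequences, not real Salmonella data.
genomes = {
    "isolate_A": "ACGTACGTGA",
    "isolate_B": "ACGTACGTGC",
    "isolate_C": "TTGCAAGGCT",
}
distances = jaccard_distances(genomes, k=4)
closest = min(distances, key=distances.get)
```

A distance matrix of this kind could be passed to a standard hierarchical clustering or neighbor-joining step to obtain a tree, which could then be compared topologically with trees from other typing methods. Real assemblies would also need reverse-complement handling and a much larger k, both omitted here.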

Importance: Gene-by-gene typing is widely used to detect clusters of foodborne illnesses that share a common origin. It remains actively debated whether the inclusion of accessory sequences in bacterial typing schema is informative or deleterious for cluster definitions in outbreak investigations due to the potential confounding effects of horizontal gene transfer. By training machine learning models on a curated set of historical Salmonella outbreaks, we revealed an enriched presence of outbreak distinguishing features in a wide range of mobile genetic elements. Systematic comparison of the efficacy of clustering different accessory elements against standard sequence typing methods led to our cataloging of scenarios where accessory sequence variations were beneficial and uninformative to resolving outbreak clusters. The presented work underscores the complexity of the molecular trends in enteric outbreaks and seeks to inspire novel computational ways to exploit whole-genome sequencing data in enteric disease surveillance and management.

Source journal: mBio (Microbiology)
CiteScore: 10.50
Self-citation rate: 3.10%
Articles published: 762
Review time: 1 month
About the journal: mBio® is ASM's first broad-scope, online-only, open access journal. mBio offers streamlined review and publication of the best research in microbiology and allied fields.