PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang
arXiv - QuanBio - Genomics, published 2024-06-19
https://doi.org/arxiv-2406.13133

Abstract

Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, and is crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intensive and rely on extensive reference databases, and they often fail to detect novel pathogens because of their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. To address these challenges, we introduce PathoLM, a pathogen language model optimized for identifying pathogenicity in bacterial and viral sequences. By leveraging pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive dataset comprising approximately 30 species of viruses and bacteria, including the ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models such as DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.
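As a rough illustration of the input side of such a model, the sketch below splits a DNA sequence into non-overlapping 6-mers, the token granularity used by the Nucleotide Transformer. The function name `kmer_tokenize`, the treatment of leftover bases as single-nucleotide tokens, and the example sequence are assumptions made for illustration only; none of this is taken from the PathoLM code, and the real tokenizer may differ in its details.

```python
from typing import List

# Assumption: 6-mer tokens, mirroring the Nucleotide Transformer vocabulary.
K = 6


def kmer_tokenize(seq: str, k: int = K) -> List[str]:
    """Split a DNA sequence into non-overlapping k-mers, left to right.

    Trailing bases that do not fill a whole k-mer are emitted as
    single-nucleotide tokens (a simplifying assumption of this sketch).
    """
    seq = seq.upper()
    tokens: List[str] = []
    i = 0
    # Consume full k-mers while enough bases remain.
    while i + k <= len(seq):
        tokens.append(seq[i:i + k])
        i += k
    # Emit any leftover bases one at a time.
    tokens.extend(seq[i:])
    return tokens


if __name__ == "__main__":
    # Hypothetical 10-base fragment: one full 6-mer plus four leftover bases.
    print(kmer_tokenize("ACGTACGTAC"))
```

In a fine-tuning setup of the kind the abstract describes, token sequences like these would be mapped to integer IDs and paired with a pathogenic or non-pathogenic label before being passed to the pre-trained model with a classification head.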