PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model
Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang
arXiv - QuanBio - Genomics, 2024-06-19, arxiv-2406.13133
Abstract
Pathogen identification is pivotal in diagnosing, treating, and preventing
diseases, crucial for controlling infections and safeguarding public health.
Traditional alignment-based methods, though widely used, are computationally
intensive and rely on extensive reference databases, and they often fail to
detect novel pathogens because of their low sensitivity and specificity. Similarly,
conventional machine learning techniques, while promising, require large
annotated datasets and extensive feature engineering and are prone to
overfitting. To address these challenges, we introduce PathoLM, a
pathogen language model optimized for identifying pathogenicity in
bacterial and viral sequences. Leveraging the strengths of pre-trained DNA
models such as the Nucleotide Transformer, PathoLM requires minimal data for
fine-tuning, thereby enhancing pathogen detection capabilities. It effectively
captures a broader genomic context, significantly improving the identification
of novel and divergent pathogens. We developed a comprehensive dataset
comprising approximately 30 species of viruses and bacteria, including the ESKAPEE
pathogens, seven notably virulent, antibiotic-resistant bacteria.
Additionally, we curated a species classification dataset centered specifically
on the ESKAPEE group. In comparative assessments, PathoLM substantially
outperforms existing models such as DciPatho, demonstrating robust zero-shot and
few-shot capabilities. Furthermore, we extended PathoLM to ESKAPEE species
classification as PathoLM-Sp, which outperformed other advanced deep learning
methods despite the complexity of the task.