He Wang, Yikun Zhang, Jie Chen, Jian Zhan, Yaoqi Zhou
RNA language models (LMs) are increasingly applied to RNA structure and function analysis, yet their intrinsic representational capacities remain poorly characterized. Here, we present a standardized zero-shot evaluation of 21 RNA LMs, with representative DNA LMs included as reference controls. Three complementary tasks (attention-based RNA secondary structure prediction, embedding-based RNA classification, and mutational fitness estimation from sequence likelihoods) are evaluated without downstream fine-tuning. Our results reveal substantial variability across models and clear trade-offs between structural, functional, and evolutionary representations. RNA-specific, noncoding RNA-enriched pretraining is crucial for capturing structural information, while evolutionary signals from multiple sequence alignments substantially boost performance. Although model scaling yields gains, architectural and objective choices critically influence performance across task categories. Together, this study provides a foundational benchmark, highlights inherent challenges in learning unified RNA representations, and offers insights for developing next-generation RNA foundation models.
Zero-shot benchmarking of RNA language models in structural, functional, and evolutionary learning. He Wang, Yikun Zhang, Jie Chen, Jian Zhan, Yaoqi Zhou. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag098. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12963973/pdf/
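The mutational fitness task in the RNA LM benchmark above scores variants from sequence likelihoods. The abstract does not state the exact scoring rule, so the following is only a minimal sketch of one widely used zero-shot choice, the log-likelihood ratio between mutant and wild-type nucleotides. The function name `llr_score` and the toy probability table are hypothetical; in practice the per-position probabilities would come from a language model.

```python
import math


def llr_score(probs, wild_type, mutant):
    """Sum log p(mutant nt) - log p(wild-type nt) over the mutated positions.

    probs: one dict per position, mapping nucleotide -> probability.
           These are assumed to be model outputs; toy values are used below.
    """
    if not (len(probs) == len(wild_type) == len(mutant)):
        raise ValueError("probs, wild_type and mutant must have equal length")
    score = 0.0
    for pos, (wt, mt) in enumerate(zip(wild_type, mutant)):
        if wt != mt:  # unchanged positions contribute nothing
            score += math.log(probs[pos][mt]) - math.log(probs[pos][wt])
    return score


# Hypothetical per-position probabilities for a 3-nt sequence.
toy_probs = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "U": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25},
    {"A": 0.10, "C": 0.10, "G": 0.60, "U": 0.20},
]

if __name__ == "__main__":
    # A negative score means the model finds the mutant less likely.
    print(llr_score(toy_probs, "ACG", "CCG"))
```

Under this rule a score below zero marks a substitution the model considers less plausible than the wild type, which is what gets compared against measured fitness.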
Large language models (LLMs) are evolving from passive predictors into agentic systems capable of planning, tool-use, and multimodal reasoning. This shift is especially consequential for biology, where complex, noisy, and multi-scale data require adaptive and integrative computational strategies. In this review, we provide the first systematic synthesis of LLM-based agents across genomics, molecular biology, imaging, biomedical analysis, and automated bioinformatics workflows. We analyze >60 emerging systems and organize them within a unifying framework that characterizes agentic traits, such as autonomous decision-making, external tool invocation, memory, and self-correction. Across domains, agentic LLMs show early promise in enabling multi-step analysis, linking heterogeneous evidence, and supporting exploratory scientific tasks. At the same time, our comparative assessment highlights consistent challenges, including unstable reasoning, limited biological grounding, retrieval misalignment, and barriers to reproducibility and biosafety. We conclude by outlining opportunities for trustworthy and collaborative biological agents, including multimodal integration, closed-loop experimental design, and robust evaluation practices. This survey aims to clarify the emerging landscape and chart a path toward reliable agentic systems for biological discovery.
Large language model agents for biological intelligence across genomics, proteomics, spatial biology, and biomedicine. Sajib Acharjee Dip, Dipanwita Mallick, Uddip Acharjee Shuvo, Shovito Barua Soumma, Fazle Rafsani, Bikash Kumar Paul, Nazifa Ahmed Moumi, Shafayat Ahmed, Liqing Zhang. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag110
Imaging-based spatial transcriptomics (ST) technologies offer unparalleled resolution for mapping gene expression within intact tissues but are fundamentally constrained by the limited size of their gene panels. This restriction hinders comprehensive biological discovery by omitting potentially crucial genes from analysis. To overcome this limitation, we introduce STGNET, a deep learning framework that extends gene panel coverage by integrating generative adversarial networks (GANs) with graph neural networks. STGNET employs a multi-stage GAN to learn the global transcriptomic distribution from single-cell RNA sequencing data, followed by a spatially aware graph convolutional network that refines imputations by modeling both physical cell proximity and transcriptional similarity. We rigorously benchmarked STGNET against seven state-of-the-art methods across nine diverse ST datasets. STGNET consistently achieved superior performance, demonstrating enhanced accuracy in gene imputation and exceptional preservation of cellular topology. We further showcase its biological utility by accurately reconstructing developmental marker patterns in mouse embryogenesis, revealing a novel transitional cell state in breast cancer progression, and uncovering extensive, previously obscured cell-cell communication networks in the mouse brain. STGNET provides a powerful and robust solution for unlocking the full potential of targeted ST assays, thereby enabling deeper and more comprehensive spatial biology. STGNET is freely accessible at https://github.com/wuyuanwuhuii/STGNET.
STGNET: extending panel coverage in imaging-based spatial transcriptomics using deep generative adversarial networks. Tao Wang, Bingtao Wang, Han Shu, Peimeng Zhen, Jialu Hu, Yongtian Wang, Jiajie Peng, Xuequn Shang, Zhiyuan Wu, Bing Xiao, Jing Chen. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag122. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13011812/pdf/
The growing threat of antimicrobial resistance (AMR) necessitates the rapid discovery of novel antimicrobial peptides (AMPs) as alternative therapeutics. However, most computational approaches rely on binary AMP versus non-AMP classification or on permissive minimum inhibitory concentration (MIC) thresholds (e.g. ≤128 μg/mL), offering limited biological interpretability and translational value. We present CVAE-BIO, a biochemical-knowledge-driven, multi-module pipeline for the discovery of AMPs targeting drug-resistant Escherichia coli, used here as a model pathogen, that is generalisable to other bacterial targets. The model integrates a conditional variational autoencoder (CVAE) constrained by key biochemical properties (MIC ≤10 μg/mL, net charge > +2, peptide length <40 residues, instability index <40, and Boman index <0) with a Random Forest classifier trained on 30 biochemical descriptors. In vitro validation showed that 18.5% of generated peptides exhibited strong activity (MIC ≤10 μg/mL), with 38.9% reaching MIC ≤50 μg/mL while maintaining key biochemical properties. Most validated novel peptides are narrow-spectrum AMPs targeting E. coli. Wet-lab results also showed that highly active cationic-amphipathic AMPs are characterized by significantly low counts of tiny and small residues, suggesting that avoiding these residues, or limiting them to a maximum of 2 and 3, respectively, might improve AMP activity. Taking both antimicrobial activity and hemolytic toxicity into account, 9 peptides were identified as non-toxic and active AMP candidates. This explainable framework enables efficient AMP discovery under biochemical constraints and yields experimentally validated candidates with translational potential.
Biochemical-knowledge-driven machine learning pipeline for generating potent antimicrobial peptides. Deliang Yang, Yifan Li, Chenxi Li, Qingpeng Zhang, Jiandong Huang, Xue Li, Peng Gao. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag115. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12998437/pdf/
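The biochemical constraints listed in the CVAE-BIO abstract (MIC ≤10 μg/mL, net charge > +2, length <40 residues, instability index <40, Boman index <0) can be read as a simple filter over candidate peptides. The sketch below applies only those thresholds; it is not the authors' pipeline. The class `PeptideProfile`, the function `meets_constraints`, and the example sequences and descriptor values are all hypothetical, and the descriptors are assumed to have been computed or predicted elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideProfile:
    """Precomputed descriptors for one candidate (hypothetical container)."""
    sequence: str
    predicted_mic_ug_ml: float  # predicted MIC in ug/mL
    net_charge: float
    instability_index: float
    boman_index: float


def meets_constraints(p: PeptideProfile) -> bool:
    """Apply the five thresholds quoted in the abstract."""
    return (
        p.predicted_mic_ug_ml <= 10
        and p.net_charge > 2
        and len(p.sequence) < 40
        and p.instability_index < 40
        and p.boman_index < 0
    )


# Made-up candidates for illustration only; not from the paper.
candidates = [
    PeptideProfile("KKLLKKLLKKLL", 8.0, 6.0, 25.0, -0.5),
    PeptideProfile("DDEEGGSSDDEE", 64.0, -8.0, 55.0, 3.1),
    PeptideProfile("KRLLKRLLKRLL", 9.5, 6.0, 41.0, -0.2),  # fails instability
]

if __name__ == "__main__":
    kept = [c.sequence for c in candidates if meets_constraints(c)]
    print(kept)
```

Keeping the thresholds in one predicate makes it straightforward to tighten or relax a single property when moving to a different bacterial target.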
Supervisory signals are intriguingly high in even simple features for predicting anticancer effect of antibody drug conjugates. Sunil Nagpal. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag108. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981646/pdf/
Accurately predicting the structures of RNA-protein complexes remains a major challenge. Recently, machine learning-based methods such as AlphaFold3 and RoseTTAFoldNA have been proposed. However, most conventional approaches rely on docking simulations to generate candidate structures, from which the accurate ones are then identified using various methods. This study presents a method that integrates specialized molecular dynamics simulations and machine learning (ML) techniques to identify the correct structure among many docking poses. First, steered molecular dynamics simulations are performed to estimate the stability of the candidate structures. The simulation data then serve as the training data for an ML model, which classifies the candidates as either correct or incorrect. Next, the candidates predicted as correct are narrowed down using thermodynamic simulations and ML methods. The results indicate that candidate structures could be classified as correct or incorrect with an accuracy of 0.934 in the RNA-protein docking simulation results. Additionally, we used AlphaFold3 to predict 15 RNA-protein complexes that Zou's group categorized as difficult, medium, or easy. Subsequently, our method classified these binding structures as correct or incorrect, with accuracies of 0.80, 0.92, and 0.96, respectively. Thus, our method is effective for accurately predicting the structures of RNA-protein complexes.
Differentiation of RNA-protein docking structures through molecular dynamics simulation and machine learning methods. Bui Tien Thanh, Yoichi Kurumida, Kaito Kobayashi, Michiaki Hamada, Tomoshi Kameda. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag109. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12991047/pdf/
Single-cell Hi-C (scHi-C) provides unprecedented insight into 3D genome organization, but its sparse and noisy data pose challenges in accurately detecting A/B compartments, which are crucial for understanding chromatin structure and gene regulation. We present scDIAGRAM, a data-driven method for annotating A/B compartments in single cells using direct statistical modeling and graph community detection. Unlike existing approaches, scDIAGRAM infers chromatin compartments directly from individual scHi-C matrices without imputation or external reference features, and subsequently assigns A/B labels using conventional genomic annotations. The accuracy and robustness of scDIAGRAM are illustrated using simulated scHi-C datasets and a human cell line. We applied scDIAGRAM to real scHi-C datasets from the mouse brain cortex, mouse embryonic development, and human acute myeloid leukemia, demonstrating its ability to capture compartmental shifts associated with transcriptional variation. This robust framework offers new insights into the functional roles of chromatin compartments at single-cell resolution across various biological contexts.
scDIAGRAM: detecting chromatin compartments from individual single-cell Hi-C matrix without imputation or reference features. Yongli Peng, Yujing Deng, Menghan Liu, Zhiyuan Liu, Ya-Hui Li, Xiang-Yu Zhao, Dong Xing, Jinzhu Jia, Hao Ge. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag096. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12967335/pdf/
Protein-protein interactions (PPIs) are central to cellular signaling and regulation, and their dysregulation underlies many diseases. Predicting the impact of mutations on PPI stability, quantified as ΔΔG, is essential for understanding disease mechanisms and guiding protein engineering. Here, we first present MutPPI, a graph-based deep-learning model that encodes full-residue structural features of protein-protein complexes and employs a shared GIN-GAT feature extractor for wild-type and mutant complexes. MutPPI outperforms 12 existing methods on an antibody-antigen single-point mutation dataset (S645). By integrating evolutionary information from protein language models, we further develop MutPPI-plus, achieving enhanced predictive performance. Second, we propose a mutation-path-based data augmentation strategy, which enriches input modalities and improves the generalization of both MutPPI and MutPPI-plus. After data augmentation, MutPPI-plus demonstrates state-of-the-art performance on S645 and three additional multi-point mutation datasets (SM_ZEMu, SM595, SM1124), substantially surpassing DDMut-PPI. Our analyses highlight the benefits of the multimodal framework and the physically informed data augmentation method. Together, these results provide a versatile computational tool for accurate ΔΔG prediction, advancing rational protein design.
MutPPI+: a multimodal framework for predicting mutation effects on protein-protein interactions via mutation-path-based data augmentation. Juntao Deng, Miao Gu, Pengyan Zhang, Tao Liu, Guansong Hu, Mingyu Dong, Yabin Zhang, Yizhen Song, Yunfan Zhang, Min Liu, Junzhang Tian, Weibin Cheng. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag105. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12967331/pdf/
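For readers unfamiliar with the quantity predicted in the MutPPI+ entry above, ΔΔG can be written out as below. The sign convention (mutant minus wild type) is an assumption, since the abstract does not state one. The second identity is only the general thermodynamic fact that free energy is a state function, so changes add along any path of intermediate mutants; the abstract does not spell out how the mutation-path augmentation is built, and this is not a description of it. The labels A and AB for a single and a double mutant are illustrative, and `\text` assumes the `amsmath` package.

```latex
% Binding free energy change on mutation (assumed sign convention).
\[
\Delta\Delta G
  = \Delta G_{\text{bind}}^{\text{mut}} - \Delta G_{\text{bind}}^{\text{wt}}
\]

% Additivity along a path wt -> A -> AB, since G is a state function.
\[
\Delta\Delta G_{\text{wt}\to AB}
  = \Delta\Delta G_{\text{wt}\to A} + \Delta\Delta G_{A\to AB}
\]
```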
Syed Mohammed Khalid, Tom Wölker, Leidy-Alejandra G Molano, Simon Graf, Andreas Keller
Post-Acute Infection Syndromes (PAIS) are medical conditions that persist following acute infections from pathogens such as SARS-CoV-2, Epstein-Barr virus, and Influenza virus. Despite growing global awareness of PAIS and the exponential increase in biomedical literature, only a small fraction of this literature pertains specifically to PAIS, making the identification of pathogen-disease associations within such a vast, heterogeneous, and unstructured corpus a significant challenge for researchers. This study evaluated the effectiveness of large language models (LLMs) in extracting these associations through a binary classification task using a curated dataset of 1000 manually labeled PubMed abstracts. We benchmarked a wide range of open-source LLMs of varying sizes (4B-70B parameters), including generalist, reasoning, and biomedical-specific models. We also investigated the extent to which prompting strategies such as zero-shot, few-shot, and Chain of Thought (CoT) methods can improve classification performance. Our results indicate that model performance varied by size, architecture, and prompting strategy. Zero-shot prompting produced the most reliable results: Mistral-Small-Instruct-2409 and Llama-3.1-Nemotron-70B-Instruct achieved balanced accuracy scores of 0.81 and 0.80, respectively, along with macro-F1 scores of up to 0.80, while maintaining minimal invalid outputs. While few-shot and CoT prompting often degraded performance in generalist models, reasoning models such as DeepSeek-R1-Distill-Llama-70B and QwQ-32B demonstrated improved accuracy and consistency when provided with additional context.
Benchmarking large language models for pathogen-disease classification in post-acute infection syndromes. Syed Mohammed Khalid, Tom Wölker, Leidy-Alejandra G Molano, Simon Graf, Andreas Keller. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag089. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12963971/pdf/
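The PAIS benchmark above reports balanced accuracy and macro-F1 for a binary classification task. As a reminder of what those two numbers measure, here is a minimal sketch using their standard definitions; it is not the authors' evaluation code. The function names and the toy labels are hypothetical, and the handling of invalid model outputs (which the abstract mentions) is deliberately left out.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def _f1(hits, false_alarms, misses):
    denom = 2 * hits + false_alarms + misses
    return 2 * hits / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores of the two classes."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    f1_pos = _f1(tp, fp, fn)  # positive class as the target
    f1_neg = _f1(tn, fn, fp)  # negative class as the target
    return 0.5 * (f1_pos + f1_neg)


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]

if __name__ == "__main__":
    print(round(balanced_accuracy(y_true, y_pred), 4))
    print(round(macro_f1(y_true, y_pred), 4))
```

Both metrics weight the two classes equally, which matters when positive abstracts are much rarer than negative ones.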
Paramita Roy, Dibakar Roy, Sudipto Bhattacharjee, Abhirupa Ghosh, Sudipto Saha
Pulmonary diseases are becoming a serious threat worldwide, and an enormous amount of data from different human microbiomes has been generated to understand these complex diseases. Here, we introduce the Microbiome Database of Pulmonary Diseases (MDPD), an open-access, comprehensive, systematic catalog of pulmonary diseases built by manually curating global studies from 2012 to 2024 (13 years). We have compiled 59 362 runs from 430 BioProjects, encompassing data from 10 body sites related to 19 pulmonary diseases and healthy groups covering 278 distinct sub-groups. MDPD enables users to analyze each BioProject and to customize analyses across multiple BioProjects to identify taxonomic profiles and microbial signatures specific to disease groups and sub-groups. The re-analyzed intermediate Biological Observation Matrix (BIOM) files are provided for each BioProject so that users can reuse them in further applications, such as machine learning-based classification. Identified microbes (bacteria, fungi, viruses) in MDPD are annotated with several attributes, providing further insights into their disease-causing potential and specificity to certain diseases and body sites. MDPD is freely available at: https://bicresources.jcbose.ac.in/ssaha4/mdpd/.
MDPD reveals specific microbial signatures in human pulmonary diseases. Paramita Roy, Dibakar Roy, Sudipto Bhattacharjee, Abhirupa Ghosh, Sudipto Saha. Briefings in Bioinformatics 27(2), 2026-03-01. DOI: 10.1093/bib/bbag017. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12962063/pdf/