The growing threat of antimicrobial resistance (AMR) necessitates the rapid discovery of novel antimicrobial peptides (AMPs) as alternative therapeutics. However, most computational approaches rely on binary AMP/non-AMP classification or permissive MIC thresholds (e.g. ≤128 μg/mL), offering limited biological interpretability and translational value. We present CVAE-BIO, a biochemical-knowledge-driven, multi-module pipeline for the discovery of AMPs targeting drug-resistant Escherichia coli, used as a model pathogen, that is generalisable to other bacterial targets. The model integrates a conditional variational autoencoder (CVAE) constrained by key biochemical properties (MIC ≤10 μg/mL, net charge > +2, peptide length <40 residues, instability index <40, and Boman index <0) with a Random Forest classifier trained on 30 biochemical descriptors. In vitro validation showed that 18.5% of generated peptides exhibited strong activity (MIC ≤10 μg/mL), with 38.9% reaching MIC ≤50 μg/mL while maintaining key biochemical properties. Most validated novel peptides are narrow-spectrum AMPs targeting E. coli. Wet-lab results also showed that highly active cationic-amphipathic AMPs are characterized by significantly low counts of tiny and small residues, suggesting that avoiding these residues, or limiting them to a maximum of 2 and 3, respectively, might improve AMP activity. Taking both antimicrobial activity and hemolytic toxicity into account, 9 peptides were identified as non-toxic, active AMP candidates. This explainable framework enables efficient AMP discovery under biochemical constraints and yields experimentally validated candidates with translational potential.
Deliang Yang, Yifan Li, Chenxi Li, Qingpeng Zhang, Jiandong Huang, Xue Li, Peng Gao. Biochemical-knowledge-driven machine learning pipeline for generating potent antimicrobial peptides. Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag115. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12998437/pdf/
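The sequence-level constraints named in the abstract above can be illustrated with a small filter. This is a minimal sketch, not the CVAE-BIO implementation: the function names are hypothetical, net charge is approximated by counting K/R against D/E (histidine and termini ignored), the "tiny" and "small" residue classes are assumed to follow the common pepstats-style grouping, and the instability and Boman indices are left out because they need lookup tables the abstract does not give.

```python
# Illustrative filter only; names and residue classes are assumptions,
# not taken from the CVAE-BIO code.
TINY = set("ACGST")         # assumed "tiny" residue class
SMALL = set("ABCDGNPSTV")   # assumed "small" residue class (includes the tiny ones)


def net_charge(seq: str) -> int:
    """Crude net charge near neutral pH: (K + R) - (D + E)."""
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def passes_constraints(seq: str, max_len: int = 40, min_charge: int = 2,
                       max_tiny: int = 2, max_small: int = 3) -> bool:
    """Check length < 40, net charge > +2, and the tiny/small residue caps."""
    seq = seq.upper()
    tiny = sum(1 for aa in seq if aa in TINY)
    small = sum(1 for aa in seq if aa in SMALL)
    return (len(seq) < max_len
            and net_charge(seq) > min_charge
            and tiny <= max_tiny
            and small <= max_small)


print(passes_constraints("KKLLKKLLKKLL"))  # True
```

Because the assumed tiny class is a subset of the small class, a cap of 2 tiny and 3 small residues leaves room for at most one small residue that is not also tiny; a different class definition would change that.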
Sunil Nagpal. Supervisory signals are intriguingly high in even simple features for predicting anticancer effect of antibody drug conjugates. Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag108. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12981646/pdf/
Accurately predicting the structures of RNA-protein complexes remains a major challenge. Recently, machine learning-based methods such as AlphaFold3 and RosettaFoldNA have been proposed. However, most conventional approaches rely on docking simulations to generate candidate structures, from which the accurate ones are then selected using various methods. This study presents a method that integrates specialized molecular dynamics simulations and machine learning (ML) techniques to identify the correct structure among many docking poses. First, steered molecular dynamics simulations are performed to estimate the stability of the candidate structures. The simulation data then serve as the training data for an ML model, which classifies the candidates as either correct or incorrect. Next, the candidates predicted as correct are narrowed down using thermodynamic simulations and ML methods. Candidate structures from the RNA-protein docking simulations could be classified as correct or incorrect with an accuracy of 0.934. Additionally, we used AlphaFold3 to predict 15 RNA-protein complexes that Zou's group categorized as difficult, medium, or easy. Our method then classified these binding structures as correct or incorrect, with accuracies of 0.80, 0.92 and 0.96, respectively. Thus, our method is powerful for accurately predicting the structures of RNA-protein complexes.
Bui Tien Thanh, Yoichi Kurumida, Kaito Kobayashi, Michiaki Hamada, Tomoshi Kameda. Differentiation of RNA-protein docking structures through molecular dynamics simulation and machine learning methods. Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag109. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12991047/pdf/
Single-cell Hi-C (scHi-C) provides unprecedented insight into 3D genome organization, but its sparse and noisy data pose challenges in accurately detecting A/B compartments, which are crucial for understanding chromatin structure and gene regulation. We present scDIAGRAM, a data-driven method for annotating A/B compartments in single cells using direct statistical modeling and graph community detection. Unlike existing approaches, scDIAGRAM infers chromatin compartments directly from individual scHi-C matrices without imputation or external reference features, and subsequently assigns A/B labels using conventional genomic annotations. The accuracy and robustness of scDIAGRAM are illustrated on simulated scHi-C datasets and a human cell line. We applied scDIAGRAM to real scHi-C datasets from the mouse brain cortex, mouse embryonic development, and human acute myeloid leukemia, demonstrating its ability to capture compartmental shifts associated with transcriptional variation. This robust framework offers new insights into the functional roles of chromatin compartments at single-cell resolution across various biological contexts.
Yongli Peng, Yujing Deng, Menghan Liu, Zhiyuan Liu, Ya-Hui Li, Xiang-Yu Zhao, Dong Xing, Jinzhu Jia, Hao Ge. scDIAGRAM: detecting chromatin compartments from individual single-cell Hi-C matrix without imputation or reference features. Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag096. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12967335/pdf/
Protein-protein interactions (PPIs) are central to cellular signaling and regulation, and their dysregulation underlies many diseases. Predicting the impact of mutations on PPI stability, quantified as ΔΔG, is essential for understanding disease mechanisms and guiding protein engineering. Here, we first present MutPPI, a graph-based deep-learning model that encodes full-residue structural features of protein-protein complexes and employs a shared GIN-GAT feature extractor for wild-type and mutant complexes. MutPPI outperforms 12 existing methods on an antibody-antigen single-point mutation dataset (S645). By integrating evolutionary information from protein language models, we further develop MutPPI-plus, which achieves enhanced predictive performance. Second, we propose a mutation-path-based data augmentation strategy, which enriches input modalities and improves the generalization of both MutPPI and MutPPI-plus. After data augmentation, MutPPI-plus demonstrates state-of-the-art performance on S645 and three additional multi-point mutation datasets (SM_ZEMu, SM595, SM1124), substantially surpassing DDMut-PPI. Our analyses highlight the benefits of the multimodal framework and the physically informed data augmentation method. Together, these results provide a versatile computational tool for accurate ΔΔG prediction, advancing rational protein design.
Juntao Deng, Miao Gu, Pengyan Zhang, Tao Liu, Guansong Hu, Mingyu Dong, Yabin Zhang, Yizhen Song, Yunfan Zhang, Min Liu, Junzhang Tian, Weibin Cheng. MutPPI+: a multimodal framework for predicting mutation effects on protein-protein interactions via mutation-path-based data augmentation. Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag105. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12967331/pdf/
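For orientation, the predicted quantity in the MutPPI abstract can be written out explicitly. The sign convention below (mutant minus wild type) is an assumption, since the abstract does not state one, and the path relations are only one plausible reading of "mutation-path-based" augmentation: they follow from binding free energy being a state function, not from any construction described in the abstract.

```latex
% Assumed convention: change in binding free energy upon mutation
\Delta\Delta G_{\mathrm{wt}\to\mathrm{mut}}
  = \Delta G_{\mathrm{bind}}^{\mathrm{mut}} - \Delta G_{\mathrm{bind}}^{\mathrm{wt}}

% Reversing a mutation flips the sign
\Delta\Delta G_{\mathrm{mut}\to\mathrm{wt}}
  = -\,\Delta\Delta G_{\mathrm{wt}\to\mathrm{mut}}

% Along a path wt -> m_1 -> m_2, the steps add up
\Delta\Delta G_{\mathrm{wt}\to m_2}
  = \Delta\Delta G_{\mathrm{wt}\to m_1} + \Delta\Delta G_{m_1\to m_2}
```

Under these relations, every measured multi-point mutant with a measured intermediate yields extra labeled pairs (the reversed step and the intermediate-to-final step) at no experimental cost.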
Syed Mohammed Khalid, Tom Wölker, Leidy-Alejandra G Molano, Simon Graf, Andreas Keller
Post-Acute Infection Syndromes (PAIS) are medical conditions that persist following acute infections from pathogens such as SARS-CoV-2, Epstein-Barr virus, and Influenza virus. Despite growing global awareness of PAIS and the exponential increase in biomedical literature, only a small fraction of this literature pertains specifically to PAIS, making the identification of pathogen-disease associations within such a vast, heterogeneous, and unstructured corpus a significant challenge for researchers. This study evaluated the effectiveness of large language models (LLMs) in extracting these associations through a binary classification task using a curated dataset of 1000 manually labeled PubMed abstracts. We benchmarked a wide range of open-source LLMs of varying sizes (4B-70B parameters), including generalist, reasoning, and biomedical-specific models. We also investigated the extent to which prompting strategies such as zero-shot, few-shot, and Chain of Thought (CoT) methods can improve classification performance. Our results indicate that model performance varied by size, architecture, and prompting strategy. Zero-shot prompting produced the most reliable results: Mistral-Small-Instruct-2409 and Llama-3.1-Nemotron-70B-Instruct achieved balanced accuracy scores of 0.81 and 0.80, respectively, along with macro-F1 scores of up to 0.80, while maintaining minimal invalid outputs. While few-shot and CoT prompting often degraded performance in generalist models, reasoning models such as DeepSeek-R1-Distill-Llama-70B and QwQ-32B demonstrated improved accuracy and consistency when provided with additional context.
Syed Mohammed Khalid, Tom Wölker, Leidy-Alejandra G Molano, Simon Graf, Andreas Keller. Benchmarking large language models for pathogen-disease classification in post-acute infection syndromes. Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag089. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12963971/pdf/
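The two headline metrics in the PAIS benchmark, balanced accuracy and macro-F1, are standard and easy to compute for a binary task. The sketch below uses only the standard library; the function names and the toy labels are illustrative assumptions, not the study's evaluation code or data, and invalid model outputs are not handled.

```python
def _safe_div(num: float, den: float) -> float:
    """Return num/den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion(y_true, y_pred):
    """Binary confusion counts (tp, tn, fp, fn) with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred) -> float:
    """Mean of sensitivity and specificity."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return 0.5 * (_safe_div(tp, tp + fn) + _safe_div(tn, tn + fp))


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of the F1 scores of the positive and negative class."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    f1_pos = _safe_div(2 * tp, 2 * tp + fp + fn)
    f1_neg = _safe_div(2 * tn, 2 * tn + fn + fp)
    return 0.5 * (f1_pos + f1_neg)


# Toy labels: 3 relevant abstracts, 5 irrelevant ones.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(balanced_accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Both metrics weight the two classes equally, which matters when relevant abstracts are a minority of the corpus; plain accuracy would reward a model that labels everything as irrelevant.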
Paramita Roy, Dibakar Roy, Sudipto Bhattacharjee, Abhirupa Ghosh, Sudipto Saha
Pulmonary diseases are becoming a serious threat worldwide, and enormous amounts of data from different human microbiomes have been generated to understand these complex diseases. Here, we introduce the Microbiome Database of Pulmonary Diseases (MDPD), an open-access, comprehensive, systematic catalog of pulmonary diseases built by manually curating global studies from 2012 to 2024 (13 years). We have compiled 59 362 runs from 430 BioProjects, encompassing data from 10 body sites related to 19 pulmonary diseases and healthy groups covering 278 distinct sub-groups. MDPD enables users to analyze each BioProject and to customize analyses across multiple BioProjects to identify taxonomic profiles and disease group/sub-group specific microbial signatures. The re-analyzed intermediate Biological Observation Matrix files are provided for each BioProject so that users can reuse them in further applications, such as machine learning-based classification. Identified microbes (bacteria, fungi, viruses) in MDPD are annotated with several attributes, providing further insights into their disease-causing potential and specificity to certain diseases and body sites. MDPD is freely available at: https://bicresources.jcbose.ac.in/ssaha4/mdpd/.
Paramita Roy, Dibakar Roy, Sudipto Bhattacharjee, Abhirupa Ghosh, Sudipto Saha. MDPD reveals specific microbial signatures in human pulmonary diseases. Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag017. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12962063/pdf/
Accurately predicting protein-ligand interactions is vital for structure-based drug discovery. Although deep learning (DL) models have shown strong performance, the potential of traditional statistical potentials under data-limited conditions remains underexplored. Here, we systematically assess several statistical potential models in docking and virtual screening. We find that docking benefits from distance-dependent pairwise atom-atom potentials with clear physical meanings, while screening relies more on orientation-dependent atom-residue potentials that capture local chemical environments. Based on these findings, we propose HybridSP, a hybrid potential combining distance-dependent atom-atom, atom-residue, and orientation-dependent atom-residue terms. An affinity-weighted scheme is applied to correct biases in statistical distributions. On the CASF-2016 benchmark, HybridSP achieves a 91.6% docking success rate and an enrichment factor of 29.35 at the top 1%, rivaling and even surpassing state-of-the-art DL models. Its strong screening ability is further validated on the Directory of Useful Decoys-Enhanced and Directory of Useful Decoys-Adjusted datasets. These results demonstrate that well-designed statistical potentials can achieve high performance and interpretability without complex DL architectures, offering an efficient alternative for scoring function design. The models are available at: https://github.com/zelixirSH/HybridSP.git.
Zhihao Wang, Sheng Wang, Jingjing Guo, Yuguang Mu, Xiangdong Liu, Liangzhen Zheng, Weifeng Li. Could statistical potential models achieve comparable or better performance than deep learning models? Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag088. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951076/pdf/
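The enrichment factor quoted for the top 1% has a simple definition: the hit rate among the top-ranked fraction divided by the hit rate of the whole library. The sketch below shows that calculation on made-up data; the function name, the rounding of the cutoff, and the toy library are assumptions for illustration and are unrelated to the HybridSP code or the CASF-2016 files.

```python
def enrichment_factor(scores, is_active, fraction=0.01, higher_is_better=True):
    """EF = (actives in top fraction / size of top fraction) / (all actives / library size)."""
    n = len(scores)
    n_top = max(1, round(n * fraction))      # assumed cutoff rule: at least one compound
    total_actives = sum(1 for a in is_active if a)
    if total_actives == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    # Rank compound indices from best to worst score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=higher_is_better)
    hits = sum(1 for i in order[:n_top] if is_active[i])
    # Integer arithmetic first, one division at the end.
    return (hits * n) / (n_top * total_actives)


# Toy library: 100 compounds, 5 actives, 4 of which are ranked in the top 10.
scores = list(range(100, 0, -1))             # compound 0 has the best score
is_active = [i in (0, 1, 2, 3, 50) for i in range(100)]
print(enrichment_factor(scores, is_active, fraction=0.10))  # 8.0
```

The ceiling of the metric is one over the fraction, capped by the active ratio, so at 1% the maximum possible value is 100 and a reported 29.35 should be read against that bound.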
Wasif Jalal, Mubasshira Musarrat, Md Abul Hassan Samee, M Sohel Rahman
Despite aging being a fundamental biological process that profoundly influences health and disease, the interplay between tissue-specific aging and mortality remains underexplored. This study applies machine learning to GTEx transcriptomic data to model tissue-specific biological ages across 12 tissue types and introduces an age-gap metric to quantify deviations from chronological age. We use several modeling techniques optimized with three feature selection strategies: Pearson correlation, age-related differentially expressed genes, and tissue-enriched genes (expressed at least four-fold higher in a specific tissue). Among these, Pearson correlation combined with elastic net regression yields the best performance, with models achieving an average root mean squared error of 6.44 years and an R² of 0.64. To quantify deviations from chronological age relative to the population, we train neural networks to regress predicted ages against chronological ages, and subtract their outputs from the predicted ages to calculate a metric that we call the age-gap. Age-gap statistics reveal significant tissue-specific aging patterns, identifying extreme agers and correlations between extreme aging and mortality. About 20% of subjects are found to exhibit extreme aging in one tissue, while 1% show multi-organ aging. Further analysis reveals that accelerated aging in specific tissues correlates with greater risk of death from illness. These findings emphasize the role of transcriptomics in aging research and its implications for health and longevity.
Wasif Jalal, Mubasshira Musarrat, Md Abul Hassan Samee, M Sohel Rahman. ORANGE: a machine learning approach for modeling tissue-specific aging from transcriptomic data. Briefings in Bioinformatics 27(2), 2026-03-01. doi:10.1093/bib/bbag093. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12951074/pdf/
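The age-gap construction described in the ORANGE abstract (regress predicted age on chronological age, then subtract the fitted value from the prediction) can be sketched in a few lines. One substitution is made here and should be read as such: the abstract says neural networks perform the regression, whereas this sketch swaps in an ordinary least-squares line to stay dependency-free. Function names and the toy ages are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def age_gaps(chronological, predicted):
    """Age-gap = predicted age minus the population trend at that chronological age.

    A linear trend stands in for the neural-network regressor named in the abstract.
    """
    slope, intercept = fit_line(chronological, predicted)
    return [p - (intercept + slope * c) for c, p in zip(chronological, predicted)]


# Toy data for one tissue: positive gaps suggest faster-than-expected aging.
chronological = [30, 40, 50, 60, 70]
predicted = [38, 42, 50, 58, 72]
for age, gap in zip(chronological, age_gaps(chronological, predicted)):
    print(age, round(gap, 2))
```

Subtracting the fitted trend rather than the chronological age itself removes the systematic bias of age predictors, which tend to overestimate young subjects and underestimate old ones, so the gap reflects deviation from peers of the same age.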
The pathological aggregation of α-synuclein (α-syn) constitutes a pivotal hallmark in the progression of neurodegenerative disorders, including Parkinson's disease, underscoring the imperative need for identifying site-specific ligands. This study presents, for the first time, an advanced deep learning framework specifically designed for the prediction of molecular properties associated with α-syn. The framework integrates graph-based contextual attention mechanisms, structural feature aggregation protocols, and dual-channel feature integration, complemented by a composite regularization strategy that synergizes mean squared error minimization, Kullback-Leibler divergence-induced latent space regularization, and L2 norm penalization, thereby delivering outstanding predictive accuracy on the independent test dataset with an MSE of 0.1812. Mechanistic insights derived from GNNExplainer analysis and molecular docking studies (PDB: 6A6B) elucidated that aromatic ring systems (benzene ring significance: 0.737) and hydrogen bond donor groups (amino group significance: 0.438) play critical roles in mediating high-affinity ligand-receptor interactions through π-π stacking within the hydrophobic pocket formed by Val82 and Ala89 residues, as well as directed hydrogen bonding involving catalytic residues Ser42 and Lys45. These findings not only enhance the understanding of inhibitor mechanisms but also establish a novel framework for the preliminary screening of small-molecule therapeutics, thereby laying a rigorous groundwork for structure-guided drug optimization and rational molecular design.
{"title":"Drug screening for α-synuclein aggregation inhibitors via multimodal graph neural network.","authors":"Tingle Gu, Zixu Ran, Wenyin Li, Xudong Guo, Bo Li, Fuyi Li, Cangzhi Jia","doi":"10.1093/bib/bbag118","DOIUrl":"https://doi.org/10.1093/bib/bbag118","url":null,"abstract":"<p><p>The pathological aggregation of α-synuclein (α-syn) constitutes a pivotal hallmark in the progression of neurodegenerative disorders, including Parkinson's disease, underscoring the imperative need for identifying site-specific ligands. This study presents, for the first time, an advanced deep learning framework specifically designed for the prediction of molecular properties associated with α-syn. The framework integrates graph-based contextual attention mechanisms, structural feature aggregation protocols, and dual-channel feature integration, complemented by a composite regularization strategy that synergizes mean squared error minimization, Kullback-Leibler divergence-induced latent space regularization, and L2 norm penalization, thereby delivering outstanding predictive accuracy on the independent test dataset with MSE of 0.1812. Mechanistic insights derived from GNNExplainer analysis and molecular docking studies (PDB: 6A6B) elucidated that aromatic ring systems (benzene ring significance: 0.737) and hydrogen bond donor groups (amino group significance: 0.438) play critical roles in mediating high-affinity ligand-receptor interactions through π-π stacking within the hydrophobic pocket formed by Val82 and Ala89 residues, as well as directed hydrogen bonding involving catalytic residues Ser42 and Lys45. 
These findings not only enhance the understanding of inhibitor mechanisms but also establish a novel framework for the preliminary screening of small-molecule therapeutics, thereby laying a rigorous groundwork for structure-guided drug optimization and rational molecular design.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"27 2","pages":""},"PeriodicalIF":7.7,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147497677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
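The composite regularization strategy named in the α-syn abstract above (mean squared error, Kullback-Leibler latent regularization, and an L2 penalty) can be sketched as a single loss. This is a hedged illustration only: the abstract does not give the exact formulation, so the sketch assumes a diagonal Gaussian latent posterior with a standard normal prior (the usual variational-autoencoder form), and the weighting factors `beta` and `weight_decay` and all function names are hypothetical.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions.

    Assumes a diagonal Gaussian posterior; this form is an assumption,
    not something stated in the abstract.
    """
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def l2_penalty(weights):
    """Squared L2 norm of the model weights."""
    return sum(w * w for w in weights)


def composite_loss(y_true, y_pred, mu, log_var, weights,
                   beta=1e-3, weight_decay=1e-4):
    """Weighted sum of the three terms; beta and weight_decay are illustrative."""
    return (mse(y_true, y_pred)
            + beta * kl_to_standard_normal(mu, log_var)
            + weight_decay * l2_penalty(weights))


# Made-up inputs: a latent code already matching the prior contributes no KL cost.
loss = composite_loss([1.0, 2.0], [1.5, 1.5],
                      mu=[0.0, 0.0], log_var=[0.0, 0.0],
                      weights=[3.0, 4.0], beta=1.0, weight_decay=0.01)
```

In a real training loop the same three terms would typically be computed on tensors by a deep learning framework; the pure-Python version here only makes the arithmetic of each term explicit.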