Pub Date : 2025-08-29 DOI: 10.1186/s13321-025-01077-2
Dayan Liu, Tao Song, Shuang Wang, Xue Li, Peifu Han, Jianmin Wang, Shudong Wang
Protein-protein interactions (PPIs) regulate essential biological processes through complex interfaces, and their dysfunction is associated with various diseases. Consequently, the identification of PPIs and their interface-targeting modulators has emerged as a critical therapeutic approach. However, discovering modulators that target PPIs and PPI interfaces remains challenging, as traditional structure-similarity-based methods fail to effectively characterize PPI targets, particularly those for which no active compounds are known. Here, we present AlphaPPIMI, a comprehensive deep learning framework that combines large-scale pretrained language models with domain adaptation for predicting PPI-modulator interactions, specifically targeting PPI interfaces. To enable robust model development and evaluation, we constructed comprehensive benchmark datasets of PPI-modulator interactions (PPIMI). Our framework integrates comprehensive molecular features from Uni-Mol2, protein representations derived from state-of-the-art language models (ESM2 and ProTrans), and PPI structural characteristics encoded by PFeature. Through a specialized cross-attention architecture and conditional domain adversarial networks (CDAN), AlphaPPIMI effectively learns potential associations between PPI targets and modulators while ensuring robust cross-domain generalization. Extensive evaluations indicate that AlphaPPIMI achieves consistently improved performance over existing methods in PPIMI prediction, offering a promising approach for prioritizing candidate PPI modulators, particularly those targeting protein–protein interfaces.
This work presents AlphaPPIMI, a novel deep learning framework for accurately predicting modulators targeting protein-protein interactions (PPIs) and their interfaces. Its core contributions include a specialized cross-attention module for the synergistic fusion of multimodal pretrained representations, and the novel application of a Conditional Domain Adversarial Network (CDAN) to significantly improve generalization across diverse protein families. AlphaPPIMI demonstrates superior performance on curated benchmarks, providing a powerful computational tool for the discovery of targeted PPI therapeutics.
Title: AlphaPPIMI: a comprehensive deep learning framework for predicting PPI-modulator interactions. Journal of Cheminformatics 17(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01077-2
Pub Date : 2025-08-29 DOI: 10.1186/s13321-025-01062-9
Roda Bounaceur, Francisco Paes, Romain Privat, Jean-Noël Jaubert
In this paper, we propose a robust deep-learning model based on a Quantitative Structure–Property Relationship (QSPR) approach for estimating the critical temperature (TC), critical pressure (PC), acentric factor (ACEN) and normal boiling point (NBP) of any molecule composed of C, H, O, N, S, P, F, Cl, Br and I atoms. The Mordred calculator was used to compute 247 descriptors characterizing the molecules considered in this work. For each evaluated property, multiple neural networks were trained within a bagging framework. The predictions from the final ensemble were successfully tested against a large set of experimental data comprising more than 1700 molecules and compared with those from recent learning models found in the literature. Comprehensive comparisons and extensive testing highlight the robustness and predictive power of the newly proposed multimodal learning model. The prediction tool is available online at https://lrgp-thermoppt.streamlit.app/. Furthermore, Python source code for running the trained models is available on GitHub at https://github.com/bounac80/AI-ThermPpt.
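The bagging idea mentioned in the abstract (train several models on bootstrap resamples, then average their predictions) can be illustrated with a minimal sketch. This is not the authors' implementation: a one-variable least-squares line stands in for each neural network, and the toy data and the names `fit_line`, `bagging_fit` and `bagging_predict` are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (a stand-in for one neural network)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate bootstrap sample (all x equal): fall back to a constant model.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train n_models base learners, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    """Ensemble prediction: average of the individual model predictions."""
    return mean(a * x + b for a, b in models)


# Hypothetical toy data: one descriptor value per molecule, one property value.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x + 1.0 for x in xs]
models = bagging_fit(xs, ys)
prediction = bagging_predict(models, 7.0)
```

In a setting like the one described, the base learner would be a neural network taking the 247 descriptors as input; only the resample-and-average structure carries over from this sketch.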
Title: AI-powered prediction of critical properties and boiling points: a hybrid ensemble learning and QSPR approach. Journal of Cheminformatics 17(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01062-9
Pub Date : 2025-08-29 DOI: 10.1186/s13321-025-01081-6
Lucina-May Nollen, David Meijer, Maria Sorokina, Justin J. J. van der Hooft
Natural products provide a rich source of bioactive molecules for a variety of applications. Molecular fingerprints are the tool of choice for systematic large-scale studies of their structures. However, current molecular fingerprints do not adequately represent the characteristic features of natural products, which decreases the interpretability of natural product-specific predictions. Here, we show that a natural product-specific molecular fingerprint based on a relatively small set of selected biosynthetic building blocks provides more interpretable predictions of biosynthetic distance and natural product classification. Using only 39 substructure keys, our fingerprint Biosynfoni outperforms MACCS, Morgan, and Daylight-like fingerprints in biosynthetic distance estimation. Moreover, Biosynfoni’s design, compactness, and concrete substructure definition allow easy visualisation of the detected substructures and their respective biosynthetic pathway origins. Through Biosynfoni, users can gain more insights from predictions and better examine the importance of features within machine learning models. Our results show that a short fingerprint consisting of biologically significant building blocks performs on par with top-performing molecular fingerprints for natural product classification while improving prediction explainability.
Title: Biosynfoni: a biosynthesis-informed and interpretable lightweight molecular fingerprint. Journal of Cheminformatics 17(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01081-6
Pub Date : 2025-08-29 DOI: 10.1186/s13321-025-01073-6
Yutong Lu, Yan Yi Li, Yan Sun, Pingzhao Hu
Chemical Language Models (CLMs) have demonstrated capabilities in extracting patterns and making predictions from vast volumes of Simplified Molecular Input Line Entry System (SMILES) strings, a notation used to represent molecular structures. Different CLMs, developed from various architectures, can provide unique insights into molecular properties. To harness the uniqueness of different CLMs, we propose FusionCLM, a novel stacking-ensemble learning algorithm that integrates the outputs of multiple CLMs into a unified framework. FusionCLM first generates SMILES embeddings, predictions, and losses from each CLM. Auxiliary models are trained on these first-level predictions and embeddings to estimate test losses during inference. The losses and predictions are then concatenated to create an integrated feature matrix, which is used to train second-level meta-models for final predictions. Empirical testing on five datasets demonstrates that FusionCLM outperforms the individual first-level CLMs and three advanced multimodal deep learning frameworks, showcasing FusionCLM’s potential in advancing molecular property prediction.
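The concatenation step described above can be sketched as follows. This is only an illustration of how per-model predictions and (estimated) losses might be arranged into one feature row per molecule; the function name `build_meta_features`, the column order (predictions first, then losses) and the toy numbers are assumptions, not details taken from FusionCLM.

```python
def build_meta_features(predictions, estimated_losses):
    """Concatenate first-level outputs into one feature row per molecule.

    predictions[m][i] and estimated_losses[m][i] hold the value produced by
    first-level model m for molecule i (hypothetical inputs for illustration).
    """
    n_models = len(predictions)
    if n_models == 0 or len(estimated_losses) != n_models:
        raise ValueError("need one prediction list and one loss list per model")
    n_samples = len(predictions[0])
    rows = []
    for i in range(n_samples):
        # Assumed column order: all predictions, then all losses.
        row = [predictions[m][i] for m in range(n_models)]
        row += [estimated_losses[m][i] for m in range(n_models)]
        rows.append(row)
    return rows


# Three hypothetical first-level models, two molecules.
preds = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.1]]
losses = [[0.05, 0.30], [0.20, 0.10], [0.10, 0.25]]
features = build_meta_features(preds, losses)
```

Each row of `features` would then serve as the input to a second-level meta-model; at inference time the loss columns would come from the auxiliary loss estimators rather than from true labels.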
Title: FusionCLM: enhanced molecular property prediction via knowledge fusion of chemical language models. Journal of Cheminformatics 17(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01073-6
In recent years, the integration of Artificial Intelligence and Machine Learning methods with biochemical and biomedical research has revolutionized the field of toxicology, significantly advancing our understanding of the toxicological effects of chemicals on biological systems. Cardiovascular diseases remain the leading global cause of death. Constant exposure to multiple chemicals with potential cardiotoxic effects, including environmental contaminants, pesticides, food additives, and drugs, can significantly contribute to these adverse health outcomes. Traditional methods for assessing chemical hazards and their impact on biological function rely heavily on experimental assays and animal studies, which are often time-consuming, resource-intensive, and limited in scalability. To overcome these limitations, in silico methods have emerged as indispensable tools in toxicological research, reducing the need for traditional in vivo testing and saving time and cost. In this study, Artificial Intelligence methods are used as first-tier components within an Integrated Approach to Testing and Assessment. We explored the potential benefits of using Multitask Neural Networks, where multiple levels of cardiotoxicity information are combined to enhance model performance. Multitask learning, based on specific architectures such as Mixture of Experts (MoE), showed promising results and surpassed the performance of single-task baseline models. When predicting a holdout set, the multitask model achieved high performance on twelve different cardiotoxicity-related endpoints defined by an Adverse Outcome Pathway network. The best model achieved a balanced accuracy of 78%, a sensitivity of 80%, and a specificity of 76% across all endpoints in the holdout set.
An advanced multitask model was developed to predict cardiotoxicity mechanisms induced by small molecules. The model demonstrates broad mechanistic coverage and achieves performance comparable to, or exceeding, state-of-the-art methods. These results suggest that the model could serve as a valuable first-tier component in advanced New Approach Methodologies for prioritizing chemicals for further testing.
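The three holdout figures reported in the abstract are mutually consistent under the standard definition of balanced accuracy as the mean of sensitivity and specificity. A minimal check, assuming that definition is the one used (the function name is illustrative):

```python
def balanced_accuracy(sensitivity, specificity):
    """Balanced accuracy: arithmetic mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2.0


# Reported holdout values: sensitivity 80%, specificity 76%.
ba = balanced_accuracy(0.80, 0.76)
```

Because this mean is linear, the same relation holds when sensitivity and specificity are first averaged over the twelve endpoints in the same way.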
Title: Mixture of experts for multitask learning in cardiotoxicity assessment. Authors: Edoardo Luca Viganò, Mateusz Iwan, Erika Colombo, Davide Ballabio, Alessandra Roncaglioni. Pub Date : 2025-08-29 DOI: 10.1186/s13321-025-01072-7. Journal of Cheminformatics 17(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01072-7
Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01071-8
Yixiang Mao, Souparno Ghosh, Ranadip Pal
Quantitative structure–activity relationship (QSAR) modeling has become a critical tool in drug design. The recently proposed Topological Regression (TR), a computationally efficient and highly interpretable QSAR model that maps distances in the chemical domain to distances in the activity domain, has shown predictive performance comparable to state-of-the-art deep learning-based models. However, TR’s dependence on simple random sampling for anchor selection and its use of radial basis functions for response reconstruction constrain its interpretability and predictive capacity. To address these limitations, we propose Adaptive Topological Regression (AdapToR) with adaptive anchor selection and optimization-based reconstruction. We evaluated AdapToR on the NCI60 GI50 dataset, which consists of over 50,000 drug responses across 60 human cancer cell lines, and compared its performance to Transformer CNN, Graph Transformer, TR, and other baseline models. The results demonstrate that AdapToR outperforms competing QSAR models for drug response prediction, with significantly lower computational cost and greater interpretability than deep learning-based models.
Title: AdapToR: Adaptive Topological Regression for quantitative structure–activity relationship modeling. Journal of Cheminformatics 17(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01071-8
Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01088-z
Junseok Choe, Hajung Kim, Yan Ting Chok, Mogan Gim, Jaewoo Kang
Retrosynthesis, the process of deconstructing complex molecules into simpler, more accessible precursors, is a cornerstone of drug discovery and material design. While machine learning has improved single-step retrosynthesis prediction, generating complete multi-step retrosynthetic routes remains challenging. In this study, we explore the integration of single-step retrosynthesis models with various planning algorithms to improve multi-step retrosynthetic route generation. We expand the exploration space beyond previously limited settings by combining planning algorithms, single-step retrosynthesis models, and diverse datasets, enabling a more comprehensive assessment of retrosynthetic strategies. We evaluated synthetic routes based on both solvability, the ability to generate a complete route, and route feasibility, which reflects their practical executability in the laboratory. Our findings show that the model combination with the highest solvability does not always produce the most feasible routes, underscoring the need for more nuanced evaluation. Through a systematic analysis of combinations of planning algorithms and single-step retrosynthesis models, their performance across different datasets, and various practical metrics, our study provides a more comprehensive evaluation of retrosynthetic planning strategies. These insights contribute to a better understanding of computational retrosynthesis and its alignment with real-world applicability.
Title: Retrosynthetic crosstalk between single-step reaction and multi-step planning. Journal of Cheminformatics 17(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01088-z
Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01070-9
Maximilian Fleck, Samir Darouich, Marcelle B. M. Spera, Niels Hansen
When data availability is limited, the prediction of properties through purely data-driven machine learning (ML) is challenging. Integrating physically-based modeling techniques into ML methods may lead to better performance. In a recent work by Chew et al. (“Advancing material property prediction: using physics-informed machine learning models for viscosity”), descriptors from classical molecular dynamics (MD) simulations were incorporated into a quantitative structure–property relationship to accurately predict the temperature-dependent viscosity of pure liquids. Through feature importance analysis, the authors found that heat of vaporization was the most relevant descriptor for the prediction of viscosity. In this comment, we discuss the physical origin of this finding by referring to Eyring’s rate theory and develop an alternative modeling approach using a thermodynamics-based architecture that requires less input data.
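For orientation, the textbook form of Eyring's viscosity expression shows why heat of vaporization would be expected to matter. The relation below is the commonly quoted version with the empirical activation-energy estimate; it is given here as background under that assumption and is not necessarily the exact formulation used in the comment. Here $\eta$ is the viscosity, $N_A$ Avogadro's number, $h$ Planck's constant, $V_m$ the molar volume, $R$ the gas constant, $T$ the temperature, $\Delta G^{\ddagger}$ the activation free energy of flow, and $\Delta U_{\mathrm{vap}}$ the molar internal energy of vaporization (roughly $\Delta H_{\mathrm{vap}} - RT$).

```latex
\eta = \frac{N_A h}{V_m}\,\exp\!\left(\frac{\Delta G^{\ddagger}}{R T}\right),
\qquad
\Delta G^{\ddagger} \approx \frac{\Delta U_{\mathrm{vap}}}{2.45}
```

Under this approximation the logarithm of viscosity scales linearly with the vaporization energy divided by $RT$, which is consistent with heat of vaporization emerging as a dominant descriptor.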
Title: Comment on “Advancing material property prediction: using physics-informed machine learning models for viscosity”. Journal of Cheminformatics 17(1). PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01070-9
Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01083-4
Wei Liu, Jianguo Li, Chandra S. Verma, Hwee Kuan Lee
Cyclic peptides are promising drug candidates due to their ability to modulate intracellular protein–protein interactions, a property often inaccessible to small molecules. However, their typically poor membrane permeability limits therapeutic applicability. Accurate computational prediction of permeability can accelerate the identification of cell-permeable candidates, reducing reliance on time-consuming and costly experimental screening. Although deep learning has shown potential in predicting molecular properties, its application in permeability prediction remains underexplored. A systematic evaluation of these models is important to assess current capabilities and guide future development. In this study, we conduct a comprehensive benchmark of 13 machine learning models for predicting cyclic peptide membrane permeability. These models cover four types of molecular representations: fingerprints, SMILES strings, molecular graphs, and 2D images. We use experimentally measured PAMPA permeability data from the CycPeptMPDB database, comprising nearly 6000 cyclic peptides, and evaluate performance across three prediction tasks: regression, binary classification, and soft-label classification. Two data-splitting strategies, random split and scaffold split, are used to assess the generalizability of trained models. Our results show that model performance depends strongly on molecular representation and model architecture. Graph-based models, particularly the Directed Message Passing Neural Network (DMPNN), consistently achieve top performance across tasks. Regression generally outperforms classification. Scaffold-based splitting, although intended to more rigorously assess generalization, yields substantially lower model generalizability compared to random splitting. Comparing prediction errors with experimental variability highlights the practical value of current models while also indicating room for further improvement.
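The difference between the two data-splitting strategies named above can be sketched as follows. This is a minimal illustration, not the benchmark's code: scaffold keys are assumed to be precomputed strings (for example Bemis-Murcko scaffold SMILES), the largest-groups-to-training rule is one common convention, and all function names and toy data are hypothetical.

```python
import random
from collections import defaultdict


def random_split(n_items, test_fraction=0.2, seed=0):
    """Shuffle indices and cut; similar scaffolds may land on both sides."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    n_test = int(round(test_fraction * n_items))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


def scaffold_split(scaffolds, test_fraction=0.2):
    """Assign whole scaffold groups to one side, so no scaffold is shared."""
    groups = defaultdict(list)
    for i, key in enumerate(scaffolds):
        groups[key].append(i)
    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n_train_target = len(scaffolds) - int(round(test_fraction * len(scaffolds)))
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Hypothetical scaffold keys for ten peptides.
scaffolds = ["A", "A", "A", "B", "B", "C", "C", "D", "E", "F"]
train_idx, test_idx = scaffold_split(scaffolds, test_fraction=0.3)
rand_train, rand_test = random_split(len(scaffolds), test_fraction=0.3)
```

With a scaffold split, every test molecule has a scaffold unseen during training, which is why scores under this protocol are typically lower than under a random split.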
Title: Systematic benchmarking of 13 AI methods for predicting cyclic peptide membrane permeability
Journal of Cheminformatics, vol. 17 (1). Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01083-4
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01083-4
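The difference between the random split and the scaffold split mentioned in the cyclic peptide benchmark abstract can be illustrated with a minimal sketch. It assumes that a scaffold string has already been computed for every peptide (for example, a Murcko scaffold from a cheminformatics toolkit). The function names, the example identifiers, and the largest-group-first assignment heuristic are illustrative assumptions, not the procedure reported by the authors.

```python
import random
from collections import defaultdict


def random_split(ids, test_fraction=0.2, seed=0):
    """Shuffle the ids and hold out the first `test_fraction` of them as the test set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def scaffold_split(scaffold_by_id, test_fraction=0.2):
    """Assign whole scaffold groups to train or test so that no scaffold is shared.

    `scaffold_by_id` maps a molecule id to a precomputed scaffold string
    (hypothetical input; computing scaffolds is outside this sketch).
    """
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_by_id.items():
        groups[scaffold].append(mol_id)

    # Largest groups first (ties broken by scaffold string for determinism).
    # This ordering is a common convention and is an assumption here.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))

    n_total = len(scaffold_by_id)
    n_train_max = n_total - int(round(n_total * test_fraction))

    train, test = [], []
    for _, members in ordered:
        # A group goes to train only if it fits entirely; otherwise it goes to test.
        if len(train) + len(members) <= n_train_max:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Toy data: ten hypothetical peptides spread over five hypothetical scaffolds.
example = {
    "p1": "A", "p2": "A", "p3": "A",
    "p4": "B", "p5": "B",
    "p6": "C", "p7": "C",
    "p8": "D",
    "p9": "E", "p10": "E",
}
train_ids, test_ids = scaffold_split(example, test_fraction=0.2)
```

Because every test peptide carries a scaffold that never appears in training, a scaffold split probes extrapolation to new chemotypes, whereas a random split typically leaves close analogues on both sides. This is consistent with the lower scores under scaffold splitting reported in the abstract.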
Bitter peptides (BPs), derived from the hydrolysis of food proteins, play a crucial role in both food science and biomedicine by influencing taste perception and participating in various physiological processes. Accurate identification of BPs is essential for understanding food quality and potential health impacts. Traditional machine learning approaches for BP identification have relied on conventional feature descriptors, achieving moderate success but struggling with the complexity of biological sequence data. Recent approaches based on protein language model embeddings and meta-learning have improved accuracy, but they frequently neglect the molecular representations of peptides and lack interpretability. In this study, we propose xBitterT5, a novel multimodal and interpretable framework for BP identification that integrates pretrained transformer-based embeddings from BioT5+ with a combined input of the peptide sequence and its SELFIES molecular representation. By incorporating both peptide sequences and their molecular strings, xBitterT5 demonstrates superior performance compared to previous methods on the same benchmark datasets. Importantly, the model provides residue-level interpretability, highlighting chemically meaningful substructures that contribute significantly to a peptide's bitterness, thus offering mechanistic insights beyond black-box predictions. A user-friendly web server (https://balalab-skku.org/xBitterT5/) and a standalone version (https://github.com/cbbl-skku-org/xBitterT5/) are freely available to support both computational biologists and experimental researchers working on peptide-based food and biomedicine.
Title: xBitterT5: an explainable transformer-based framework with multimodal inputs for identifying bitter-taste peptides
Nguyen Doan Hieu Nguyen, Nhat Truong Pham, Duong Thanh Tran, Leyi Wei, Adeel Malik, Balachandran Manavalan
Journal of Cheminformatics, vol. 17 (1). Pub Date : 2025-08-20 DOI: 10.1186/s13321-025-01078-1
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01078-1