Pub Date : 2025-08-28DOI: 10.1186/s13321-025-01071-8
Yixiang Mao, Souparno Ghosh, Ranadip Pal
Quantitative structure–activity relationship (QSAR) modeling has become a critical tool in drug design. Recently proposed Topological Regression (TR), a computationally efficient and highly interpretable QSAR model that maps distances in the chemical domain to distances in the activity domain, has shown predictive performance comparable to state-of-the-art deep learning-based models. However, TR’s dependence on simple random sampling-based anchor selection and utilization of radial basis function for response reconstruction constrain its interpretability and predictive capacity. To address these limitations, we propose Adaptive Topological Regression (AdapToR) with adaptive anchor selection and optimization-based reconstruction. We evaluated AdapToR on the NCI60 GI50 dataset, which consists of over 50,000 drug responses across 60 human cancer cell lines, and compared its performance to Transformer CNN, Graph Transformer, TR, and other baseline models. The results demonstrate that AdapToR outperforms competing QSAR models for drug response prediction with significantly lower computational cost and greater interpretability as compared to deep learning-based models.
{"title":"AdapTor: Adaptive Topological Regression for quantitative structure–activity relationship modeling","authors":"Yixiang Mao, Souparno Ghosh, Ranadip Pal","doi":"10.1186/s13321-025-01071-8","DOIUrl":"10.1186/s13321-025-01071-8","url":null,"abstract":"<div><p>Quantitative structure–activity relationship (QSAR) modeling has become a critical tool in drug design. Recently proposed Topological Regression (TR), a computationally efficient and highly interpretable QSAR model that maps distances in the chemical domain to distances in the activity domain, has shown predictive performance comparable to state-of-the-art deep learning-based models. However, TR’s dependence on simple random sampling-based anchor selection and utilization of radial basis function for response reconstruction constrain its interpretability and predictive capacity. To address these limitations, we propose Adaptive Topological Regression (AdapToR) with adaptive anchor selection and optimization-based reconstruction. We evaluated AdapToR on the NCI60 GI50 dataset, which consists of over 50,000 drug responses across 60 human cancer cell lines, and compared its performance to Transformer CNN, Graph Transformer, TR, and other baseline models. The results demonstrate that AdapToR outperforms competing QSAR models for drug response prediction with significantly lower computational cost and greater interpretability as compared to deep learning-based models.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01071-8","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-28DOI: 10.1186/s13321-025-01088-z
Junseok Choe, Hajung Kim, Yan Ting Chok, Mogan Gim, Jaewoo Kang
Retrosynthesis—the process of deconstructing complex molecules into simpler, more accessible precursors—is a cornerstone of drug discovery and material design. While machine learning has improved single-step retrosynthesis prediction, generating complete multi-step retrosynthetic routes remains challenging. In this study, we explore the integration of single-step retrosynthesis models with various planning algorithms to improve multi-step retrosynthetic route generation. We expand the exploration space beyond previously limited settings by incorporating combinations of planning algorithms and single-step retrosynthesis models and diverse datasets, enabling a more comprehensive assessment of retrosynthetic strategies. We evaluated synthetic routes based on both solvability, the ability to generate a complete route, and route feasibility, which reflects their practical executability in the laboratory. Our findings show that the model combination with the highest solvability does not always produce the most feasible routes, underscoring the need for more nuanced evaluation. Through a systematic analysis of combinations of planning algorithms and single-step retrosynthesis models, their performance across different datasets, and various practical metrics, our study provides a more comprehensive evaluation of retrosynthetic planning strategies. These insights contribute to a better understanding of computational retrosynthesis and its alignment with real-world applicability.
{"title":"Retrosynthetic crosstalk between single-step reaction and multi-step planning","authors":"Junseok Choe, Hajung Kim, Yan Ting Chok, Mogan Gim, Jaewoo Kang","doi":"10.1186/s13321-025-01088-z","DOIUrl":"10.1186/s13321-025-01088-z","url":null,"abstract":"<div><p>Retrosynthesis—the process of deconstructing complex molecules into simpler, more accessible precursors—is a cornerstone of drug discovery and material design. While machine learning has improved single-step retrosynthesis prediction, generating complete multi-step retrosynthetic routes remains challenging. In this study, we explore the integration of single-step retrosynthesis models with various planning algorithms to improve multi-step retrosynthetic route generation. We expand the exploration space beyond previously limited settings by incorporating combinations of planning algorithms and single-step retrosynthesis models and diverse datasets, enabling a more comprehensive assessment of retrosynthetic strategies. We evaluated synthetic routes based on both solvability, the ability to generate a complete route, and route feasibility, which reflects their practical executability in the laboratory. Our findings show that the model combination with the highest solvability does not always produce the most feasible routes, underscoring the need for more nuanced evaluation. Through a systematic analysis of combinations of planning algorithms and single-step retrosynthesis models, their performance across different datasets, and various practical metrics, our study provides a more comprehensive evaluation of retrosynthetic planning strategies. These insights contribute to a better understanding of computational retrosynthesis and its alignment with real-world applicability.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01088-z","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144910812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-28DOI: 10.1186/s13321-025-01070-9
Maximilian Fleck, Samir Darouich, Marcelle B. M. Spera, Niels Hansen
When data availability is limited, the prediction of properties through purely data-driven machine learning (ML) is challenging. Integrating physically-based modeling techniques into ML methods may lead to better performance. In a recent work by Chew et al. (“Advancing material property prediction: using physics-informed machine learning models for viscosity”) descriptors from classical molecular dynamics (MD) simulations were included into a quantitative structure–property relationship to accurately predict temperature-dependent viscosity of pure liquids. Through feature importance analysis, the authors found that heat of vaporization was the most relevant descriptor for the prediction of viscosity. In this comment, we would like to discuss the physical origin of this finding by referring to Eyring’s rate theory, and develop an alternative modeling approach using a thermodynamic-based architecture that requires less input data.
{"title":"Comment on “Advancing material property prediction: using physics-informed machine learning models for viscosity”","authors":"Maximilian Fleck, Samir Darouich, Marcelle B. M. Spera, Niels Hansen","doi":"10.1186/s13321-025-01070-9","DOIUrl":"10.1186/s13321-025-01070-9","url":null,"abstract":"<div><p>When data availability is limited, the prediction of properties through purely data-driven machine learning (ML) is challenging. Integrating physically-based modeling techniques into ML methods may lead to better performance. In a recent work by Chew et al. (“<i>Advancing material property prediction: using physics-informed machine learning models for viscosity</i>”) descriptors from classical molecular dynamics (MD) simulations were included into a quantitative structure–property relationship to accurately predict temperature-dependent viscosity of pure liquids. Through feature importance analysis, the authors found that heat of vaporization was the most relevant descriptor for the prediction of viscosity. In this comment, we would like to discuss the physical origin of this finding by referring to Eyring’s rate theory, and develop an alternative modeling approach using a thermodynamic-based architecture that requires less input data.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01070-9","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144910811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-28DOI: 10.1186/s13321-025-01083-4
Wei Liu, Jianguo Li, Chandra S. Verma, Hwee Kuan Lee
Cyclic peptides are promising drug candidates due to their ability to modulate intracellular protein–protein interactions, a property often inaccessible to small molecules. However, their typically poor membrane permeability limits therapeutic applicability. Accurate computational prediction of permeability can accelerate the identification of cell-permeable candidates, reducing reliance on time-consuming and costly experimental screening. Although deep learning has shown potential in predicting molecular properties, its application in permeability prediction remains underexplored. A systematic evaluation of these models is important to assess current capabilities and guide future development. In this study, we conduct a comprehensive benchmark of 13 machine learning models for predicting cyclic peptide membrane permeability. These models cover four types of molecular representations: fingerprints, SMILES strings, molecular graphs, and 2D images. We use experimentally measured PAMPA permeability data from the CycPeptMPDB database, comprising nearly 6000 cyclic peptides, and evaluate performance across three prediction tasks: regression, binary classification, and soft-label classification. Two data-splitting strategies, random split and scaffold split, are used to assess the generalizability of trained models. Our results show that model performance depends strongly on molecular representation and model architecture. Graph-based models, particularly the Directed Message Passing Neural Network (DMPNN), consistently achieve top performance across tasks. Regression generally outperforms classification. Scaffold-based splitting, although intended to more rigorously assess generalization, yields substantially lower model generalizability compared to random splitting. Comparing prediction errors with experimental variability highlights the practical value of current models while also indicating room for further improvement.
{"title":"Systematic benchmarking of 13 AI methods for predicting cyclic peptide membrane permeability","authors":"Wei Liu, Jianguo Li, Chandra S. Verma, Hwee Kuan Lee","doi":"10.1186/s13321-025-01083-4","DOIUrl":"10.1186/s13321-025-01083-4","url":null,"abstract":"<div><p>Cyclic peptides are promising drug candidates due to their ability to modulate intracellular protein–protein interactions, a property often inaccessible to small molecules. However, their typically poor membrane permeability limits therapeutic applicability. Accurate computational prediction of permeability can accelerate the identification of cell-permeable candidates, reducing reliance on time-consuming and costly experimental screening. Although deep learning has shown potential in predicting molecular properties, its application in permeability prediction remains underexplored. A systematic evaluation of these models is important to assess current capabilities and guide future development. In this study, we conduct a comprehensive benchmark of 13 machine learning models for predicting cyclic peptide membrane permeability. These models cover four types of molecular representations: fingerprints, SMILES strings, molecular graphs, and 2D images. We use experimentally measured PAMPA permeability data from the CycPeptMPDB database, comprising nearly 6000 cyclic peptides, and evaluate performance across three prediction tasks: regression, binary classification, and soft-label classification. Two data-splitting strategies, random split and scaffold split, are used to assess the generalizability of trained models. Our results show that model performance depends strongly on molecular representation and model architecture. Graph-based models, particularly the Directed Message Passing Neural Network (DMPNN), consistently achieve top performance across tasks. Regression generally outperforms classification. Scaffold-based splitting, although intended to more rigorously assess generalization, yields substantially lower model generalizability compared to random splitting. Comparing prediction errors with experimental variability highlights the practical value of current models while also indicating room for further improvement.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01083-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bitter peptides (BPs), derived from the hydrolysis of proteins in food, play a crucial role in both food science and biomedicine by influencing taste perception and participating in various physiological processes. Accurate identification of BPs is essential for understanding food quality and potential health impacts. Traditional machine learning approaches for BP identification have relied on conventional feature descriptors, achieving moderate success but struggling with the complexities of biological sequence data. Recent advances utilizing protein language model embedding and meta-learning approaches have improved the accuracy, but frequently neglect the molecular representations of peptides and lack interpretability. In this study, we propose xBitterT5, a novel multimodal and interpretable framework for BP identification that integrates pretrained transformer-based embeddings from BioT5+ with the combination of peptide sequence and its SELFIES molecular representation. Specifically, incorporating both peptide sequences and their molecular strings, xBitterT5 demonstrates superior performance compared to previous methods on the same benchmark datasets. Importantly, the model provides residue-level interpretability, highlighting chemically meaningful substructures that significantly contribute to its bitterness, thus offering mechanistic insights beyond black-box predictions. A user-friendly web server (https://balalab-skku.org/xBitterT5/) and a standalone version (https://github.com/cbbl-skku-org/xBitterT5/) are freely available to support both computational biologists and experimental researchers in peptide-based food and biomedicine.
{"title":"xBitterT5: an explainable transformer-based framework with multimodal inputs for identifying bitter-taste peptides","authors":"Nguyen Doan Hieu Nguyen, Nhat Truong Pham, Duong Thanh Tran, Leyi Wei, Adeel Malik, Balachandran Manavalan","doi":"10.1186/s13321-025-01078-1","DOIUrl":"10.1186/s13321-025-01078-1","url":null,"abstract":"<div><p>Bitter peptides (BPs), derived from the hydrolysis of proteins in food, play a crucial role in both food science and biomedicine by influencing taste perception and participating in various physiological processes. Accurate identification of BPs is essential for understanding food quality and potential health impacts. Traditional machine learning approaches for BP identification have relied on conventional feature descriptors, achieving moderate success but struggling with the complexities of biological sequence data. Recent advances utilizing protein language model embedding and meta-learning approaches have improved the accuracy, but frequently neglect the molecular representations of peptides and lack interpretability. In this study, we propose xBitterT5, a novel multimodal and interpretable framework for BP identification that integrates pretrained transformer-based embeddings from BioT5+ with the combination of peptide sequence and its SELFIES molecular representation. Specifically, incorporating both peptide sequences and their molecular strings, xBitterT5 demonstrates superior performance compared to previous methods on the same benchmark datasets. Importantly, the model provides residue-level interpretability, highlighting chemically meaningful substructures that significantly contribute to its bitterness, thus offering mechanistic insights beyond black-box predictions. A user-friendly web server (https://balalab-skku.org/xBitterT5/) and a standalone version (https://github.com/cbbl-skku-org/xBitterT5/) are freely available to support both computational biologists and experimental researchers in peptide-based food and biomedicine.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01078-1","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144880932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-19DOI: 10.1186/s13321-025-01075-4
Tatsuya Sagawa, Ryosuke Kojima
Accurate chemical reaction prediction is critical for reducing both cost and time in drug development. This study introduces ReactionT5, a transformer-based chemical reaction foundation model pre-trained on the Open Reaction Database—a large publicly available reaction dataset. In benchmarks for product prediction, retrosynthesis, and yield prediction, ReactionT5 outperformed existing models. Specifically, ReactionT5 achieved 97.5% accuracy in product prediction, 71.0% in retrosynthesis, and a coefficient of determination of 0.947 in yield prediction. Remarkably, ReactionT5, when fine-tuned with only a limited dataset of reactions, achieved performance on par with models fine-tuned on the complete dataset. Additionally, the visualization of ReactionT5 embeddings illustrates that the model successfully captures and represents the chemical reaction space, indicating effective learning of reaction properties.