Molecular representations of chirality, derived from latent space vectors (LSVs) of SMILES heteroencoders, were explored to train machine learning models to predict chiral properties, and were compared to conventional circular fingerprints. Latent space arithmetic was applied to enhance the representation of chirality, by calculating differences between the original descriptor of a molecule and the descriptor of its enantiomer, or the difference between the original descriptor and the descriptor obtained with the stereochemistry-depleted SMILES string. Machine learning was performed with the Random Forest algorithm applied to a dataset of 3858 molecules extracted from the literature (1929 pairs of enantiomers) to predict the elution order observed on the Chiralpak® AD-H column, as well as intrinsic structural chirality labels (R/S or canonical SMILES @/@@). The descriptors derived from the heteroencoders achieved an accuracy of up to 0.75 in the prediction of the elution order, and the fingerprints were superior (0.82). A better predictive ability was observed with the difference LSV descriptors than with the original descriptors.
Our work proposes latent space arithmetic to derive descriptors of molecular chirality from SMILES heteroencoders. We used this molecular representation to build quantitative structure-enantioselectivity relationships that predict the elution order of enantiomers in chiral chromatography, and compared the results with those obtained from circular fingerprints. The study shows that delta descriptors computed relative to the opposite enantiomer enhance the ability of latent space vectors to encode chirality.
Evaluation of chirality descriptors derived from SMILES heteroencoders. Natalia Baimacheva, Xinyue Gao, Joao Aires-de-Sousa. Journal of Cheminformatics 17(1). Published 2025-08-31. DOI: 10.1186/s13321-025-01080-7
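The difference descriptors described in the abstract can be sketched as follows. This is a minimal illustration only: `toy_encoder` is a hypothetical stand-in for a trained SMILES heteroencoder (it is not the authors' model), and the helper names `invert_stereo`, `strip_stereo`, `enantiomer_delta` and `achiral_delta` are invented here. The string-level swap handles only tetrahedral `@`/`@@` tags, not `/` and `\` double-bond stereo.

```python
from typing import Callable, List

Vector = List[float]


def invert_stereo(smiles: str) -> str:
    """Swap every '@' and '@@' tag, giving the mirror-image SMILES."""
    placeholder = "\x00"  # temporary marker that never occurs in SMILES
    return (smiles.replace("@@", placeholder)
                  .replace("@", "@@")
                  .replace(placeholder, "@"))


def strip_stereo(smiles: str) -> str:
    """Remove tetrahedral tags to get a stereochemistry-depleted SMILES."""
    return smiles.replace("@", "")


def toy_encoder(smiles: str, dim: int = 16) -> Vector:
    """Hypothetical stand-in for a heteroencoder: position-aware char counts."""
    vec = [0.0] * dim
    for i, ch in enumerate(smiles):
        vec[(ord(ch) * 31 + i) % dim] += 1.0
    return vec


def delta(a: Vector, b: Vector) -> Vector:
    """Element-wise difference between two latent space vectors."""
    return [x - y for x, y in zip(a, b)]


def enantiomer_delta(smiles: str,
                     encode: Callable[[str], Vector] = toy_encoder) -> Vector:
    """LSV(molecule) minus LSV(enantiomer)."""
    return delta(encode(smiles), encode(invert_stereo(smiles)))


def achiral_delta(smiles: str,
                  encode: Callable[[str], Vector] = toy_encoder) -> Vector:
    """LSV(molecule) minus LSV(stereochemistry-depleted SMILES)."""
    return delta(encode(smiles), encode(strip_stereo(smiles)))


if __name__ == "__main__":
    alanine = "C[C@H](N)C(=O)O"  # one alanine enantiomer
    print(invert_stereo(alanine))
    print(enantiomer_delta(alanine))
```

By construction the enantiomer difference changes sign when the two enantiomers are exchanged, which is one plausible reason such a descriptor separates mirror-image pairs more cleanly than the raw vectors.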
Pub Date: 2025-08-29; DOI: 10.1186/s13321-025-01077-2
Dayan Liu, Tao Song, Shuang Wang, Xue Li, Peifu Han, Jianmin Wang, Shudong Wang
Protein-protein interactions (PPIs) regulate essential biological processes through complex interfaces, and their dysfunction is associated with various diseases. Consequently, the identification of PPIs and their interface-targeting modulators has emerged as a critical therapeutic approach. However, discovering modulators that target PPIs and PPI interfaces remains challenging because traditional structure-similarity-based methods fail to characterize PPI targets effectively, particularly those for which no active compounds are known. Here, we present AlphaPPIMI, a comprehensive deep learning framework that combines large-scale pretrained language models with domain adaptation for predicting PPI-modulator interactions, specifically targeting the PPI interface. To enable robust model development and evaluation, we constructed comprehensive benchmark datasets of PPI-modulator interactions (PPIMI). Our framework integrates comprehensive molecular features from Uni-Mol2, protein representations derived from state-of-the-art language models (ESM2 and ProTrans), and PPI structural characteristics encoded by PFeature. Through a specialized cross-attention architecture and conditional domain adversarial networks (CDAN), AlphaPPIMI effectively learns potential associations between PPI targets and modulators while ensuring robust cross-domain generalization. Extensive evaluations indicate that AlphaPPIMI achieves consistently improved performance over existing methods in PPIMI prediction, offering a promising approach for prioritizing candidate PPI modulators, particularly those targeting protein–protein interfaces.
This work presents AlphaPPIMI, a novel deep learning framework for accurately predicting modulators targeting protein-protein interactions (PPIs) and their interfaces. Its core contributions include a specialized cross-attention module for the synergistic fusion of multimodal pretrained representations, and the novel application of a Conditional Domain Adversarial Network (CDAN) to significantly improve generalization across diverse protein families. AlphaPPIMI demonstrates superior performance on curated benchmarks, providing a powerful computational tool for the discovery of targeted PPI therapeutics.
AlphaPPIMI: a comprehensive deep learning framework for predicting PPI-modulator interactions. Dayan Liu, Tao Song, Shuang Wang, Xue Li, Peifu Han, Jianmin Wang, Shudong Wang. Journal of Cheminformatics 17(1). Published 2025-08-29. DOI: 10.1186/s13321-025-01077-2
Pub Date: 2025-08-29; DOI: 10.1186/s13321-025-01062-9
Roda Bounaceur, Francisco Paes, Romain Privat, Jean-Noël Jaubert
In this paper, we propose a robust deep-learning model based on a Quantitative Structure-Property Relationship (QSPR) approach for estimating the critical temperature (TC), critical pressure (PC), acentric factor (ACEN) and normal boiling point (NBP) of any molecule composed of C, H, O, N, S, P, F, Cl, Br and I. The Mordred calculator was used to compute 247 descriptors characterizing the molecules considered in this work. For each evaluated property, multiple neural networks were trained within a bagging framework. The predictions from the final ensemble were successfully tested against a large set of experimental data comprising more than 1700 molecules and compared with those from several recent learning models found in the literature. Comprehensive comparisons and extensive testing highlight the robustness and predictive power of the newly proposed multimodal learning model. The developed prediction tool is available on a website at https://lrgp-thermoppt.streamlit.app/. Furthermore, source code for running the trained models in Python is available on GitHub at https://github.com/bounac80/AI-ThermPpt.
AI-powered prediction of critical properties and boiling points: a hybrid ensemble learning and QSPR approach. Roda Bounaceur, Francisco Paes, Romain Privat, Jean-Noël Jaubert. Journal of Cheminformatics 17(1). Published 2025-08-29. DOI: 10.1186/s13321-025-01062-9
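The bagging step mentioned in the abstract (several base models trained on bootstrap resamples, predictions averaged) can be sketched as below. This is an illustration under stated assumptions: a one-variable least-squares line replaces the neural networks used in the paper, and the names `fit_line`, `bagging_fit` and `bagging_predict` are hypothetical, not taken from the authors' repository.

```python
import random
from typing import Callable, List, Sequence

Model = Callable[[float], float]


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Base learner (stand-in for a neural network): least-squares line."""
    mx, my = _mean(xs), _mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        # Degenerate resample with a single distinct x: predict the mean.
        return lambda x: my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def bagging_fit(xs: Sequence[float], ys: Sequence[float],
                n_models: int = 25, seed: int = 0) -> List[Model]:
    """Train one base learner per bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def bagging_predict(models: Sequence[Model], x: float) -> float:
    """Ensemble prediction: average over all base learners."""
    return _mean([m(x) for m in models])


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2.0 * x + 1.0 for x in xs]
    ensemble = bagging_fit(xs, ys)
    print(bagging_predict(ensemble, 2.0))
```

Averaging over resampled learners mainly reduces variance, which is the usual motivation for bagging unstable models such as neural networks.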
Pub Date: 2025-08-29; DOI: 10.1186/s13321-025-01081-6
Lucina-May Nollen, David Meijer, Maria Sorokina, Justin J. J. van der Hooft
Natural products provide a rich source of bioactive molecules for a variety of applications. Molecular fingerprints are the tool of choice for systematic large-scale studies of their structures. However, current molecular fingerprints inherently under-represent the characteristic features of natural products, which decreases the interpretability of natural product-specific predictions. Here, we show that a natural product-specific molecular fingerprint based on a relatively small set of selected biosynthetic building blocks provides more interpretable predictions of biosynthetic distance and natural product classification. Our fingerprint, Biosynfoni, uses 39 substructure keys and outperforms MACCS, Morgan, and Daylight-like fingerprints in biosynthetic distance estimation. Moreover, Biosynfoni’s design, compactness, and concrete substructure definition allow easy visualisation of the detected substructures and their respective biosynthetic pathway origins. Through Biosynfoni, users can gain more insights from predictions and better examine the importance of features within machine learning models. Our results show that a short fingerprint consisting of biologically significant building blocks performs on par with top-performing molecular fingerprints for natural product classification while improving prediction explainability.
Biosynfoni: a biosynthesis-informed and interpretable lightweight molecular fingerprint. Lucina-May Nollen, David Meijer, Maria Sorokina, Justin J. J. van der Hooft. Journal of Cheminformatics 17(1). Published 2025-08-29. DOI: 10.1186/s13321-025-01081-6
Pub Date: 2025-08-29; DOI: 10.1186/s13321-025-01073-6
Yutong Lu, Yan Yi Li, Yan Sun, Pingzhao Hu
Chemical Language Models (CLMs) have demonstrated capabilities in extracting patterns and making predictions from vast volumes of Simplified Molecular Input Line Entry System (SMILES) strings, a notation used to represent molecular structures. Different CLMs, developed from various architectures, can provide unique insights into molecular properties. To harness the uniqueness of different CLMs, we propose FusionCLM, a novel stacking-ensemble learning algorithm that integrates the outputs of multiple CLMs into a unified framework. FusionCLM first generates SMILES embeddings, predictions, and losses from each CLM. Auxiliary models are trained on these first-level predictions and embeddings to estimate test losses during inference. The losses and predictions are then concatenated to create an integrated feature matrix, which trains second-level meta-models for final predictions. Empirical testing on five datasets demonstrates that FusionCLM outperforms the individual first-level CLMs and three advanced multimodal deep learning frameworks, showcasing FusionCLM’s potential in advancing molecular property prediction.
FusionCLM: enhanced molecular property prediction via knowledge fusion of chemical language models. Yutong Lu, Yan Yi Li, Yan Sun, Pingzhao Hu. Journal of Cheminformatics 17(1). Published 2025-08-29. DOI: 10.1186/s13321-025-01073-6
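The data flow described in the abstract (first-level predictions and losses, auxiliary models that estimate losses at inference, a concatenated feature matrix, a second-level meta-model) can be sketched roughly as below. Everything here is a simplification chosen for illustration, not the FusionCLM implementation: plain Python functions stand in for the CLMs, the scalar input `x` stands in for a SMILES embedding, absolute error is used as the loss, and a k-nearest-neighbour regressor (`knn_predict`, a hypothetical helper) replaces both the auxiliary models and the meta-model.

```python
from typing import Callable, List, Sequence

Model = Callable[[float], float]


def knn_predict(train_X: Sequence[Sequence[float]], train_y: Sequence[float],
                query: Sequence[float], k: int) -> float:
    """Average the targets of the k training rows closest to the query."""
    def sq_dist(row: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(row, query))
    ranked = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i]))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def fuse_predict(models: Sequence[Model], x_train: Sequence[float],
                 y_train: Sequence[float], x_new: float, k: int = 1) -> float:
    # Level 1: per-model predictions and losses on the training data.
    preds_tr = [[m(x) for m in models] for x in x_train]
    loss_tr = [[abs(p - y) for p in row] for row, y in zip(preds_tr, y_train)]
    preds_new = [m(x_new) for m in models]

    # Auxiliary step: estimate each model's loss on the new sample from
    # its (embedding, prediction) pair, since the true label is unknown.
    est_loss: List[float] = []
    for j in range(len(models)):
        aux_X = [[x, row[j]] for x, row in zip(x_train, preds_tr)]
        aux_y = [row[j] for row in loss_tr]
        est_loss.append(knn_predict(aux_X, aux_y, [x_new, preds_new[j]], k))

    # Level 2: meta-model on the concatenated [predictions | losses] features.
    meta_X = [p + l for p, l in zip(preds_tr, loss_tr)]
    return knn_predict(meta_X, y_train, preds_new + est_loss, k)


if __name__ == "__main__":
    first_level = [lambda x: x, lambda x: 2.0 * x]
    x_train = [0.0, 1.0, 2.0, 3.0]
    y_train = [0.0, 1.0, 4.0, 6.0]
    print(fuse_predict(first_level, x_train, y_train, 2.0))
```

The estimated losses act as a per-sample reliability signal, so the meta-model can learn which first-level model to trust in which region of input space.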
In recent years, the integration of Artificial Intelligence and Machine Learning methods with biochemical and biomedical research has revolutionized the field of toxicology, significantly advancing our understanding of the toxicological effects of chemicals on biological systems. Cardiovascular diseases remain the leading global cause of death. Constant exposure to multiple chemicals with potential cardiotoxic effects, including environmental contaminants, pesticides, food additives, and drugs, can significantly contribute to these adverse health outcomes. Traditional methods for assessing chemical hazards and their impact on biological function rely heavily on experimental assays and animal studies, which are often time-consuming, resource-intensive, and limited in scalability. To overcome these limitations, in silico methods have emerged as indispensable tools in toxicological research, reducing the need for traditional in vivo testing and conserving valuable resources in terms of time and cost. In this study, Artificial Intelligence methods are used as first-tier components within an Integrated Approach to Testing and Assessment. We explored the potential benefits of using Multitask Neural Networks, where multiple levels of cardiotoxicity information are combined to enhance model performance. Multitask learning, based on specific architectures such as Mixture of Experts (MoE), showed promising results and surpassed the performance of single-task baseline models. When predicting a holdout set, the multitask model achieved high performance on twelve different cardiotoxicity-related endpoints defined by an Adverse Outcome Pathway network. The best model achieved a balanced accuracy of 78%, a sensitivity of 80%, and a specificity of 76% across all endpoints in the holdout set.
An advanced multitask model was developed to predict cardiotoxicity mechanisms induced by small molecules. The model demonstrates broad mechanistic coverage and achieves performance comparable to, or exceeding, state-of-the-art methods. These results suggest that the model could serve as a valuable first-tier component in advanced New Approach Methodologies for prioritizing chemicals for further testing.
Mixture of experts for multitask learning in cardiotoxicity assessment. Edoardo Luca Viganò, Mateusz Iwan, Erika Colombo, Davide Ballabio, Alessandra Roncaglioni. Journal of Cheminformatics 17(1). Published 2025-08-29. DOI: 10.1186/s13321-025-01072-7
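The three reported metrics are linked by a simple identity: balanced accuracy is the mean of sensitivity and specificity, so 80% and 76% give 78%. The short check below uses hypothetical confusion-matrix counts chosen only to reproduce those rates; they are not the study's data.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of actual positives correctly flagged."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of actual negatives correctly cleared."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity; robust to class imbalance."""
    return 0.5 * (sensitivity(tp, fn) + specificity(tn, fp))


if __name__ == "__main__":
    # Hypothetical counts: 100 positives (80 found), 100 negatives (76 cleared).
    print(round(balanced_accuracy(tp=80, fn=20, tn=76, fp=24), 2))
```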
Pub Date: 2025-08-28; DOI: 10.1186/s13321-025-01071-8
Yixiang Mao, Souparno Ghosh, Ranadip Pal
Quantitative structure–activity relationship (QSAR) modeling has become a critical tool in drug design. The recently proposed Topological Regression (TR), a computationally efficient and highly interpretable QSAR model that maps distances in the chemical domain to distances in the activity domain, has shown predictive performance comparable to state-of-the-art deep learning-based models. However, TR’s dependence on anchor selection by simple random sampling and its use of radial basis functions for response reconstruction constrain its interpretability and predictive capacity. To address these limitations, we propose Adaptive Topological Regression (AdapToR) with adaptive anchor selection and optimization-based reconstruction. We evaluated AdapToR on the NCI60 GI50 dataset, which consists of over 50,000 drug responses across 60 human cancer cell lines, and compared its performance to Transformer CNN, Graph Transformer, TR, and other baseline models. The results demonstrate that AdapToR outperforms competing QSAR models for drug response prediction, with significantly lower computational cost and greater interpretability than deep learning-based models.
AdapToR: Adaptive Topological Regression for quantitative structure–activity relationship modeling. Yixiang Mao, Souparno Ghosh, Ranadip Pal. Journal of Cheminformatics 17(1). Published 2025-08-28. DOI: 10.1186/s13321-025-01071-8
Pub Date: 2025-08-28; DOI: 10.1186/s13321-025-01088-z
Junseok Choe, Hajung Kim, Yan Ting Chok, Mogan Gim, Jaewoo Kang
Retrosynthesis, the process of deconstructing complex molecules into simpler, more accessible precursors, is a cornerstone of drug discovery and material design. While machine learning has improved single-step retrosynthesis prediction, generating complete multi-step retrosynthetic routes remains challenging. In this study, we explore the integration of single-step retrosynthesis models with various planning algorithms to improve multi-step retrosynthetic route generation. We expand the exploration space beyond previously limited settings by incorporating combinations of planning algorithms, single-step retrosynthesis models, and diverse datasets, enabling a more comprehensive assessment of retrosynthetic strategies. We evaluated synthetic routes on two criteria: solvability, the ability to generate a complete route, and route feasibility, which reflects practical executability in the laboratory. Our findings show that the model combination with the highest solvability does not always produce the most feasible routes, underscoring the need for more nuanced evaluation. Through a systematic analysis of combinations of planning algorithms and single-step retrosynthesis models, their performance across different datasets, and various practical metrics, our study provides a more comprehensive evaluation of retrosynthetic planning strategies. These insights contribute to a better understanding of computational retrosynthesis and its alignment with real-world applicability.
Retrosynthetic crosstalk between single-step reaction and multi-step planning. Junseok Choe, Hajung Kim, Yan Ting Chok, Mogan Gim, Jaewoo Kang. Journal of Cheminformatics 17(1). Published 2025-08-28. DOI: 10.1186/s13321-025-01088-z
Pub Date: 2025-08-28 DOI: 10.1186/s13321-025-01070-9
Maximilian Fleck, Samir Darouich, Marcelle B. M. Spera, Niels Hansen
When data availability is limited, predicting properties through purely data-driven machine learning (ML) is challenging. Integrating physically based modeling techniques into ML methods may lead to better performance. In a recent work by Chew et al. (“Advancing material property prediction: using physics-informed machine learning models for viscosity”), descriptors from classical molecular dynamics (MD) simulations were incorporated into a quantitative structure–property relationship to accurately predict the temperature-dependent viscosity of pure liquids. Through feature importance analysis, the authors found that the heat of vaporization was the most relevant descriptor for predicting viscosity. In this comment, we discuss the physical origin of this finding with reference to Eyring’s rate theory and develop an alternative modeling approach, using a thermodynamics-based architecture, that requires less input data.
{"title":"Comment on “Advancing material property prediction: using physics-informed machine learning models for viscosity”","authors":"Maximilian Fleck, Samir Darouich, Marcelle B. M. Spera, Niels Hansen","doi":"10.1186/s13321-025-01070-9","DOIUrl":"10.1186/s13321-025-01070-9","url":null,"PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01070-9","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144910811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Chemistry","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
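The comment summarized above attributes the importance of the heat of vaporization to Eyring's rate theory. As a rough illustration of why the two quantities are linked, the sketch below evaluates the textbook Eyring form of the viscosity, eta = (N_A h / V_m) exp(dG / (R T)), with the activation free energy taken as a fixed fraction of the internal energy of vaporization. This is not the model developed in the comment: the function name, the default fraction of 0.408 (an empirical value often quoted in textbook treatments) and the example inputs are all assumptions made for illustration.

```python
import math

H = 6.62607015e-34    # Planck constant, J s
N_A = 6.02214076e23   # Avogadro constant, 1/mol
R = 8.314462618       # molar gas constant, J/(mol K)


def eyring_viscosity(temp_k, molar_volume_m3, delta_u_vap_j_mol, fraction=0.408):
    """Estimate the viscosity (Pa s) of a liquid from the Eyring rate expression.

    Assumption: the activation free energy for flow is approximated as
    `fraction` times the internal energy of vaporization. The default
    fraction is an empirical textbook value, not a fitted parameter.
    """
    delta_g = fraction * delta_u_vap_j_mol  # activation free energy, J/mol
    prefactor = N_A * H / molar_volume_m3   # units: Pa s
    return prefactor * math.exp(delta_g / (R * temp_k))


# Hypothetical liquid: molar volume 1e-4 m^3/mol, vaporization energy 30 kJ/mol.
for temperature in (280.0, 300.0, 320.0):
    eta = eyring_viscosity(temperature, 1e-4, 30e3)
    print(f"T = {temperature:.0f} K  ->  eta = {eta:.2e} Pa s")
```

In this form the vaporization energy sits inside the exponential, so a larger value raises the predicted viscosity and steepens its temperature dependence, which is consistent with the descriptor ranking reported in the abstract.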
Pub Date: 2025-08-28 DOI: 10.1186/s13321-025-01083-4
Wei Liu, Jianguo Li, Chandra S. Verma, Hwee Kuan Lee
Cyclic peptides are promising drug candidates due to their ability to modulate intracellular protein–protein interactions, a property often inaccessible to small molecules. However, their typically poor membrane permeability limits therapeutic applicability. Accurate computational prediction of permeability can accelerate the identification of cell-permeable candidates, reducing reliance on time-consuming and costly experimental screening. Although deep learning has shown potential in predicting molecular properties, its application in permeability prediction remains underexplored. A systematic evaluation of these models is important to assess current capabilities and guide future development. In this study, we conduct a comprehensive benchmark of 13 machine learning models for predicting cyclic peptide membrane permeability. These models cover four types of molecular representations: fingerprints, SMILES strings, molecular graphs, and 2D images. We use experimentally measured PAMPA permeability data from the CycPeptMPDB database, comprising nearly 6000 cyclic peptides, and evaluate performance across three prediction tasks: regression, binary classification, and soft-label classification. Two data-splitting strategies, random split and scaffold split, are used to assess the generalizability of trained models. Our results show that model performance depends strongly on molecular representation and model architecture. Graph-based models, particularly the Directed Message Passing Neural Network (DMPNN), consistently achieve top performance across tasks. Regression generally outperforms classification. Scaffold-based splitting, although intended to more rigorously assess generalization, yields substantially lower model generalizability compared to random splitting. Comparing prediction errors with experimental variability highlights the practical value of current models while also indicating room for further improvement.
{"title":"Systematic benchmarking of 13 AI methods for predicting cyclic peptide membrane permeability","authors":"Wei Liu, Jianguo Li, Chandra S. Verma, Hwee Kuan Lee","doi":"10.1186/s13321-025-01083-4","DOIUrl":"10.1186/s13321-025-01083-4","url":null,"PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01083-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"Chemistry","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
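The benchmark summarized above contrasts random and scaffold-based data splitting. The difference can be sketched in a few lines, assuming each molecule already carries a scaffold key (in practice this might be a Murcko scaffold SMILES computed with a cheminformatics toolkit). The function names and the single-letter scaffold labels below are hypothetical, and the greedy assignment is one common variant rather than the procedure used in the study.

```python
import random
from collections import defaultdict


def random_split(n_items, test_fraction=0.2, seed=0):
    """Shuffle indices and cut once; similar molecules can land in both sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_items * test_fraction))
    return indices[n_test:], indices[:n_test]


def scaffold_split(scaffolds, test_fraction=0.2):
    """Assign whole scaffold groups to one side so no scaffold is shared."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups are placed first; groups that no longer fit go to test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    train_target = len(scaffolds) * (1 - test_fraction)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical scaffold labels for ten molecules.
scaffolds = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train_idx, test_idx = scaffold_split(scaffolds)
train_scaffolds = {scaffolds[i] for i in train_idx}
test_scaffolds = {scaffolds[i] for i in test_idx}
print("train scaffolds:", sorted(train_scaffolds))
print("test scaffolds:", sorted(test_scaffolds))
```

Because the scaffold split keeps every scaffold on one side only, the test set contains chemotypes the model has never seen, which makes it a stricter test than a random split and fits the lower scores the abstract reports for it.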