Journal of Cheminformatics: latest articles, page 10

FusionCLM: enhanced molecular property prediction via knowledge fusion of chemical language models

IF 5.7, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-29 DOI: 10.1186/s13321-025-01073-6

Yutong Lu, Yan Yi Li, Yan Sun, Pingzhao Hu

Chemical language models (CLMs) have demonstrated the ability to extract patterns and make predictions from vast volumes of Simplified Molecular Input Line Entry System (SMILES) strings, a notation used to represent molecular structures. Different CLMs, built on different architectures, can provide unique insights into molecular properties. To harness the uniqueness of different CLMs, we propose FusionCLM, a novel stacking-ensemble learning algorithm that integrates the outputs of multiple CLMs into a unified framework. FusionCLM first generates SMILES embeddings, predictions, and losses from each CLM. Auxiliary models are trained on these first-level predictions and embeddings to estimate test losses during inference. The losses and predictions are then concatenated to create an integrated feature matrix, which is used to train second-level meta-models for final predictions. Empirical testing on five datasets demonstrates that FusionCLM performs better than the individual first-level CLMs and three advanced multimodal deep learning frameworks, showcasing FusionCLM's potential in advancing molecular property prediction.

FusionCLM uses a stacking-ensemble learning approach that integrates the distinct representations learned by multiple CLMs, allowing more comprehensive learning from molecular SMILES data. This can yield more accurate molecular property predictions and help facilitate the early discovery and development of promising drug candidates. Evaluated against individual CLMs and existing multimodal deep learning frameworks, FusionCLM shows improved predictive accuracy, which distinguishes it from prior models in the field.
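The feature-fusion step described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the squared-error loss, and the model keys `clm_a` and `clm_b` are all assumptions. At training time the per-sample losses are computed from labels; at inference they would instead come from the auxiliary loss-estimating models.

```python
from typing import Dict, List


def squared_losses(preds: List[float], targets: List[float]) -> List[float]:
    """Per-sample squared error of one first-level model (assumed loss)."""
    return [(p - t) ** 2 for p, t in zip(preds, targets)]


def build_meta_features(
    preds_by_model: Dict[str, List[float]],
    losses_by_model: Dict[str, List[float]],
) -> List[List[float]]:
    """Concatenate predictions and losses of all models, one row per sample.

    Row layout: [pred_model_1, ..., pred_model_k, loss_model_1, ..., loss_model_k],
    with models taken in sorted name order so the columns are stable.
    """
    names = sorted(preds_by_model)
    n_samples = len(preds_by_model[names[0]])
    rows = []
    for i in range(n_samples):
        row = [preds_by_model[m][i] for m in names]
        row += [losses_by_model[m][i] for m in names]
        rows.append(row)
    return rows


if __name__ == "__main__":
    targets = [1.0, 3.0]
    preds = {"clm_a": [1.0, 2.0], "clm_b": [1.5, 2.5]}  # hypothetical outputs
    losses = {m: squared_losses(p, targets) for m, p in preds.items()}
    for row in build_meta_features(preds, losses):
        print(row)
```

The resulting rows are what a second-level meta-model would be trained on; any regressor or classifier could fill that role in this sketch.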


Citations: 0

Mixture of experts for multitask learning in cardiotoxicity assessment

IF 5.7, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-29 DOI: 10.1186/s13321-025-01072-7

Edoardo Luca Viganò, Mateusz Iwan, Erika Colombo, Davide Ballabio, Alessandra Roncaglioni

In recent years, the integration of Artificial Intelligence and Machine Learning methods with biochemical and biomedical research has revolutionized the field of toxicology, significantly advancing our understanding of the toxicological effects of chemicals on biological systems. Cardiovascular diseases remain the leading global cause of death. Constant exposure to multiple chemicals with potential cardiotoxic effects, including environmental contaminants, pesticides, food additives, and drugs, can significantly contribute to these adverse health outcomes. Traditional methods for assessing chemical hazards and their impact on biological function rely heavily on experimental assays and animal studies, which are often time-consuming, resource-intensive, and limited in scalability. To overcome these limitations, in silico methods have emerged as indispensable tools in toxicological research, reducing the need for traditional in vivo testing and saving time and cost. In this study, Artificial Intelligence methods are used as first-tier components within an Integrated Approach to Testing and Assessment. We explored the potential benefits of using Multitask Neural Networks, where multiple levels of cardiotoxicity information are combined to enhance model performance. Multitask learning based on specific architectures such as Mixture of Experts (MoE) showed promising results and surpassed the performance of single-task baseline models. On a holdout set, the multitask model achieved high performance on twelve different cardiotoxicity-related endpoints defined by an Adverse Outcome Pathway network. The best model achieved a balanced accuracy of 78%, a sensitivity of 80%, and a specificity of 76% across all endpoints in the holdout set.
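To make the Mixture of Experts idea concrete, here is a minimal sketch of how a softmax gate combines expert outputs for one task. The abstract does not specify the architecture, so the generic softmax gating, the function names, and the toy numbers are assumptions; in a multitask setting each endpoint would typically have its own gate over shared experts.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_combine(expert_outputs: List[List[float]], gate_logits: List[float]) -> List[float]:
    """Weighted sum of expert output vectors, with weights given by the gate."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]


if __name__ == "__main__":
    experts = [[1.0, 0.0], [3.0, 2.0]]  # two hypothetical expert outputs
    print(moe_combine(experts, [0.0, 0.0]))  # equal gate logits: plain average
```

With equal gate logits the result is the plain average of the experts; as one logit grows, the output approaches that expert's vector.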

An advanced multitask model was developed to predict cardiotoxicity mechanisms induced by small molecules. The model demonstrates broad mechanistic coverage and achieves performance comparable to, or exceeding, state-of-the-art methods. These results suggest that the model could serve as a valuable first-tier component in advanced New Approach Methodologies for prioritizing chemicals for further testing.
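The three reported metrics are linked by a simple identity: balanced accuracy is the mean of sensitivity and specificity. The short sketch below checks that the reported figures are mutually consistent; the function names and the confusion-matrix counts in the usage lines are illustrative assumptions, not data from the study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of actual positives predicted positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of actual negatives predicted negative."""
    return tn / (tn + fp)


def balanced_accuracy(sens: float, spec: float) -> float:
    """Mean of sensitivity and specificity."""
    return (sens + spec) / 2


if __name__ == "__main__":
    # Reported values: sensitivity 80%, specificity 76%.
    print(round(balanced_accuracy(0.80, 0.76), 2))
    # Hypothetical counts that would give those rates.
    print(sensitivity(tp=80, fn=20), specificity(tn=76, fp=24))
```

Averaging 80% and 76% gives 78%, which matches the balanced accuracy stated in the abstract.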


Citations: 0

AdapTor: Adaptive Topological Regression for quantitative structure–activity relationship modeling

IF 5.7, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01071-8

Yixiang Mao, Souparno Ghosh, Ranadip Pal

Quantitative structure–activity relationship (QSAR) modeling has become a critical tool in drug design. The recently proposed Topological Regression (TR), a computationally efficient and highly interpretable QSAR model that maps distances in the chemical domain to distances in the activity domain, has shown predictive performance comparable to state-of-the-art deep learning-based models. However, TR's dependence on anchor selection by simple random sampling and its use of a radial basis function for response reconstruction constrain its interpretability and predictive capacity. To address these limitations, we propose Adaptive Topological Regression (AdapToR), which uses adaptive anchor selection and optimization-based reconstruction. We evaluated AdapToR on the NCI60 GI50 dataset, which consists of over 50,000 drug responses across 60 human cancer cell lines, and compared its performance to Transformer CNN, Graph Transformer, TR, and other baseline models. The results demonstrate that AdapToR outperforms competing QSAR models for drug response prediction, with significantly lower computational cost and greater interpretability than deep learning-based models.
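The reconstruction step that the abstract attributes to the original TR can be sketched as below: given predicted activity-space distances from a query compound to a set of anchors, the response is rebuilt as a kernel-weighted average of the anchor responses. The Gaussian kernel form, the `sigma` parameter, and the function name are assumptions made for illustration; the abstract only states that a radial basis function is used, and AdapToR replaces this step with an optimization-based one.

```python
import math
from typing import List


def rbf_reconstruct(
    pred_distances: List[float],
    anchor_responses: List[float],
    sigma: float = 1.0,
) -> float:
    """Estimate a response as an RBF-weighted mean of anchor responses.

    Anchors predicted to be close in activity space get larger weights.
    """
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in pred_distances]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all anchors are too far away for this sigma")
    return sum(w * y for w, y in zip(weights, anchor_responses)) / total


if __name__ == "__main__":
    # Two hypothetical anchors with responses 2.0 and 4.0.
    print(rbf_reconstruct([1.0, 1.0], [2.0, 4.0]))   # equidistant anchors
    print(rbf_reconstruct([0.0, 50.0], [2.0, 4.0]))  # first anchor dominates
```

Because the weights depend only on a fixed kernel, the estimate cannot adapt to how informative each anchor is, which is one way to read the limitation the abstract mentions.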


Citations: 0

Retrosynthetic crosstalk between single-step reaction and multi-step planning

IF 5.7, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01088-z

Junseok Choe, Hajung Kim, Yan Ting Chok, Mogan Gim, Jaewoo Kang

Retrosynthesis, the process of deconstructing complex molecules into simpler, more accessible precursors, is a cornerstone of drug discovery and material design. While machine learning has improved single-step retrosynthesis prediction, generating complete multi-step retrosynthetic routes remains challenging. In this study, we explore the integration of single-step retrosynthesis models with various planning algorithms to improve multi-step retrosynthetic route generation. We expand the exploration space beyond previously limited settings by incorporating combinations of planning algorithms and single-step retrosynthesis models, along with diverse datasets, enabling a more comprehensive assessment of retrosynthetic strategies. We evaluated synthetic routes on two criteria: solvability (the ability to generate a complete route) and route feasibility (the practical executability of a route in the laboratory). Our findings show that the model combination with the highest solvability does not always produce the most feasible routes, underscoring the need for more nuanced evaluation. Through a systematic analysis of combinations of planning algorithms and single-step retrosynthesis models, their performance across different datasets, and various practical metrics, our study provides a more comprehensive evaluation of retrosynthetic planning strategies. These insights contribute to a better understanding of computational retrosynthesis and its alignment with real-world applicability.

We provide extended results for the retrosynthesis task. We also introduce the notion of feasibility, which captures the validity and practicality of retrosynthetic routes in real-world settings.
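Solvability can be illustrated with a small sketch: a route counts as solved when every leaf of its tree is a purchasable building block, and solvability is the fraction of targets with a solved route. The nested-dict route representation, the `stock` set, and the function names are assumptions for illustration. Route feasibility is not computed here because the abstract does not define how it is scored.

```python
from typing import Dict, List, Optional, Set

Route = Dict[str, object]  # {"smiles": str, "children": [Route, ...]}


def is_solved(route: Route, stock: Set[str]) -> bool:
    """A route is solved if all of its leaf molecules are in the stock set."""
    children: List[Route] = route.get("children", [])  # type: ignore[assignment]
    if not children:
        return route["smiles"] in stock
    return all(is_solved(child, stock) for child in children)


def solvability(routes: List[Optional[Route]], stock: Set[str]) -> float:
    """Fraction of targets with a complete route; None means no route found."""
    solved = sum(1 for r in routes if r is not None and is_solved(r, stock))
    return solved / len(routes)


if __name__ == "__main__":
    stock = {"CCO", "CC(=O)O"}  # hypothetical purchasable building blocks
    ester = {
        "smiles": "CC(=O)OCC",
        "children": [{"smiles": "CCO"}, {"smiles": "CC(=O)O"}],
    }
    dead_end = {"smiles": "CC1CC1", "children": [{"smiles": "C1CC1"}]}
    print(solvability([ester, dead_end, None], stock))
```

A planner can score well on this metric while still proposing steps that are impractical at the bench, which is the gap the feasibility criterion is meant to expose.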


Citations: 0

Comment on “Advancing material property prediction: using physics-informed machine learning models for viscosity”

IF 5.7, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01070-9

Maximilian Fleck, Samir Darouich, Marcelle B. M. Spera, Niels Hansen

When data availability is limited, predicting properties through purely data-driven machine learning (ML) is challenging. Integrating physically based modeling techniques into ML methods may lead to better performance. In a recent work by Chew et al. (“Advancing material property prediction: using physics-informed machine learning models for viscosity”), descriptors from classical molecular dynamics (MD) simulations were incorporated into a quantitative structure–property relationship to accurately predict the temperature-dependent viscosity of pure liquids. Through feature importance analysis, the authors found that heat of vaporization was the most relevant descriptor for the prediction of viscosity. In this comment, we discuss the physical origin of this finding by referring to Eyring's rate theory, and develop an alternative modeling approach using a thermodynamic-based architecture that requires less input data.
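As background for the physical argument, Eyring's rate theory is commonly written in the textbook form below. This is a sketch of the standard relation, not the model developed in the comment. The symbols ($\tilde{V}$ for molar volume, $\Delta \tilde{G}^{\ddagger}$ for the molar activation free energy of flow, $\Delta \tilde{U}_{\mathrm{vap}}$ for the molar internal energy of vaporization) and the empirical factor 0.408 come from standard transport-phenomena texts and are assumptions relative to the abstract.

```latex
\eta = \frac{N_A h}{\tilde{V}} \exp\!\left( \frac{\Delta \tilde{G}^{\ddagger}}{R T} \right),
\qquad
\Delta \tilde{G}^{\ddagger} \approx 0.408 \, \Delta \tilde{U}_{\mathrm{vap}}
```

Under this approximation the activation energy for viscous flow scales with the energy of vaporization, which is consistent with heat of vaporization emerging as the most relevant descriptor.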


Citations: 0

Systematic benchmarking of 13 AI methods for predicting cyclic peptide membrane permeability

IF 5.7, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-28 DOI: 10.1186/s13321-025-01083-4

Wei Liu, Jianguo Li, Chandra S. Verma, Hwee Kuan Lee

Cyclic peptides are promising drug candidates due to their ability to modulate intracellular protein–protein interactions, a property often inaccessible to small molecules. However, their typically poor membrane permeability limits therapeutic applicability. Accurate computational prediction of permeability can accelerate the identification of cell-permeable candidates, reducing reliance on time-consuming and costly experimental screening. Although deep learning has shown potential in predicting molecular properties, its application in permeability prediction remains underexplored. A systematic evaluation of these models is important to assess current capabilities and guide future development. In this study, we conduct a comprehensive benchmark of 13 machine learning models for predicting cyclic peptide membrane permeability. These models cover four types of molecular representations: fingerprints, SMILES strings, molecular graphs, and 2D images. We use experimentally measured PAMPA permeability data from the CycPeptMPDB database, comprising nearly 6000 cyclic peptides, and evaluate performance across three prediction tasks: regression, binary classification, and soft-label classification. Two data-splitting strategies, random split and scaffold split, are used to assess the generalizability of trained models. Our results show that model performance depends strongly on molecular representation and model architecture. Graph-based models, particularly the Directed Message Passing Neural Network (DMPNN), consistently achieve top performance across tasks. Regression generally outperforms classification. Scaffold-based splitting, although intended to more rigorously assess generalization, yields substantially lower model generalizability compared to random splitting. Comparing prediction errors with experimental variability highlights the practical value of current models while also indicating room for further improvement.
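The difference between the two splitting strategies comes down to whether molecules sharing a scaffold may appear on both sides of the split. Below is a minimal sketch of a scaffold split, assuming scaffold keys have already been computed for each molecule (for example as Bemis-Murcko scaffolds with a cheminformatics toolkit). The greedy largest-group-first assignment and the function name are assumptions, not the benchmark's procedure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def scaffold_split(scaffolds: List[str], train_frac: float = 0.8) -> Tuple[List[int], List[int]]:
    """Split sample indices so that no scaffold occurs in both train and test.

    Scaffold groups are assigned greedily, largest first, to the training set
    until adding another group would exceed the requested training fraction.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    cutoff = train_frac * len(scaffolds)

    train: List[int] = []
    test: List[int] = []
    for group in ordered:
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


if __name__ == "__main__":
    keys = ["A", "A", "A", "B", "B", "C", "D", "A", "B", "E"]  # toy scaffold labels
    train_idx, test_idx = scaffold_split(keys)
    print(train_idx, test_idx)
```

Since test scaffolds are unseen during training, scores under this split are expected to be lower than under a random split.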


Citations: 0

xBitterT5: an explainable transformer-based framework with multimodal inputs for identifying bitter-taste peptides

IF 5.7, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-20 DOI: 10.1186/s13321-025-01078-1

Nguyen Doan Hieu Nguyen, Nhat Truong Pham, Duong Thanh Tran, Leyi Wei, Adeel Malik, Balachandran Manavalan

Bitter peptides (BPs), derived from the hydrolysis of proteins in food, play a crucial role in both food science and biomedicine by influencing taste perception and participating in various physiological processes. Accurate identification of BPs is essential for understanding food quality and potential health impacts. Traditional machine learning approaches for BP identification have relied on conventional feature descriptors, achieving moderate success but struggling with the complexities of biological sequence data. Recent advances using protein language model embeddings and meta-learning approaches have improved accuracy, but they frequently neglect the molecular representations of peptides and lack interpretability. In this study, we propose xBitterT5, a novel multimodal and interpretable framework for BP identification that integrates pretrained transformer-based embeddings from BioT5+ with a combination of the peptide sequence and its SELFIES molecular representation. By incorporating both peptide sequences and their molecular strings, xBitterT5 demonstrates superior performance compared to previous methods on the same benchmark datasets. Importantly, the model provides residue-level interpretability, highlighting chemically meaningful substructures that contribute significantly to bitterness, thus offering mechanistic insights beyond black-box predictions. A user-friendly web server (https://balalab-skku.org/xBitterT5/) and a standalone version (https://github.com/cbbl-skku-org/xBitterT5/) are freely available to support both computational biologists and experimental researchers in peptide-based food and biomedicine.


Citations: 0

ReactionT5: a pre-trained transformer model for accurate chemical reaction prediction with limited data

IF 5.7, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-19 DOI: 10.1186/s13321-025-01075-4

Tatsuya Sagawa, Ryosuke Kojima

Accurate chemical reaction prediction is critical for reducing both cost and time in drug development. This study introduces ReactionT5, a transformer-based chemical reaction foundation model pre-trained on the Open Reaction Database—a large publicly available reaction dataset. In benchmarks for product prediction, retrosynthesis, and yield prediction, ReactionT5 outperformed existing models. Specifically, ReactionT5 achieved 97.5% accuracy in product prediction, 71.0% in retrosynthesis, and a coefficient of determination of 0.947 in yield prediction. Remarkably, ReactionT5, when fine-tuned with only a limited dataset of reactions, achieved performance on par with models fine-tuned on the complete dataset. Additionally, the visualization of ReactionT5 embeddings illustrates that the model successfully captures and represents the chemical reaction space, indicating effective learning of reaction properties.
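For readers less familiar with the yield-prediction metric, the coefficient of determination compares the model's squared error with that of always predicting the mean. The sketch below uses the standard definition; the function name and the toy yields are illustrative assumptions.

```python
from typing import List


def r_squared(y_true: List[float], y_pred: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0.0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    yields = [10.0, 50.0, 90.0]  # hypothetical measured yields (%)
    print(r_squared(yields, [12.0, 48.0, 91.0]))
```

A value of 1 means perfect prediction and 0 means no better than predicting the mean, so the reported 0.947 indicates that most of the variance in yield is explained.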


Citations: 0

Improving drug-induced liver injury prediction using graph neural networks with augmented graph features from molecular optimisation

IF 5.7 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-18 DOI: 10.1186/s13321-025-01068-3

Taeyeub Lee, Joram M. Posma

Purpose

Drug-induced liver injury (DILI) is a significant concern in drug development, often leading to the discontinuation of clinical trials and the withdrawal of drugs from the market. This study explores the application of graph neural networks (GNNs) for DILI prediction, using molecular graph representations as the primary input.

Methods

We evaluated several GNN architectures, including Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), Graph Sample and Aggregation (GraphSAGE), and Graph Isomorphism Networks (GINs), using the latest FDA DILI dataset and other molecular property prediction datasets. We introduce a novel approach that creates a custom graph dataset, driven by molecular optimisation, that incorporates detailed and realistic chemical features such as bond lengths and partial charges as input into the GNN models. We have named our model approach DILIGeNN.
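To make the idea of a graph carrying bond lengths and partial charges concrete, the following is a minimal sketch of such a data structure. It is not the DILIGeNN pipeline: the `Atom` class, the `build_graph` function, and the water coordinates and charges are all hypothetical, and the sketch assumes the 3D geometry and charges have already been produced by some optimisation and charge-assignment step.

```python
import math
from dataclasses import dataclass


@dataclass
class Atom:
    symbol: str
    xyz: tuple             # 3D coordinates in angstroms (assumed already optimised)
    partial_charge: float  # assumed precomputed by some charge model


def build_graph(atoms, bonds):
    """Return (node_features, edge_index, edge_features) for one molecule.

    `bonds` is a list of (i, j) atom-index pairs. Each bond is stored in both
    directions, as message-passing GNN libraries commonly expect.
    """
    node_features = [[a.partial_charge] for a in atoms]
    edge_index, edge_features = [], []
    for i, j in bonds:
        # Bond length is the Euclidean distance between the two atoms.
        length = math.dist(atoms[i].xyz, atoms[j].xyz)
        for src, dst in ((i, j), (j, i)):
            edge_index.append((src, dst))
            edge_features.append([length])
    return node_features, edge_index, edge_features


# Toy water molecule; coordinates and charges are illustrative values only.
water = [
    Atom("O", (0.000, 0.000, 0.000), -0.8),
    Atom("H", (0.957, 0.000, 0.000), 0.4),
    Atom("H", (-0.240, 0.927, 0.000), 0.4),
]
nodes, edges, edge_feats = build_graph(water, [(0, 1), (0, 2)])
```

The three returned lists correspond to the node feature matrix, edge list, and edge attribute matrix that GNN layers such as GCN, GAT, GraphSAGE, or GIN typically consume. A real featurisation would usually add further node features such as element type.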

Results

DILIGeNN achieved an AUC of 0.897 on the DILI dataset, surpassing the current state-of-the-art model in the DILI prediction task. Furthermore, DILIGeNN outperformed the state-of-the-art in other graph-based molecular prediction tasks, achieving an AUC of 0.918 on the Clintox dataset, 0.993 on the BBBP dataset, and 0.953 on the BACE dataset, indicating strong generalisation and performance across different datasets.
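The AUC values above can be read as the probability that a randomly chosen positive compound receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows. The function name `roc_auc`, the labels, and the scores are made up for illustration, and the quadratic loop is chosen for clarity, not speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toxicity labels (1 = toxic) and model scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
```

In this toy example five of the six positive-negative pairs are ordered correctly, so the AUC is 5/6. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.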

Conclusion

DILIGeNN, utilising a single graph representation as input, outperforms the state-of-the-art methods in DILI prediction that incorporate both molecular fingerprint and graph-structured data. These findings highlight the effectiveness of our molecular graph generation and GNN training approach as a powerful tool for early-stage drug development and drug repurposing pipelines.

Scientific Contribution: DILIGeNN is a GNN framework that extracts graph features from 3D optimised molecular structures, as is done in target-based drug discovery and molecular docking simulation. Our method is the first to encode spatial and electrostatic information into a single graph representation, as opposed to other work that requires multiple graphs or additional chemical descriptors for feature representation. Our approach, using warm starts following repeated early stopping during training, outperforms the current state-of-the-art methods in liver toxicity (DILI), permeability (BBBP) and activity (BACE) prediction tasks.


Citations: 0

Cycle-configuration descriptors: a novel graph-theoretic approach to enhancing molecular inference

IF 5.7 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2025-08-18 DOI: 10.1186/s13321-025-01042-z

Bowen Song, Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

Inferring molecules with desired activities or properties is one of the key challenges in cheminformatics and bioinformatics. For that purpose, our research group has recently developed mol-infer, a state-of-the-art framework for molecular inference. This framework first constructs a prediction function for a fixed property using machine learning models; the function is then simulated by mixed-integer linear programming to infer desired molecules. The accuracy of the framework relies heavily on the representation power of the descriptors. In this study, we highlight a typical class of non-isomorphic chemical graphs with reasonably different property values that cannot be distinguished by the standard “two-layered (2L) model” of mol-infer. To address this distinguishability problem of the 2L model, we propose a novel family of descriptors, named cycle-configuration (CC), which captures the ortho/meta/para patterns that appear in aromatic rings, something the framework could not previously represent. Extensive computational experiments show that with the new descriptors we can construct prediction functions with performance similar to or better than that of our previous studies for all 44 tested chemical properties, comprising 27 regression datasets and 17 classification datasets, confirming the effectiveness of the CC descriptors. For inference, we also provide a system of linear constraints that formulates the CC descriptors. We demonstrate that a chemical graph with up to 50 non-hydrogen vertices can be inferred within a practical time frame.
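The distinguishability problem can be illustrated with the three xylene isomers. The sketch below is not the paper's CC descriptor, and `degree_counts` is not the 2L model's feature set. It is a deliberately crude, hypothetical local descriptor used only to show how purely local counts can coincide for isomers that differ in their ortho/meta/para arrangement, while the cyclic separation between substituents tells them apart. All function names and the ring-position encoding are assumptions made for this example.

```python
from collections import Counter

# Cyclic separation between two substituents on a six-membered ring.
PATTERNS = {1: "ortho", 2: "meta", 3: "para"}


def cyclic_separation(pos_a, pos_b, ring_size=6):
    """Shortest distance between two ring positions, walking around the ring."""
    d = (pos_a - pos_b) % ring_size
    return min(d, ring_size - d)


def substitution_pattern(pos_a, pos_b):
    """Name the pattern of two substituents on a six-membered ring."""
    sep = cyclic_separation(pos_a, pos_b)
    if sep == 0:
        raise ValueError("substituents must sit on different ring atoms")
    return PATTERNS[sep]


def degree_counts(substituted, ring_size=6):
    """Crude local descriptor: how many ring atoms have degree 2 versus 3."""
    return Counter(3 if i in substituted else 2 for i in range(ring_size))


# Ring positions (0-5) of the two methyl groups in each isomer.
xylenes = {"o-xylene": (0, 1), "m-xylene": (0, 2), "p-xylene": (0, 3)}
local = {name: degree_counts(set(pos)) for name, pos in xylenes.items()}
patterns = {name: substitution_pattern(*pos) for name, pos in xylenes.items()}
```

Here `local` holds the same counts for all three isomers (four ring atoms of degree 2 and two of degree 3), whereas `patterns` assigns each isomer a different label. A descriptor that records how substituents are arranged around a cycle adds this kind of information.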


Citations: 0