
Journal of Cheminformatics: Latest Articles

Multi-modal contrastive drug synergy prediction model guided by single modality
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-26 DOI: 10.1186/s13321-025-01087-0
Tong Luo, Zheng Zhang, Xian-gan Chen, Zhi Li

Compared to monotherapy, drug combinations exhibit stronger efficacy, fewer side effects, and lower drug resistance in cancer treatment. However, traditional wet-lab methods for screening synergistic drug combinations are both costly and inefficient. Recently, the emergence of multiple drug synergy databases has promoted the development of various drug synergy prediction methods. Many of these methods use multimodal data and achieve good results. However, giving the modalities equal consideration without taking into account the differences between their features may lead to less effective multi-modal learning. We propose a multi-modal contrastive learning method for drug synergy prediction, named MCDSP. Specifically, MCDSP extracts entity embedding features of drugs and cell lines from heterogeneous graphs, while leveraging molecular fingerprints and gene expression features as biomolecular features for drugs and cell lines. These two types of features serve as two modalities. Under the guidance of single-modality prediction tasks, we evaluate the relevant information of each modality. Through contrastive learning, the prediction bias of the two modalities is reduced, which improves the quality of the multi-modal features. Experiments show that MCDSP outperforms baseline methods on large datasets and performs well in handling unknown drug combinations and cell lines. MCDSP has demonstrated significant effectiveness in predicting drug synergy.

This study uses contrastive learning to align the features of the two modalities with their effective components, which improves the quality of the multi-modal features and significantly improves the performance of the drug synergy prediction model. The use of contrastive learning guided by single-modality prediction tasks distinguishes this work from previous studies and provides a new tool for drug synergy prediction.
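The cross-modal alignment described above can be illustrated with an InfoNCE-style contrastive loss, in which the graph-derived embedding of a sample is pulled towards the biomolecular embedding of the same sample and pushed away from those of other samples. This is only a minimal sketch under stated assumptions: the abstract does not give MCDSP's actual loss, the single-modality guidance is not modelled, and the names `info_nce`, `graph_emb`, `bio_emb`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(graph_emb, bio_emb, temperature=0.1):
    """InfoNCE-style loss: row i of graph_emb should match row i of bio_emb.

    Illustrative only; not the loss used by MCDSP.
    """
    losses = []
    for i, g in enumerate(graph_emb):
        logits = [cosine(g, b) / temperature for b in bio_emb]
        # Numerically stable log-sum-exp over all candidate pairs.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching (positive) pair.
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two samples, two dimensions.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when the two modalities agree on which samples belong together (`aligned`) and large when they disagree (`shuffled`), which is the signal a contrastive objective uses to reduce disagreement between modalities.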
Citations: 0
PKSmart: an open-source computational model to predict intravenous pharmacokinetics of small molecules
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-26 DOI: 10.1186/s13321-025-01066-5
Srijit Seal, Maria-Anna Trapotsi, Manas Mahale, Vigneshwari Subramanian, Nigel Greene, Ola Spjuth, Andreas Bender

Drug exposure, a key determinant of drug safety and efficacy, is governed by pharmacokinetic (PK) parameters such as volume of distribution (VDss), clearance (CL), half-life (t½), fraction unbound in plasma (fu), and mean residence time (MRT). In this study, we developed machine learning models to predict human PK parameters for 1,283 unique compounds using molecular structure, physicochemical properties, and predicted animal PK data. Our approach involved a two-stage modeling pipeline. First, we trained models to predict rat, dog, and monkey PK parameters (VDss, CL, fu) from chemical structure and properties for 371 compounds. These models were used to predict animal PK values for 1,283 unique compounds with human PK data. These animal PK predictions were then integrated with molecular descriptors and fingerprints to build Random Forest models for human PK parameters. The models demonstrated consistent performance across nested cross-validation and external validation sets, with predictive accuracy for VDss comparable to proprietary models developed by AstraZeneca. Notably, human VDss and CL predictions achieved external R² values of 0.39 and 0.46, respectively. To support broad accessibility and integration into early drug discovery workflows such as Design-Make-Test-Analyze (DMTA), we developed PKSmart (https://broad.io/PKSmart), a freely available web application. All code and models are also open source, enabling local deployment. To our knowledge, this represents the first public suite of PK prediction models with performance on par with industry standard models.

This study introduces the first publicly available pharmacokinetic (PK) models that match industry-standard predictions, utilizing molecular structural fingerprints, physicochemical properties, and predicted animal PK data to model human pharmacokinetics. Our approach is validated through repeated nested cross-validation and an external test set, including comparing predictions to an industry standard model. The models are released via a web-hosted application (https://broad.io/PKSmart) for wider accessibility and utility in drug development processes.
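The external R² values quoted above are coefficients of determination, which compare a model's squared error against that of always predicting the mean of the observed values. The sketch below shows the standard formula, R² = 1 − SS_res / SS_tot, on made-up numbers; the function name `r_squared` and the toy data are assumptions for illustration and are not taken from the PKSmart code base.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical values (e.g. a PK parameter on some transformed scale).
observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, observed)               # predictions match exactly
mean_only = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
```

Read this way, an external R² of 0.39 means the model removes about 39% of the squared error that a mean-only predictor would make on the external set.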

Citations: 0
Evaluating ligand docking methods for drugging protein–protein interfaces: insights from AlphaFold2 and molecular dynamics refinement
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-25 DOI: 10.1186/s13321-025-01067-4
Jordi Gómez Borrego, Marc Torrent Burgas

Advances in docking protocols have significantly enhanced the field of protein–protein interaction (PPI) modulation, with AlphaFold2 (AF2) and molecular dynamics (MD) refinements playing pivotal roles. This study evaluates the performance of AF2 models against experimentally solved structures in docking protocols targeting PPIs. Using a dataset of 16 interactions with validated modulators, we benchmarked eight docking protocols, revealing similar performance between native and AF2 models. Local docking strategies outperformed blind docking, with TankBind_local and Glide providing the best results across the structural types tested. MD simulations and other ensemble generation algorithms, such as AlphaFlow, refined both native and AF2 models, improving docking outcomes but showing significant variability across conformations. These results suggest that, while structural refinement can enhance docking in some cases, overall performance appears to be constrained by limitations in scoring functions and docking methodologies. Although protein ensembles can improve virtual screening, predicting the most effective conformations for docking remains a challenge. These findings support the use of AF2-generated structures in docking protocols targeting PPIs and highlight the need to improve current scoring methodologies.

This study provides a systematic benchmark of docking protocols applied to protein–protein interactions (PPIs) using both experimentally solved structures and AlphaFold2 models. By integrating molecular dynamics ensembles and AlphaFlow-generated conformations, we show that structural refinement improves docking outcomes in selected cases, but overall performance remains constrained by docking scoring function limitations. Our analysis shows that AlphaFold2 models perform comparably to native structures in PPI docking, validating their use when experimental data are unavailable. These results establish a reference framework for future PPI-focused virtual screening and underscore the need for improved scoring functions and ensemble-based approaches to better exploit emerging structural prediction tools.
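The variability across conformations noted above can be made concrete with a small bookkeeping sketch for ensemble docking: one ligand is docked against several receptor conformations, and the best score and the spread of scores are reported. Everything here is an assumption for illustration: the scores are invented, the convention that lower scores are better is assumed, and `summarize_ensemble` is a hypothetical helper, not part of any docking program evaluated in the study.

```python
from statistics import pstdev


def summarize_ensemble(scores_by_conformation):
    """Summarize docking scores for one ligand across receptor conformations.

    scores_by_conformation maps a conformation label to a docking score,
    where a lower score is assumed to mean a better predicted pose.
    """
    best_conf = min(scores_by_conformation, key=scores_by_conformation.get)
    values = list(scores_by_conformation.values())
    return {
        "best_conformation": best_conf,
        "best_score": scores_by_conformation[best_conf],
        # Population standard deviation as a simple measure of variability.
        "spread": pstdev(values),
    }


# Invented scores for one ligand against three receptor conformations.
scores = {"native": -6.0, "md_frame_1": -8.0, "md_frame_2": -4.0}
summary = summarize_ensemble(scores)
```

A large spread relative to the differences between ligands is one way to see why choosing the most effective conformation in advance is difficult: the ranking of a ligand can depend more on the frame than on the ligand itself.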

Citations: 0
Cache: Utilizing ultra-large library screening in Rosetta to identify novel binders of the WD-repeat domain of Leucine-Rich Repeat Kinase 2
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-25 DOI: 10.1186/s13321-025-01084-3
Fabian Liessmann, Paul Eisenhuth, Alexander Fürll, Oanh Vu, Rocco Moretti, Jens Meiler

In this study, we present a pipeline for identifying novel ligands targeting the Tryptophan-Aspartate-Repeat domain 40 (WDR40) of Leucine-Rich Repeat Kinase 2 (LRRK2), a protein associated with Parkinson’s disease, as part of the first Critical Assessment of Computational Hit-finding Experiments (CACHE) challenge, a blind benchmark experiment for drug discovery. Mutations in this protein are the most common genetic cause of familial Parkinson’s disease, yet this target remains understudied. We conducted an ultra-large library screening (ULLS) of the Enamine REAL space using a newly developed evolutionary algorithm, RosettaEvolutionaryLigand (REvoLd), which allows for efficient screening of combinatorial compound libraries. The protocol involved refining the target structure with molecular dynamics simulations, identifying a binding site via blind docking, and optimizing compounds through REvoLd, culminating in a manual selection amongst the top-scoring REvoLd hits. A single binder molecule was identified that was derived from the combination of two Enamine building blocks. In the second round, derivatives of the hit compound were used as input for REvoLd to further sample within the Enamine REAL space. Ultimately, a total of five molecules were identified, of which three show a measurable dissociation constant (K_D) better than 150 μM, showcasing the effectiveness of this approach. However, it also highlighted shortcomings, such as the preference for nitrogen-rich rings in the RosettaLigand scoring function.

We introduce the first real-world application for REvoLd, an evolutionary docking algorithm enabling efficient ultra-large library screening for flexible protein targets. Our approach identified novel binders for the WDR40 domain of LRRK2 within the CACHE challenge #1, representing the first prospective validation of REvoLd. Here, we present a preparation pipeline to allow exploration of a large protein pocket with unspecific binding areas, and unlike prior brute-force docking efforts, our method integrates receptor flexibility and combinatorial chemistry optimization.
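To give a feel for why an evolutionary search suits a combinatorial library, the toy sketch below evolves pairs of building-block indices instead of scoring every possible pair. It is a generic genetic-algorithm sketch, not REvoLd itself: the selection, crossover, and mutation rules, the parameter values, and the `toy_score` objective (standing in for a docking score, lower is better) are all assumptions made for illustration.

```python
import random


def evolve(n_a, n_b, score, generations=20, pop_size=10, seed=0):
    """Toy evolutionary search over pairs (i, j) of building-block indices."""
    rng = random.Random(seed)
    population = [(rng.randrange(n_a), rng.randrange(n_b)) for _ in range(pop_size)]
    history = []  # best score seen at the start of each generation
    for _ in range(generations):
        population.sort(key=score)
        history.append(score(population[0]))
        parents = population[: pop_size // 2]  # elitism: the best half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])  # crossover: one block from each parent
            if rng.random() < 0.3:  # mutation: swap in one random building block
                if rng.random() < 0.5:
                    child = (rng.randrange(n_a), child[1])
                else:
                    child = (child[0], rng.randrange(n_b))
            children.append(child)
        population = parents + children
    population.sort(key=score)
    history.append(score(population[0]))
    return population[0], history


def toy_score(pair):
    # Hypothetical objective with its optimum at building blocks (7, 3).
    return abs(pair[0] - 7) + abs(pair[1] - 3)


best, history = evolve(n_a=50, n_b=50, score=toy_score)
```

With 50 × 50 = 2,500 possible pairs, this run scores only a small population per generation, and because the best half always survives, the best score in `history` never gets worse from one generation to the next.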

Citations: 0
Contrastive explanations for machine learning predictions in chemistry
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-23 DOI: 10.1186/s13321-025-01100-6
Alec Lamens, Jürgen Bajorath

The concept of contrastive explanations originating from human reasoning is used in explainable artificial intelligence. In machine learning, contrastive explanations relate alternative prediction outcomes to each other involving the identification of features leading to opposing model decisions. We introduce a methodological framework for deriving contrastive explanations for machine learning models in chemistry to systematically generate intuitive explanations of predictions in high-dimensional feature spaces. The molecular contrastive explanations (MolCE) methodology explores alternative model decisions by generating virtual analogues of test compounds through replacements of molecular building blocks and quantifies the degree of “contrastive shifts” resulting from changes in model probability distributions. In a proof-of-concept study, MolCE was applied to explain selectivity predictions of ligands of D2-like dopamine receptor isoforms.
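One simple way to picture a "contrastive shift" is to compare the model's class probabilities for a test compound with those for a virtual analogue and ask how far the balance moved from the predicted class (the fact) towards an alternative class (the foil). The sketch below does exactly that on invented numbers. The abstract does not define the MolCE metric, so this margin-based definition, the function `contrastive_shift`, the analogue names, and the probabilities are all assumptions.

```python
def contrastive_shift(p_original, p_analogue, fact, foil):
    """Change in the foil-minus-fact probability margin after a replacement.

    Positive values mean the analogue moved the model towards the foil class.
    Illustrative definition only; not the metric defined by MolCE.
    """
    margin_before = p_original[foil] - p_original[fact]
    margin_after = p_analogue[foil] - p_analogue[fact]
    return margin_after - margin_before


# Invented selectivity probabilities for a test compound and two analogues,
# each analogue made by replacing one (hypothetical) building block.
original = {"D2": 0.70, "D3": 0.20, "D4": 0.10}
analogues = {
    "replace_ring": {"D2": 0.30, "D3": 0.60, "D4": 0.10},
    "replace_linker": {"D2": 0.65, "D3": 0.25, "D4": 0.10},
}

shifts = {
    name: contrastive_shift(original, probs, fact="D2", foil="D3")
    for name, probs in analogues.items()
}
# Rank replacements by how strongly they push the prediction towards the foil.
ranked = sorted(shifts, key=shifts.get, reverse=True)
```

In this toy case the ring replacement flips the preferred isoform while the linker replacement barely moves it, so the ring would be highlighted as the part of the structure that explains "why D2 rather than D3".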

Citations: 0
Cheminformatics Microservice V3: a web portal for chemical structure manipulation and analysis
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-23 DOI: 10.1186/s13321-025-01094-1
Kohulan Rajan, Venkata Chandrasekhar, Nisha Sharma, Sri Ram Sagar Kanakam, Felix Baensch, Christoph Steinbeck

The widespread adoption of open-source cheminformatics toolkits remains constrained by technical implementation barriers, including complex installation procedures, dependency management, and integration challenges. Here, we present Cheminformatics Microservice V3, a significant update to the existing platform that provides unified programmatic access to cheminformatics libraries, including RDKit, Chemistry Development Kit (CDK), and Open Babel through a RESTful API framework. This latest version features a newly developed, interactive web-based frontend built with React, providing users with an intuitive graphical interface for manipulating and analysing chemical structures. The frontend supports essential cheminformatics operations, including structure editing, PubChem database integration, batch molecular processing, and standardised InChI/RInChI identifier generation. The microservice V3 addresses critical accessibility barriers in computational chemistry by providing researchers with immediate access to analytical tools, eliminating the need for specialised technical expertise or complex software installations. This approach facilitates reproducible research workflows and broadens the utilisation of cheminformatics methodologies across interdisciplinary research communities. The platform is publicly accessible at https://app.naturalproducts.net, and the complete source code and documentation are available on GitHub.
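When a structure is sent to a RESTful service as a query parameter, the SMILES string has to be URL-encoded, because characters such as `#`, `+`, `=`, and parentheses have their own meaning in URLs. The sketch below only builds such a request URL and makes no network call. The base URL `https://example.org/api`, the path `/convert/inchi`, and the parameter names `smiles` and `toolkit` are placeholders chosen for illustration, not the documented interface of the Cheminformatics Microservice; the real endpoints should be taken from the project's own documentation.

```python
from urllib.parse import urlencode


def build_request_url(base_url, path, smiles, toolkit):
    """Build a GET URL with the SMILES string safely URL-encoded.

    base_url, path, and the parameter names are hypothetical placeholders.
    """
    query = urlencode({"smiles": smiles, "toolkit": toolkit})
    return f"{base_url.rstrip('/')}{path}?{query}"


# Acetic acid; parentheses and '=' must be percent-encoded in the query.
url = build_request_url(
    base_url="https://example.org/api",  # placeholder, not the real service
    path="/convert/inchi",               # placeholder endpoint path
    smiles="CC(=O)O",
    toolkit="rdkit",
)
```

Passing the raw SMILES without encoding would be fragile: a triple bond written as `#` would be read as the start of a URL fragment, and a `+` charge would be decoded as a space.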

Citations: 0
A comprehensive landscape of AI applications in broad-spectrum drug interaction prediction: a systematic review
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-19 DOI: 10.1186/s13321-025-01093-2
Nour H. Marzouk, Sahar Selim, Mustafa Elattar, Mai S. Mabrouk, Mohamed Mysara

In drug development, managing interactions such as drug–drug, drug–disease, and drug–nutrient interactions is critical for ensuring the safety and efficacy of pharmacological treatments. These interactions often overlap, forming a complex, interconnected landscape that necessitates accurate prediction to improve patient outcomes and support evidence-based care. Recent advances in artificial intelligence (AI), powered by large-scale datasets (e.g., DrugBank, TWOSIDES, SIDER), have significantly enhanced interaction prediction. Machine learning, deep learning, and graph-based models show great promise, but challenges persist, including data imbalance, noisy sources, limited explainability, and underrepresentation of certain types of interactions. This systematic review of 147 studies (2018–2024) is the first to comprehensively map AI applications across major interaction types. We present a detailed taxonomy of models and datasets, emphasizing the growing roles of large language models and knowledge graphs in overcoming key limitations. Their integration, alongside explainable AI tools, enhances transparency, paving the way for AI-driven systems that proactively mitigate adverse interactions. By identifying the most promising approaches and critical research gaps, this review lays the groundwork for advancing more robust, interpretable, and personalized models for drug interaction prediction.

Citations: 0
MetaboGNN: predicting liver metabolic stability with graph neural networks and cross-species data
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-03 DOI: 10.1186/s13321-025-01089-y
Jun Hyeong Park, Ri Han, Junbo Jang, Jisan Kim, Joonki Paik, Jaesung Heo, Yoonji Lee

The metabolic stability of a drug is a crucial determinant of its pharmacokinetic properties, including clearance, half-life, and oral bioavailability. Accurate predictions of metabolic stability can significantly streamline the drug discovery process. In this study, we present MetaboGNN, an advanced model for predicting liver metabolic stability based on Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). Using a high-quality dataset from the 2023 South Korea Data Challenge for Drug Discovery, which comprises 3,498 training molecules and 483 test molecules, we represented molecular structures as graphs to capture the intricate structural relationships that influence metabolic stability. A GCL-driven pretraining step was employed to enhance model generalizability by learning robust, transferable graph-level representations. Notably, incorporating interspecies differences between human liver microsomes (HLM) and mouse liver microsomes (MLM) further improved predictive accuracy, achieving Root Mean Square Error (RMSE) values of 27.91 (HLM) and 27.86 (MLM), both expressed as the percentage of parent compound remaining after a 30-min incubation. Compared to traditional approaches, MetaboGNN demonstrates superior predictive performance and highlights the importance of considering interspecies enzymatic variations. In addition, attention-based analysis identified key molecular fragments associated with metabolic stability, highlighting chemically meaningful structural determinants. These findings establish MetaboGNN as a powerful tool for metabolic stability prediction, supporting more efficient lead optimization processes in drug discovery.
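
For readers unfamiliar with the metric, the RMSE quoted above is the square root of the mean squared difference between measured and predicted values, here in units of percent parent compound remaining. The sketch below is only an illustration of that definition: the function name `rmse` and the three sample values are hypothetical and are not taken from the challenge dataset or from MetaboGNN.

```python
import math
from typing import Sequence


def rmse(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean of the squared residuals, then the square root.
    squared_errors = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example values: percent of parent compound remaining after incubation.
measured_percent = [80.0, 50.0, 20.0]
predicted_percent = [70.0, 60.0, 20.0]
example_rmse = rmse(measured_percent, predicted_percent)
```

Because the error is expressed in the same units as the target, an RMSE near 28 means predictions are off by roughly 28 percentage points of remaining compound in a root-mean-square sense.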

Citations: 0
The first South Korean data challenge for drug discovery using human and mouse liver microsomal stability data
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-03 DOI: 10.1186/s13321-025-01047-8
Nam-Chul Cho, SeongEun Hong, Jin Sook Song, EuiJu Yeo, SoI Jung, Yuno Lee, Seul Gee Hwang, Su Min Kang, JaeSung Hwang, Tae-Eun Jin

The Korea Chemical Bank (KCB) has generated a dataset containing metabolic stability data for approximately 4,000 compounds tested on human and mouse liver microsomes. The first South Korea Data Challenge, named the Jump AI Challenge for Drug Discovery (JUMP AI 2023), was launched in 2023 using KCB's metabolic stability data. The objective of JUMP AI 2023 was to promote and encourage the development of new drugs using artificial intelligence (AI) technology in South Korea. A total of 1254 teams participated in the competition, developing algorithms to estimate the percentage of each compound remaining after 30 min of incubation with human and mouse liver microsomes. The data set comprised training and test sets of 3498 and 483 compounds, respectively. This paper provides an overview of JUMP AI 2023 and its outcomes, highlighting the diverse range of algorithms and artificial intelligence technologies employed by the competing teams. Among these, five teams stood out, winning awards with GNN-based approaches. This competition was the first AI competition for drug discovery in South Korea, attracting numerous researchers and playing a key role in promoting drug research through the application of artificial intelligence technologies.

Citations: 0
Box embeddings for extending ontologies: a data-driven and interpretable approach
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-01 DOI: 10.1186/s13321-025-01086-1
Adel Memariani, Martin Glauer, Simon Flügel, Fabian Neuhaus, Janna Hastings, Till Mossakowski

Deriving symbolic knowledge from trained deep learning models is challenging due to the lack of transparency in such models. A promising approach to address this issue is to couple a semantic structure with the model outputs and thereby make the model interpretable. In prediction tasks such as multi-label classification, labels tend to form hierarchical relationships. Therefore, we propose enforcing a taxonomical structure on the model’s outputs throughout the training phase. In vector space, a taxonomy can be represented using axis-aligned hyper-rectangles, or boxes, which may overlap or nest within one another. The boundaries of a box determine the extent of a particular category. Thus, we used box-shaped embeddings of ontology classes to learn and transparently represent logical relationships that are only implicit in multi-label datasets. We assessed our model by measuring its ability to approximate the full set of inferred subclass relations in the ChEBI ontology, which is an important knowledge base in the field of life science. We demonstrate that our model captures implicit hierarchical relationships among labels, ensuring consistency with the underlying ontological conceptualization, while also achieving state-of-the-art performance in multi-label classification. Notably, this is accomplished without requiring an explicit taxonomy during the training process.

Our proposed approach advances chemical classification by enabling interpretable outputs through a structured and geometrically expressive representation of molecules and their classes.
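
The geometric idea in this abstract, that nesting of axis-aligned boxes can stand for subclass relations, can be illustrated with a small sketch. This is not the authors' model or training procedure: the `Box` class, its method names, and the two-dimensional coordinates are hypothetical, chosen only to show how containment and overlap are checked per axis.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned hyper-rectangle given by its lower and upper corners."""
    lower: Tuple[float, ...]
    upper: Tuple[float, ...]

    def contains(self, other: "Box") -> bool:
        # 'other' nests inside 'self' if it lies within the bounds on every axis.
        return all(
            sl <= ol and ou <= su
            for sl, su, ol, ou in zip(self.lower, self.upper, other.lower, other.upper)
        )

    def overlaps(self, other: "Box") -> bool:
        # Two boxes intersect if their intervals intersect on every axis.
        return all(
            max(sl, ol) <= min(su, ou)
            for sl, su, ol, ou in zip(self.lower, self.upper, other.lower, other.upper)
        )


# Hypothetical 2-D boxes standing in for learned class embeddings.
parent_class = Box(lower=(0.0, 0.0), upper=(4.0, 4.0))
child_class = Box(lower=(1.0, 1.0), upper=(2.0, 2.0))
other_class = Box(lower=(3.0, 3.0), upper=(5.0, 5.0))

is_subclass = parent_class.contains(child_class)      # child nests in parent
shares_members = parent_class.overlaps(other_class)   # partial overlap only
```

Under this reading, a nested box corresponds to a subclass, and a partial overlap corresponds to classes that share some members without one subsuming the other, which is what makes the learned representation directly inspectable.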

Citations: 0