Journal of Cheminformatics: Latest Articles

Prediction of UGT-mediated phase II metabolism via ligand- and structure-based predictive models
IF 5.7 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-10-15 | DOI: 10.1186/s13321-025-01097-y
Ludovica Bono, Filippo Lunghini, Emanuela Sabato, Akash Deep Biswas, Angelica Mazzolari, Alessandro Pedretti, Andrea R. Beccari, Giulio Vistoli, Serena Vittorio

The prediction of drug metabolism by in silico techniques is attracting growing interest because large datasets can be processed, allowing the stability and safety of new drug candidates to be evaluated during the early stages of the drug discovery process. To date, in silico models for metabolism prediction mainly exploit the ligand-based (LB) properties of the training molecules to predict the occurrence of a given metabolic reaction and/or the reactive site involved in the biotransformation. However, recent reports have highlighted that structure-based (SB) modeling can be conveniently integrated with LB methods for drug metabolism prediction, with the advantage of predicting whether a given molecule can fit the enzyme active site and which moiety approaches the catalytic residues. Herein, we developed machine learning models for UDP-glucuronosyltransferase (UGT)-mediated metabolism using both LB and SB methods. In particular, this study focused on the UGT2B7 and UGT2B15 isoforms, which are involved in the clearance of many drugs as well as in clinically relevant drug-drug interactions. First, molecular dynamics (MD) and docking simulations were combined to explore the binding mechanism of cofactor and substrate within the catalytic pocket of the studied UGT isoforms, exploiting their AlphaFold structures. The analysis of the MD trajectories allowed an appropriate conformation of both UGT isoforms to be identified for the development of binary classification models. For this purpose, the Random Forest algorithm and the metabolic data extracted from the MetaQSAR database were used. SB models were trained on a set of scoring functions and protein–ligand interaction fingerprints derived from docking, while the LB models were built on a set of physicochemical and constitutional descriptors. When the single models were evaluated, the LB classifiers outperformed the SB models. However, applying a consensus strategy improved prediction accuracy compared with the individual models, highlighting that LB and SB approaches convey complementary information whose aggregation yields better predictions than either single model.

Scientific contribution

Metabolism prediction by in silico methods is a useful tool for assessing the pharmacokinetic profile of new drug candidates in the early stages of drug discovery. This study provides a new computational strategy that integrates ligand- and structure-based methods to predict UGT2B7- and UGT2B15-mediated metabolism using their AlphaFold structures. Combining the two approaches yielded higher performance than the individual ligand- and structure-based predictive models, and also supports the reliability of AlphaFold structures for developing structure-based metabolism prediction models.
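The abstract does not say how the consensus between the LB and SB classifiers is formed. As a rough illustration only, the sketch below assumes the simplest possible rule: average the two predicted substrate probabilities and apply a threshold. The function names, the 0.5 threshold, and the example probabilities are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def consensus_probability(p_lb: float, p_sb: float) -> float:
    """Average the substrate probabilities of the two models (assumed rule)."""
    return mean([p_lb, p_sb])


def consensus_label(p_lb: float, p_sb: float, threshold: float = 0.5) -> int:
    """Return 1 (predicted substrate) if the averaged probability reaches the threshold."""
    return int(consensus_probability(p_lb, p_sb) >= threshold)


# Hypothetical per-molecule probabilities: (LB model, SB model).
examples = {
    "mol_A": (0.80, 0.60),  # both models lean towards "substrate"
    "mol_B": (0.55, 0.20),  # LB is borderline, SB disagrees
    "mol_C": (0.40, 0.70),  # SB rescues a molecule that LB alone would reject
}

labels = {name: consensus_label(lb, sb) for name, (lb, sb) in examples.items()}
print(labels)
```

With this rule, a molecule that one model scores just below the threshold can still be accepted when the other model is confident, which is one way complementary information could change a prediction.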
Citations: 0
Reverse engineering molecules from fingerprints through deterministic enumeration and generative models
IF 5.7 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-10-15 | DOI: 10.1186/s13321-025-01074-5
Philippe Meyer, Thomas Duigou, Guillaume Gricourt, Jean-Loup Faulon

Reverse engineering in molecular design aims to identify optimal structures based on activities, or properties, computed through molecular descriptors like fingerprints. This task is known to be particularly difficult for the widely used Extended-Connectivity Fingerprints (ECFPs), due to significant loss of structural information during vectorization. While recent artificial intelligence-based works have raised awareness about the privacy risks associated with ECFP-based data sharing, we contribute a more conclusive demonstration by introducing a deterministic algorithm that reconstructs molecular structures from ECFPs. Using MetaNetX and eMolecules as databases of natural compounds and commercially available chemicals, the deterministic algorithm benchmarks a Transformer-based generative model trained to predict SMILES from ECFPs. The generative model achieves a top-ranked retrieval accuracy of 95.64% but struggles with exhaustive enumeration. Additionally, applying the deterministic method to a drug dataset reveals its potential for de novo drug design, as many of the reverse-engineered structures are found to be patented or supported by bioassay data.

Scientific contribution

We present a deterministic algorithm that reconstructs molecular structures from ECFP vectors, demonstrating that these fingerprints are reversible. Alongside it, we benchmark a Transformer-based generative model trained to predict SMILES from ECFPs, which shows high accuracy but limited coverage of chemical space. This dual approach advances reverse engineering in molecular design and offers new tools for de novo drug discovery.
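The abstract attributes the difficulty of inverting ECFPs to information lost during vectorization. The toy sketch below illustrates one source of such loss, folding integer identifiers into a short bit vector. It is not the ECFP algorithm and not the authors' method: the identifier sets, the 16-bit length, and the modulo folding are assumptions chosen to keep the example small.

```python
def fold_to_bits(identifiers, n_bits=16):
    """Fold integer substructure identifiers into a fixed-length bit vector."""
    bits = [0] * n_bits
    for ident in identifiers:
        bits[ident % n_bits] = 1  # different identifiers can share one position
    return bits


# Hypothetical identifier sets standing in for hashed atom environments.
mol_x = {3, 21, 40}  # folds to positions 3, 5, 8
mol_y = {19, 5, 8}   # also folds to positions 3, 5, 8

fp_x = fold_to_bits(mol_x)
fp_y = fold_to_bits(mol_y)
print(fp_x == fp_y)
```

Because two different identifier sets produce the same vector here, a single vector can be compatible with more than one candidate structure, which is why an exhaustive enumeration of candidates is a meaningful goal.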
Citations: 0
SMPR: a structure-enhanced multimodal drug‒disease prediction model for drug repositioning and cold start
IF 5.7 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-10-14 | DOI: 10.1186/s13321-025-01085-2
Xin Dong, Rui Miao, Suyan Zhang, Shuaibing Jia, Leifeng Zhang, Yong Liang, Jianhua Zhang, Yi Zhun Zhu

Drug repositioning, the prediction of new drug‒disease relationships, has long been an active field of research. However, actual cases of biologically validated drug repositioning remain very limited, and existing models have not yet fully utilized the structural information of the drug. Furthermore, most repositioning models are only used to complete the relationship matrix, and their practicality is poor when dealing with drug cold start problems. This paper proposes a structure-enhanced multimodal relationship prediction model (SMPR). SMPR is based on the SMILES structure of the drug, uses the Mol2vec method to generate drug embeddings, and learns disease embeddings through heterogeneous-network graph neural networks. Ultimately, a drug‒disease relationship matrix is constructed. In addition, to reduce the difficulty of use, SMPR provides a cold start interface that uses structural similarity together with the repositioning results to predict drug-related diseases simply and quickly. The repositioning ability and cold start capability of the model were verified from multiple perspectives. The AUC and AUPR scores for repositioning reached 99% and 61%, respectively, and the AUC of the cold start method was 80%. In particular, cold start recall can exceed 70%, which means that SMPR is sensitive to positive samples. Finally, case analysis was used to verify the practical value of the model, and visual analysis directly demonstrated the improvement brought by the structural component of the model. For ease of use, we also provide local deployment of the model and package it as an executable program.

Scientific contribution

The SMPR model is a structure-enhanced multimodal learning model that focuses on the reference value of similar structures in practical docking and adds a drug embedding module. Considering that the predictive ability of the model is limited by the dataset, the model provides a cold start interface for the effective prediction of drugs that are not in the dataset. To further facilitate the use by pharmacology workers without programming knowledge and enhance its practical application value, the model was also encapsulated into a local executable program.
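The cold start interface is described only as relying on structural similarity and the repositioning results. One plausible reading, sketched below under that assumption, is a nearest-neighbour lookup: find the most structurally similar known drug and reuse its predicted diseases. The Tanimoto similarity on feature sets, the data layout, and every name in the code are illustrative choices, not details from the paper.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of structural features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cold_start_diseases(query_fp: set, known_drugs: dict, top_k: int = 1) -> list:
    """Collect diseases of the top_k known drugs most similar to the query."""
    ranked = sorted(
        known_drugs.values(),
        key=lambda record: tanimoto(query_fp, record["fp"]),
        reverse=True,
    )
    diseases = []
    for record in ranked[:top_k]:
        for disease in record["diseases"]:
            if disease not in diseases:  # keep order, drop repeats
                diseases.append(disease)
    return diseases


# Hypothetical repositioning results for drugs already in the dataset.
known = {
    "drug_1": {"fp": {1, 2, 3, 4}, "diseases": ["disease_A", "disease_B"]},
    "drug_2": {"fp": {7, 8, 9}, "diseases": ["disease_C"]},
}

new_drug_fp = {1, 2, 3, 5}  # a drug that is not in the dataset
predicted = cold_start_diseases(new_drug_fp, known)
print(predicted)
```

The lookup needs no retraining, which fits the stated goal of predicting diseases for unseen drugs simply and quickly.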

Citations: 0
PURE: policy-guided unbiased REpresentations for structure-constrained molecular generation
IF 5.7 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-10-14 | DOI: 10.1186/s13321-025-01090-5
Abhor Gupta, Barathi Lenin, Sean Current, Rohit Batra, Balaraman Ravindran, Karthik Raman, Srinivasan Parthasarathy

Structure-constrained molecular generation (SCMG) generates novel molecules that are structurally similar to a given molecule and have optimized properties. Deep learning solutions for SCMG are limited in that they are predisposed towards existing knowledge, and they suffer from a natural impedance mismatch: molecules are discrete, while deep learning methods for SCMG often operate in continuous space. Moreover, many task-specific evaluation metrics used during training bias the model towards a particular metric ("metric leakage"). To overcome these shortcomings, we propose Policy-guided Unbiased REpresentations (PURE) for SCMG, which learns within a framework simulating molecular transformations for drug synthesis. PURE combines self-supervised learning with a policy-based reinforcement learning (RL) framework, thereby avoiding the need for external molecular metrics while learning high-quality representations that incorporate an inherent notion of similarity specific to the given task. Along with a semi-supervised training design, PURE utilizes template-based molecular simulations to better explore and navigate the discrete molecular search space. Despite the lack of metric biases, PURE achieves performance competitive with or superior to state-of-the-art methods on multiple benchmarks. Our study emphasizes the importance of reevaluating current approaches for SCMG and developing strategies that naturally align with the problem. Finally, we illustrate how our methodology can be applied to combat drug resistance by identifying sorafenib-like compounds as a case study.

Scientific contribution

PURE is a new approach to structure-constrained molecular generation that uses policy-guided training to reduce overfitting of SCMG models to task-specific metrics. It offers an efficient and distinctive way to generate novel, diverse molecules that are structurally similar to a target molecule without relying on user-defined similarity metrics. We show that PURE achieves competitive or better performance than other state-of-the-art methods on multiple benchmark tasks and, in a case study, produces several viable alternative molecules for reducing drug resistance.
Citations: 0
Improved estimation of intrinsic solubility of drug-like molecules through multi-task graph transformer
IF 5.7 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-10-13 | DOI: 10.1186/s13321-025-01106-0
Jiaxi Zhao, Eline Hermans, Kia Sepassi, Christophe Tistaert, Christel A. S. Bergström, Mazen Ahmad, Per Larsson

The aqueous solubility of a compound plays a crucial role throughout the various stages of drug discovery and development. Despite numerous efforts using various machine learning models, accurately estimating aqueous solubility remains a challenge. One primary limitation is the absence of a large, single-source dataset of drug-like compounds for model training. Additionally, studies have highlighted the need for improvements in prediction algorithms and molecular representations. To address these challenges, Johnson & Johnson (J&J) in-house solubility data were leveraged. Theoretical pH-solubility equations and in-house pKa prediction tools were used to calculate intrinsic solubility from the J&J data. A multi-task graph transformer model was developed and trained on the calculated intrinsic solubility data of 13,306 compounds along with seven relevant physicochemical properties, including solubility at pH 2 and pH 7, logP, and logD at three different pH values. When evaluated on high-quality test data, the model achieved a root mean square error (RMSE) of 0.61 and a coefficient of determination (R²) of 0.60, demonstrating state-of-the-art performance in estimating intrinsic solubility for drug-like compounds.

Scientific contribution

In this study, a multi-task graph transformer model was developed to estimate intrinsic solubility and other relevant physicochemical properties simultaneously. The model achieved state-of-the-art performance in estimating the intrinsic solubility of drug-like compounds and outperformed single-task learning approaches and Random Forest models.
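The abstract mentions "theoretical pH-solubility equations" without stating them. A common textbook form, assumed here for illustration, is the Henderson-Hasselbalch relation for a monoprotic compound: for an acid, S = S0 · (1 + 10^(pH − pKa)); for a base, S = S0 · (1 + 10^(pKa − pH)). The sketch below rearranges this to recover log S0 from a solubility measured at a known pH. The function name and example values are hypothetical, and the paper may use more general equations (for example, for multiprotic compounds).

```python
import math


def intrinsic_log_s(log_s: float, ph: float, pka: float, kind: str) -> float:
    """Estimate log S0 from log S at a given pH for a monoprotic acid or base.

    Assumes the Henderson-Hasselbalch pH-solubility relation:
      acid: S = S0 * (1 + 10 ** (pH - pKa))
      base: S = S0 * (1 + 10 ** (pKa - pH))
    """
    if kind == "acid":
        exponent = ph - pka
    elif kind == "base":
        exponent = pka - ph
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    # Subtract the ionization contribution to recover the neutral-form solubility.
    return log_s - math.log10(1.0 + 10.0 ** exponent)


# Hypothetical acid with pKa 4.0, measured log S = -2.0 at pH 7.0.
log_s0 = intrinsic_log_s(log_s=-2.0, ph=7.0, pka=4.0, kind="acid")
print(round(log_s0, 3))
```

For this acid, ionization at pH 7 raises the apparent solubility by about three log units, so the intrinsic value is correspondingly lower than the measured one.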
Citations: 0
Enhanced Thompson sampling by roulette wheel selection for screening ultralarge combinatorial libraries
IF 5.7 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-10-13 | DOI: 10.1186/s13321-025-01105-1
Hongtao Zhao, Eva Nittinger, Melissa A. Yu, Symon Gathiaka, W. Patrick Walters, Christian Tyrchan

Chemical space exploration has gained significant interest with the increasing availability of building blocks, enabling the creation of ultralarge virtual libraries containing billions or trillions of compounds. However, challenges remain in selecting the most suitable compounds for synthesis, especially in hit expansion. Thompson sampling, a probabilistic search method, has recently been proposed to improve efficiency by operating in reagent space rather than product space. Here, we address some of its limitations by introducing a roulette wheel selection method combined with a thermal cycling approach to balance greedy search and diversity-driven exploration. The effectiveness of this method is demonstrated through 109 queries against twenty distinct 1-million-compound libraries using ROCS.
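The abstract names roulette wheel selection and a thermal cycling approach but gives no formulas. The sketch below shows one way such a scheme could look, assuming Boltzmann-style weights in which a temperature parameter controls how strongly high-scoring reagents are favoured. The weighting function, the two temperatures, and the example scores are assumptions, not the authors' settings.

```python
import math
import random


def roulette_probabilities(scores, temperature):
    """Selection probabilities proportional to exp(score / temperature)."""
    best = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp((s - best) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def roulette_pick(scores, temperature, rng):
    """Draw one reagent index with probability given by its wheel slice."""
    probs = roulette_probabilities(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical sampled scores for three reagents (higher is better).
reagent_scores = [0.2, 0.5, 0.9]

cold = roulette_probabilities(reagent_scores, temperature=0.05)  # near-greedy
hot = roulette_probabilities(reagent_scores, temperature=1.0)    # exploratory

rng = random.Random(0)
picks = [roulette_pick(reagent_scores, 1.0, rng) for _ in range(5)]
print(cold, hot, picks)
```

Alternating between a low and a high temperature is one way to cycle between greedy search and diversity-driven exploration, the balance the abstract describes.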

Citations: 0
Chemical classification program synthesis using generative artificial intelligence
IF 5.7 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-10-01 | DOI: 10.1186/s13321-025-01092-3
Christopher J. Mungall, Adnan Malik, Daniel R. Korn, Justin T. Reese, Noel M. O’Boyle, Janna Hastings

Accurately classifying chemical structures is essential for cheminformatics and bioinformatics, including tasks such as identifying bioactive compounds of interest, screening molecules for toxicity to humans, finding non-organic compounds with desirable material properties, or organizing large chemical libraries for drug discovery or environmental monitoring. However, manual classification is labor-intensive and difficult to scale to large chemical databases. Existing automated approaches either rely on manually constructed classification rules or are deep learning methods that lack explainability. This work presents an approach that uses generative artificial intelligence to automatically write chemical classifier programs for classes in the Chemical Entities of Biological Interest (ChEBI) database. These programs can be used for efficient, deterministic run-time classification of SMILES structures, with natural language explanations. The programs themselves constitute an explainable, computable ontological model of chemical class nomenclature, which we call the ChEBI Chemical Class Program Ontology (C3PO). We validated our approach against the ChEBI database and compared our results against deep learning models and a naive SMARTS-pattern-based classifier. C3PO outperforms the naive classifier but does not reach the performance of state-of-the-art deep learning methods. However, C3PO has a number of strengths that complement deep learning methods, including explainability and reduced data dependence. C3PO can be used alongside deep learning classifiers to provide an explanation of the classification where both methods agree. The programs can be used as part of the ontology development process and iteratively refined by expert human curators.

We demonstrate a novel knowledge distillation technique in which the classifiers are programs that leverage the power of cheminformatics software libraries. We demonstrate its applicability to the classification of chemical structures and its contribution to the curation of chemical databases.
Citations: 0
Correction: Retrosynthetic crosstalk between single-step reaction and multi-step planning
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-30 DOI: 10.1186/s13321-025-01101-5
Junseok Choe, Hajung Kim, Yan Ting Chok, Mogan Gim, Jaewoo Kang
Citations: 0
TraceMetrix: a traceable metabolomics interactive analysis platform
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-29 DOI: 10.1186/s13321-025-01095-0
Wei Chen, Yanpeng An, Ziru Chen, Ruijin Luo, Qinwei Lu, Cong Li, Chenhan Zhang, Qingxia Huang, Qinsheng Chen, Lianglong Zhang, Xiaoxuan Yi, Yixue Li, Huiru Tang, Guoqing Zhang

Metabolomics data analysis is a multifaceted process often constrained by limited data sharing and a lack of transparency, which hinders the reproducibility of results. While existing bioinformatics tools address some of these challenges, achieving greater simplicity and operational clarity remains essential for fully leveraging the potential of metabolomics. Here, we introduce TraceMetrix, a web-based platform designed for interactive traceability in metabolomics data analysis. TraceMetrix provides a flexible management system for both raw and derived data, enabling comprehensive tracking of file origins and destinations throughout the entire analysis pipeline. The platform documents the software and parameters used across four key modules (raw data preprocessing, data cleaning, statistical analysis, and functional analysis), enabling users to easily track critical factors influencing result accuracy. By mapping upstream and downstream relationships for nearly 19 analytical functions, TraceMetrix ensures end-to-end traceability, viewable interactively online or exportable as detailed reports. To address the limitations of single-machine environments in processing large-scale datasets, TraceMetrix is deployed on a high-performance computing cluster for efficient batch processing. Using a non-targeted metabolomics dataset, we demonstrated its traceability function to optimize parameter selection, successfully reproducing the analysis process and validating the original study's findings. TraceMetrix integrates traceability across data, software, and processes, significantly enhancing reproducibility in metabolomics research. The platform supports diverse applications and is freely available at https://www.biosino.org/tracemetrix.
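
The idea of recording tools and parameters per step and mapping upstream and downstream relationships between files can be illustrated with a small lineage log. This is a sketch under stated assumptions, not TraceMetrix's actual data model or API: the `LineageLog` class, the step names, the parameter names, and the file names are all hypothetical.

```python
from collections import defaultdict


class LineageLog:
    """Records each analysis step with its tool, parameters, inputs and outputs."""

    def __init__(self):
        self.steps = []
        self.produced_by = {}                 # file -> index of the step that wrote it
        self.consumed_by = defaultdict(list)  # file -> indices of steps that read it

    def record(self, tool, params, inputs, outputs):
        index = len(self.steps)
        self.steps.append({"tool": tool, "params": dict(params),
                           "inputs": list(inputs), "outputs": list(outputs)})
        for name in inputs:
            self.consumed_by[name].append(index)
        for name in outputs:
            self.produced_by[name] = index
        return index

    def upstream(self, name):
        """All files that `name` was derived from, directly or indirectly."""
        seen, stack = set(), [name]
        while stack:
            step = self.produced_by.get(stack.pop())
            if step is None:
                continue  # a raw file: nothing recorded as producing it
            for parent in self.steps[step]["inputs"]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def downstream(self, name):
        """All files derived from `name`, directly or indirectly."""
        seen, stack = set(), [name]
        while stack:
            for step in self.consumed_by.get(stack.pop(), []):
                for child in self.steps[step]["outputs"]:
                    if child not in seen:
                        seen.add(child)
                        stack.append(child)
        return seen


# Hypothetical three-step pipeline; tool and parameter names are made up.
log = LineageLog()
log.record("preprocess", {"ppm": 10}, ["raw.mzML"], ["peaks.csv"])
log.record("clean", {"max_missing": 0.5}, ["peaks.csv"], ["clean.csv"])
log.record("stats", {"test": "t-test"}, ["clean.csv"], ["stats.csv"])

print(sorted(log.upstream("stats.csv")))
print(sorted(log.downstream("raw.mzML")))
```

Keeping the parameters alongside each step is what lets a later reader ask which settings influenced a given result file, which is the kind of question the abstract associates with traceability.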

TraceMetrix introduces a novel web-based platform for metabolomics data analysis that offers interactive traceability, ensuring comprehensive tracking of the entire analysis process. Unlike existing tools, TraceMetrix achieves traceability of data files, processes (analysis methods), and parameters through effective data management, significantly improving transparency and reproducibility. In addition, by being deployed on a high-performance computing cluster, it addresses the challenges of large-scale metabolomics data analysis.
Citations: 0
Subgrapher: visual fingerprinting of chemical structures
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-09-29 DOI: 10.1186/s13321-025-01091-4
Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Luc Van Gool, Peter W. J. Staar

Automatic extraction of molecules from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of molecule and Markush structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting fingerprints directly from images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables the retrieval of molecules and Markush structures. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecule and Markush structure depictions. The benchmark datasets, models, and inference code are publicly available.
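
How a substructure-based fingerprint can support retrieval may be sketched with a count fingerprint and a Tanimoto-style similarity. The abstract does not describe SubGrapher's actual fingerprint construction, so everything here is a stand-in: the substructure labels are assumed to come from some upstream detector, and `fingerprint`, `tanimoto`, `retrieve`, and the library contents are hypothetical.

```python
from collections import Counter


def fingerprint(detected_substructures):
    """Count how often each detected substructure label occurs."""
    return Counter(detected_substructures)


def tanimoto(a, b):
    """Count-based Tanimoto similarity: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    total = sum(max(a[k], b[k]) for k in keys)
    return shared / total if total else 0.0


def retrieve(query, library, top_k=1):
    """Return the names of the top_k library entries most similar to the query."""
    ranked = sorted(library, key=lambda name: tanimoto(query, library[name]),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical library; labels stand in for detected functional groups and backbones.
library = {
    "aspirin": fingerprint(["benzene", "carboxylic_acid", "ester"]),
    "phenol": fingerprint(["benzene", "hydroxyl"]),
    "acetic_acid": fingerprint(["carboxylic_acid"]),
}
query = fingerprint(["benzene", "ester", "carboxylic_acid"])

print(retrieve(query, library, top_k=2))
```

Because the comparison uses only the detected substructures, no full molecular graph has to be reconstructed before a query image can be matched against the library, which is the trade-off the abstract contrasts with conventional OCSR.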

SubGrapher introduces a novel approach that converts molecule and Markush structure images directly into fingerprints in a single step, bypassing conventional SMILES or graph reconstruction. It outperforms existing OCSR and fingerprinting methods in substructure detection and structure retrieval across diverse datasets, including Markush structure images.
Citations: 0