Journal of Cheminformatics: Latest Articles

Privileged structure-based molecular fingerprints for organic electronic materials: towards intuitive machine learning interpretation
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-27; DOI: 10.1186/s13321-026-01169-7
Tae Wook Yang, Seung Hyun Jo, Min Chul Suh

Molecular descriptors are central to the performance and interpretability of QSPR models, yet most existing fingerprints for organic electronics lack chemical relevance or interpretability. Here, we present the Organic Electronic Fingerprint (OEFP), a structure-based representation tailored for OLED and OPV materials. OEFP was constructed from a manually curated OLED dataset and publicly available OPV and chromophore datasets to ensure structural diversity. Synthetically accessible substructures were identified using fragmentation and ring decomposition methods, which capture the conjugated π-bonds crucial for organic electronic materials, and subsequently encoded as individual bits. In case studies of OPV HOMO energy prediction, OEFP achieved up to 13.7% lower MAE and 1.6% higher R² compared to a domain-mismatched structure-based fingerprint under random splits. Under scaffold-based splits designed to assess generalization, OEFP generalized substantially better, reducing MAE by 50–70% and increasing R² by over 150% relative to the same baseline, and achieved performance comparable to ECFP baselines evaluated at a similar fingerprint length when using count-based representations. SHAP analysis further enabled intuitive substructure-level interpretation, while OEFP’s chemically meaningful and synthetically relevant design supports rational molecular generation. Together, these results establish OEFP as an effective and interpretable molecular representation for machine learning applications in organic electronics.
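The abstract distinguishes substructures "encoded as individual bits" from "count-based representations". The following is a minimal sketch of that difference, not the authors' implementation. The fragment names in `VOCAB`, the `encode` function, and the example fragment list are all hypothetical, and the fragmentation step that would produce the fragment list from a real molecule is assumed to have happened elsewhere.

```python
from collections import Counter

# Hypothetical fragment vocabulary (assumption). The real OEFP vocabulary is
# derived from OLED, OPV, and chromophore datasets and is not reproduced here.
VOCAB = ["carbazole", "triazine", "thiophene", "benzene", "fluorene"]


def encode(fragments, vocab=VOCAB, use_counts=False):
    """Map a list of fragment names onto a fixed-length vector.

    With use_counts=False each position is a presence/absence bit;
    with use_counts=True each position holds the number of occurrences.
    Fragments that are not in the vocabulary are ignored.
    """
    counts = Counter(fragments)
    if use_counts:
        return [counts.get(name, 0) for name in vocab]
    return [1 if counts.get(name, 0) > 0 else 0 for name in vocab]


# Hypothetical output of a fragmentation step for one molecule.
mol_fragments = ["carbazole", "benzene", "benzene", "triazine"]

binary_fp = encode(mol_fragments)                  # presence bits
count_fp = encode(mol_fragments, use_counts=True)  # occurrence counts
```

In this toy case the two vectors differ only at the `benzene` position, which is where a count-based model can tell one phenyl ring from two. Because every position corresponds to a named substructure, a per-feature attribution such as SHAP can be read directly as "this fragment raised or lowered the prediction".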

Citations: 0
Graph-based transformer to predict the octanol-water partition coefficient.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-27; DOI: 10.1186/s13321-026-01160-2
Vyacheslav Grigorev, Nikita Serov, Timur Gimadiev, Assima Poyezzhayeva, Pavel Sidorov

Lipophilicity is a fundamental physicochemical property that significantly influences various aspects of drug behavior, such as solubility, permeability, metabolism, distribution, protein binding, and excretion. Consequently, accurate prediction of this property is critical for the successful discovery and development of new drug candidates. The classical metric for assessing lipophilicity is logP, defined as the partition coefficient between n-octanol and water at physiological pH 7.4. Recently, graph-based deep learning methods have gained considerable attention and demonstrated strong performance across diverse drug discovery tasks, from molecular property prediction to virtual screening. These models learn informative representations directly from molecular graphs in an end-to-end manner, without the need for handcrafted descriptors. In this work, we propose a logP prediction approach based on a fine-tuned pre-trained GraphormerMapper model, named GraphormerLogP. To evaluate its performance, the model was tested on two datasets: one is compiled by us from publicly available sources and contains 42 006 unique SMILES-logP pairs (named GLP); the second consists of 13 688 molecules and is used for benchmarking purposes. Our comparative analysis against state-of-the-art models (Random Forest, Chemprop, CheMeleon, StructGNN, and Attentive FP) demonstrates that GraphormerLogP consistently achieves competitive or superior predictive accuracy across both datasets, attaining mean absolute error values of 0.251 and 0.269, respectively. The GLP dataset is available in the GitHub repository https://github.com/cimm-kzn/GraphormerLogP/tree/main/data.

Scientific contribution: This paper presents two key scientific contributions. First, we have collected and carefully curated a large and diverse dataset of molecules with measured logP values, comprising over 42 000 compounds. Second, we propose a Graphormer-based model with a task-specific fine-tuning architecture for logP prediction, tailored to leverage representations learned from reaction data. This model demonstrates high performance in benchmark studies on both established literature data and the newly compiled dataset.
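To make the reported quantities concrete, here is a small sketch of how a logP value follows from equilibrium concentrations and how a mean absolute error (MAE) such as the reported 0.251 and 0.269 would be computed from predictions. The function names and the four example values are illustrative assumptions and are not taken from the GLP dataset or the paper's code.

```python
import math


def log_p(conc_octanol, conc_water):
    """Base-10 logarithm of the octanol/water concentration ratio."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (assumption), in logP units.
measured = [1.0, 2.5, -0.5, 3.0]
predicted = [1.25, 2.25, -0.25, 3.5]

example_logp = log_p(100.0, 1.0)  # a 100:1 octanol/water ratio
example_mae = mean_absolute_error(measured, predicted)
```

Because logP is a logarithm, an MAE near 0.25 corresponds to predictions that are typically off by a factor of roughly 10^0.25, or about 1.8, in the underlying concentration ratio.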

Citations: 0
PROTAC-Splitter: a machine learning framework for automated identification of PROTAC substructures
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-20; DOI: 10.1186/s13321-025-01135-9
Stefano Ribes, Ranxuan Zhang, Télio Cropsal, Anders Källberg, Christian Tyrchan, Eva Nittinger, Rocío Mercado

Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules composed of an E3 ligase ligand, a linker, and a warhead targeting a protein of interest. Despite their modular structure, accurately identifying and annotating these components in PROTACs is challenging and typically relies on manual curation and predefined substructure matching. To address this, we developed PROTAC-Splitter, a machine learning framework designed for automated annotation of PROTAC substructures. To address data scarcity, we generated and openly released a synthetic dataset containing approximately 1.3 million PROTAC structures with annotated ligand splits. Leveraging this dataset, we developed two complementary approaches for PROTAC substructure annotation: a Transformer-based sequence-to-sequence model and a graph-based XGBoost model. We evaluated both approaches on held-out public data and structurally novel PROTACs from AstraZeneca’s proprietary collection. The Transformer-based model achieved high exact-match accuracy (86%) on public data but dropped significantly (18%) on structurally novel internal PROTACs due to occasional hallucinations. In contrast, the XGBoost model can ensure chemical validity and perfect reassembly accuracy on both sets, with lower exact-match accuracy on open data (42.2%) but comparable performance on the internal set (23%). To improve reliability, we implemented a wrapper function for the Transformer (Transformer-Δ), which corrects partial prediction errors, raising reassembly accuracy to 96% on public and 70% on internal datasets. Combining the strengths of both models, we propose a hybrid approach that reliably annotates PROTACs across diverse chemical spaces. PROTAC-Splitter provides a robust, scalable tool to facilitate automated PROTAC analysis and is available open-source at https://github.com/ribesstefano/PROTAC-Splitter.
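The abstract reports two different scores, exact-match accuracy and reassembly accuracy. The sketch below illustrates why they can diverge, under loud assumptions: the component strings are placeholders rather than real SMILES, and `toy_reassemble` is a hypothetical stand-in (plain string concatenation) for the chemistry-aware reassembly a real pipeline would need. It is not the PROTAC-Splitter API.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predicted (E3 ligand, linker, warhead) triples that
    equal the reference triple component by component."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, r in zip(predictions, references) if p == r)
    return hits / len(references)


def reassembly_accuracy(predictions, originals, reassemble):
    """Fraction of predicted triples that rebuild the original molecule,
    regardless of where the component boundaries were placed."""
    if len(predictions) != len(originals) or not originals:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, o in zip(predictions, originals) if reassemble(p) == o)
    return hits / len(originals)


def toy_reassemble(parts):
    # Hypothetical stand-in: real reassembly would join fragments at their
    # attachment points and compare canonical structures.
    return "".join(parts)


# Placeholder component strings (assumption), not real structures.
references = [("E3a", "link1", "WHa"), ("E3b", "link2", "WHb")]
predictions = [("E3a", "link1", "WHa"), ("E3bl", "ink2", "WHb")]
originals = [toy_reassemble(r) for r in references]

exact = exact_match_accuracy(predictions, references)
reassembled = reassembly_accuracy(predictions, originals, toy_reassemble)
```

The second prediction places the ligand/linker boundary one position too late, so it fails the exact-match test yet still rebuilds the original molecule. That is the situation in which reassembly accuracy exceeds exact-match accuracy.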

Citations: 0
How quantum-chemical geometry optimization level affects classical 3D descriptors and QSAR performance: a comparative study.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-19; DOI: 10.1186/s13321-026-01166-w
Jianmin Li, Rongling Gu, Shijie Du, Lu Xu

Accurate representation of three-dimensional (3D) molecular structures is essential for quantitative structure-activity relationship (QSAR) modeling; however, it remains unclear whether increasing the level of theory used for quantum-chemical geometry optimization yields a practically meaningful benefit for classical conformation-dependent 3D descriptors (Dragon 3D) and the resulting QSAR performance. Here, we benchmark eight commonly used quantum-chemical (QM) geometry-optimization protocols, ranging from minimal-basis Hartree-Fock (HF/STO-3G) to def2-based hybrid density functional theory (DFT) and the composite method r²SCAN-3c, across three anticancer activity datasets and ten machine-learning classifiers. Descriptor-level analyses (relative deviation, rank correlation, and chemical-space similarity) reveal systematic method dependence in descriptor magnitudes: high-accuracy def2-based DFT protocols produce highly consistent descriptor spaces, whereas some intermediate/low-level settings introduce larger variability, although molecular rankings remain largely robust (Spearman ρ > 0.95). In contrast, downstream QSAR performance is only weakly affected by the QM level. Across paired dataset × model blocks (n = 30), mean balanced accuracies cluster tightly (0.852-0.871). B3LYP/3-21G achieves the highest overall mean balanced accuracy (BA) (0.8709; 95% CI 0.8565-0.8840), while the lowest mean is observed for HF/STO-3G (0.8518; 95% CI 0.8371-0.8661); def2-based B3LYP methods are numerically slightly lower (∼0.855-0.856). A repeated-measures omnibus test indicates a statistically detectable method effect (Friedman p = 0.006) but with a small effect size (Kendall's W = 0.094), and post-hoc Wilcoxon tests with Holm correction identify only one robust pairwise difference (B3LYP/3-21G vs. HF/STO-3G, p_Holm = 0.025). Thus, the observed performance shifts are marginal in magnitude (≤1-2%) compared with the 10-100× differences in computational cost. To support pragmatic method selection, we propose a two-tier Absolute Efficiency Ratio (AER) framework integrating predictive performance with efficiency and methodological considerations. Overall, these results indicate a non-linear and practically weak relationship between QM geometry-optimization level, classical 3D descriptor fidelity, and QSAR performance, suggesting that QM-level upgrades mainly reshape descriptor values without yielding commensurate or actionable gains in predictive accuracy.
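The reported effect size and the omnibus test are linked by a standard identity: Kendall's W equals the Friedman chi-square statistic divided by N(k − 1), where N is the number of blocks and k the number of compared methods. The sketch below applies that identity to the figures in the abstract. It assumes k = 8 (the eight protocols) and N = 30 (the paired dataset × model blocks), and the function names are illustrative.

```python
def friedman_chi2_from_w(w, n_blocks, k_methods):
    """Friedman chi-square statistic implied by Kendall's W."""
    if n_blocks < 1 or k_methods < 2:
        raise ValueError("need at least one block and two methods")
    return w * n_blocks * (k_methods - 1)


def kendalls_w(chi2, n_blocks, k_methods):
    """Kendall's W as the Friedman statistic normalized to [0, 1]."""
    if n_blocks < 1 or k_methods < 2:
        raise ValueError("need at least one block and two methods")
    return chi2 / (n_blocks * (k_methods - 1))


# Values from the abstract; k = 8 and N = 30 are this sketch's reading of
# "eight protocols" and "n = 30 blocks".
chi2 = friedman_chi2_from_w(0.094, n_blocks=30, k_methods=8)
w_back = kendalls_w(chi2, n_blocks=30, k_methods=8)
```

Under these assumptions the implied statistic is about 19.7 on k − 1 = 7 degrees of freedom. W is bounded by 1, so a value of 0.094 means the classifiers agree only weakly on how to rank the eight protocols, which is how a small p-value and a small effect size can coexist.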

Citations: 0
Addressing model overcomplexity in drug-drug interaction prediction with molecular fingerprints.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-19; DOI: 10.1186/s13321-025-01128-8
Manel Gil-Sorribes, Alexis Molina

Accurately predicting interactions between drugs is critical for pharmaceutical research and clinical safety. The literature keeps moving toward increasingly complex architectures, yet gains on standard benchmarks are often small. We use a deliberately simple setup that keeps the classifier fixed and swaps only the molecular representation. We compare ECFP4 Morgan fingerprints (MFPs), pretrained graph convolutional networks (GCNs), and MoLFormer embeddings on common DrugBank DDI splits and on an FDA drug-drug affinity benchmark. This design lets us isolate the effect of representation and of split design with minimal confounders. Across leak-proof DrugBank splits and the FDA task, MFPs with a shallow head match or surpass much more complex graph and knowledge graph systems while using far fewer parameters. On the Unseen DDI split, MFPs reach AUROC 99.4 and AUPR 98.4, slightly ahead of MoLFormer and the prior state of the art. When we impose a strict scaffold out-of-distribution split, a pretrained GCN leads with AUROC 73.99, which pinpoints when extra capacity helps. On standard splits the absolute gains from added complexity are small. We therefore argue that progress will come mainly from better curated datasets and from rigorous out-of-distribution evaluation, rather than from ever more elaborate models.
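The headline numbers here are AUROC values, presumably reported on a 0-100 scale. As a reminder of what the metric measures, the following sketch computes AUROC directly from its definition: the probability that a randomly chosen interacting pair is scored above a randomly chosen non-interacting pair, with ties counted as one half. The labels and scores are made-up illustrative data, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of model scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up example data (assumption): 1 = interacting drug pair.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.1]

example_auroc = auroc(labels, scores)
```

Here eight of the nine positive/negative pairs are ordered correctly. On this scale 0.5 corresponds to random ranking, which puts the 73.99 obtained under the scaffold out-of-distribution split in context against the 99.4 on the Unseen DDI split.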

Citations: 0
FAME3R: an efficient, practical and reliable open-source tool for predicting phase 1 and phase 2 sites of metabolism.
IF 5.7; Zone 2 (Chemistry); Q1 CHEMISTRY, MULTIDISCIPLINARY; Pub Date: 2026-02-14; DOI: 10.1186/s13321-026-01161-1
Roxane Axel Jacob, Leo Gaskin, Thomas Seidel, Ya Chen, Angelica Mazzolari, Johannes Kirchmair

Predicting likely sites of metabolism (SOMs), i.e., the atoms in a molecule where metabolic reactions are initiated, is an important component of the computational development pipeline for pharmaceuticals, agrochemicals, and cosmetics. Among SOM prediction tools, FAME3, introduced in 2019, is one of only a few non-commercial models capable of predicting both Phase 1 and Phase 2 SOMs for a wide range of xenobiotics. However, its original implementation posed challenges in maintainability, scalability, and interoperability, which hindered broader adoption. To overcome these limitations, we developed FAME3R, an enhanced version of FAME3 designed to improve computational efficiency and facilitate integration with contemporary cheminformatics workflows. FAME3R introduces several new features, including a novel reliability assessment method based on Shannon entropy and the option to select among various featurization strategies. The tool is available as an open-source Python package, offering both a Python API and a CLI for flexible usage. Additionally, trained FAME3R models can be accessed via a GUI and a REST API hosted on the NERDD web platform.
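The abstract mentions a reliability assessment based on Shannon entropy without giving the formula. One plausible reading, sketched below purely as an assumption, treats each atom's predicted site-of-metabolism probability as a Bernoulli distribution and uses one minus its binary entropy as a reliability score. The function names and this exact formulation are hypothetical and may differ from what FAME3R actually implements.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) distribution."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # limit of p * log2(p) as p approaches 0 is 0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def reliability(p):
    """Hypothetical reliability score: 1 for a fully confident prediction,
    0 for a maximally uncertain one (p = 0.5)."""
    return 1.0 - binary_entropy(p)


# Illustrative per-atom probabilities (assumption).
atom_probabilities = [0.97, 0.50, 0.10]
atom_reliabilities = [reliability(p) for p in atom_probabilities]
```

Under this reading, an atom predicted at 0.50 receives a reliability of zero, while confident predictions in either direction (0.97 or 0.10) score higher. Note that the score reflects the model's certainty, not whether the prediction is correct.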

Citations: 0
Integrating artificial intelligence and manual curation to enhance bioassay annotations in ChEMBL
IF 5.7 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2026-02-12 DOI: 10.1186/s13321-026-01165-x
Ines Smit, Melissa F. Adasme, Emma Manners, Sybilla Corbett, Nicolas Bosc, Hoang-My-Anh Do, Andrew R. Leach, Noel M. O’Boyle, Barbara Zdrazil

As the volume and diversity of bioactivity data in ChEMBL continue to grow, ensuring that assay metadata is standardized, interoperable, and machine-readable is critical for effective use in cheminformatics and ML applications. In this work, we present recent efforts to enhance the quality and granularity of bioassay annotations in ChEMBL through a combination of manual and semi-manual curation and AI-driven approaches. We introduce a “perfect assay description” template to guide consistent annotation and demonstrate how natural language processing techniques and multi-class classification can be used to automatically extract key assay parameters and assign broad assay categories for legacy data. We report on the development, validation, and application of a spaCy-based NER model that identifies experimental methods with high precision and recall, as well as a complementary classification model that refines ASSAY_TYPE categorization beyond the existing schema. In addition, we describe improvements to metadata extraction for ADME endpoints, organism and protein variant annotations, and ontology linking using tools such as text2term. Together, these enhancements significantly advance the FAIRness of ChEMBL’s bioassay data, enabling more robust downstream analyses and more precise compound-target activity modeling.
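For readers unfamiliar with how precision and recall are usually reported for an NER model, the following sketch computes both at the exact-span level. It is a generic illustration, not the evaluation code used for ChEMBL; the `METHOD` label and the character offsets are made up for the example.

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """Exact-match precision and recall over (start, end, label) entity spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations for one assay description (character offsets are invented).
gold_spans = {(0, 17, "METHOD"), (30, 35, "METHOD"), (52, 60, "METHOD")}
predicted_spans = {(0, 17, "METHOD"), (30, 35, "METHOD")}

precision, recall = precision_recall(predicted_spans, gold_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here every predicted span is correct, so precision is 1.0, while one gold span is missed, so recall is two thirds.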

Citations: 0
ChemGraphX: an open-source web tool for computing topological indices and entropy measures.
IF 5.7 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2026-02-11 DOI: 10.1186/s13321-025-01141-x
Kavin Jacob, Joseph Clement, Micheal Arockiaraj, Aditya Jindal

Topological indices are invariants derived from graphs, and computing these measures for chemical frameworks is found to have a great impact on quantitative structure-activity/property relationships (QSAPR) analysis by providing insights into the structural properties of chemical frameworks. However, calculating and validating these indices for large chemical systems poses computational challenges. To address this, we have developed ChemGraphX, an open-source web tool that efficiently computes various distance- and degree-based topological indices, along with entropy measures. ChemGraphX accepts diverse input formats, including .pdb, .mol, adjacency lists, and adjacency matrices, making it versatile for both chemical and graph-theoretical applications. This article presents a brief description of ChemGraphX, provides a comparative analysis with existing tools, and discusses the capability and validation of ChemGraphX in computational and graph-theoretical chemistry.
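To make the two families of indices concrete, the sketch below computes one distance-based index (the Wiener index, the sum of shortest-path distances over all atom pairs) and one degree-based index (the first Zagreb index, the sum of squared vertex degrees) from an adjacency list. This is a minimal illustration that assumes a connected, hydrogen-suppressed graph; it is not ChemGraphX code, and the abstract does not say which indices the tool implements.

```python
from collections import deque


def wiener_index(adj: dict) -> int:
    """Sum of shortest-path distances over all unordered vertex pairs (connected graph)."""
    total = 0
    for source in adj:
        # Breadth-first search gives shortest distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total // 2  # each pair was counted from both endpoints


def first_zagreb_index(adj: dict) -> int:
    """Sum of squared vertex degrees."""
    return sum(len(neighbors) ** 2 for neighbors in adj.values())


# Carbon skeleton of n-butane as an adjacency list: C0-C1-C2-C3.
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wiener_index(butane), first_zagreb_index(butane))  # prints: 10 10
```

For the four-carbon chain the pairwise distances are 1, 2, 3, 1, 2, 1, giving a Wiener index of 10, and the degrees 1, 2, 2, 1 give a first Zagreb index of 10 as well.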

Citations: 0
TheorChem2Blender: an open-source tool to convert computational chemistry outputs into Blender-ready 3D objects.
IF 5.7 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2026-02-11 DOI: 10.1186/s13321-026-01162-0
Emmanuel Echeverri-Jimenez

TheorChem2Blender is designed to connect computational chemistry and 3D modeling without requiring extensive technical expertise from either field, bridging the gap between the two disciplines. Developed in Python, the tool supports common chemical input formats (com, xyz, mol2) to create 3D files for animations, visualization, extended reality media, and 3D printing (glb, fbx, obj, dae, stl). The program allows users to customize atoms and bonds and animate chemical compounds using a variety of representational systems. Pilot testing with expert computational chemists and 3D modelers informed the graphical user interface and cross-platform compatibility. TheorChem2Blender is freely available under the Apache 2.0 license, with comprehensive documentation and tutorials to support adoption by chemists, educators, and 3D artists.

Scientific Contribution: TheorChem2Blender is an accessible tool with minimal technical knowledge requirements that bridges the gap between 3D modeling and computational chemistry by providing an interface to convert chemical input formats into 3D files. A key contribution of the tool is its support for customizable structural features, such as atom highlighting, adjustable bond orders, and multiple representational styles. Moreover, the program is the first open‑source tool capable of directly converting xyz and com files into fully manipulable fbx and glb 3D animations. This feature facilitates integration of in-silico simulations into web‑based and extended‑reality pipelines, enabling accurate visualization of chemical structures and transformations.
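As an illustration of the first step any such converter has to perform, the sketch below reads a molecule in the plain XYZ format (atom count, comment line, then one `element x y z` line per atom). The `parse_xyz` helper is hypothetical and written for this example; it does not reflect how TheorChem2Blender itself parses input, and the water coordinates are approximate.

```python
def parse_xyz(text: str) -> tuple:
    """Parse a single-frame XYZ block into (comment, [(symbol, x, y, z), ...])."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])  # first line: number of atoms
    comment = lines[1].strip() if len(lines) > 1 else ""  # second line: free-text comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


# Approximate water geometry in angstroms, used only as sample input.
water_xyz = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

comment, atoms = parse_xyz(water_xyz)
print(comment, len(atoms), atoms[0])
```

The resulting list of element symbols and coordinates is the kind of intermediate structure from which spheres and cylinders for atoms and bonds could then be generated in a 3D package.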

Citations: 0
Enzyformer: a two-stage pretrained model for enzymatic retrosynthesis
IF 5.7 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2026-02-09 DOI: 10.1186/s13321-026-01164-y
Tiantao Liu, Jiangcheng Xu, Xinke Zhan, Shaolong Lin, Shirley W. I. Siu

Although enzymatic retrosynthesis planning is widely used to accelerate drug synthesis and discovery, it still faces key challenges, including limited knowledge transfer from general chemical space and incomplete EC number prediction. Here, we introduce Enzyformer, an innovative model that integrates the pretraining of molecules and reactions to capture both the syntax of SMILES and the transformation rules of organic reactions. With this two-stage pretraining strategy, Enzyformer improves top-1 and top-10 accuracies in retrosynthesis prediction by 7.5% and 11.7%, respectively, compared with the baseline R-SMILES. In addition, our pipeline includes EC number prediction as a sequential task, using a contrastive learning model to improve performance on the ECREACT dataset. With the contrastive learning model, we achieve the highest average F1-score of 0.924 for first-level EC number prediction. In conclusion, Enzyformer offers a promising solution for improving the accuracy and interpretability of enzymatic retrosynthesis prediction, which may help to reduce researchers’ workload and cost while accelerating drug design and mitigating the environmental impact of drug synthesis.
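The top-1 and top-10 figures quoted above are top-k accuracies: a prediction counts as correct when the reference precursor appears among the model's k highest-ranked candidates. The sketch below shows that metric on invented data. It compares SMILES as raw strings, which is an assumption made for brevity; a real evaluation would normally canonicalize them first, and none of the values here come from the paper.

```python
def top_k_accuracy(ranked_predictions: list, references: list, k: int) -> float:
    """Fraction of examples whose reference appears in the first k ranked candidates."""
    hits = sum(
        1
        for candidates, reference in zip(ranked_predictions, references)
        if reference in candidates[:k]
    )
    return hits / len(references)


# Invented ranked candidate lists (best first) and reference answers.
ranked_predictions = [
    ["CCO", "CC=O"],
    ["CC(=O)O", "CCO"],
    ["c1ccccc1", "CCN"],
]
references = ["CCO", "CCO", "CCC"]

print(top_k_accuracy(ranked_predictions, references, k=1))
print(top_k_accuracy(ranked_predictions, references, k=2))
```

In this toy set one of three references is ranked first and a second one is ranked second, so the accuracy rises from one third at k=1 to two thirds at k=2.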

Citations: 0