Pub Date: 2026-02-27
DOI: 10.1186/s13321-026-01169-7
Tae Wook Yang, Seung Hyun Jo, Min Chul Suh
Molecular descriptors are central to the performance and interpretability of QSPR models, yet most existing fingerprints for organic electronics lack chemical relevance or interpretability. Here, we present the Organic Electronic Fingerprint (OEFP), a structure-based representation tailored for OLED and OPV materials. OEFP was constructed from a manually curated OLED dataset and publicly available OPV and chromophore datasets to ensure structural diversity. Synthetically accessible substructures were identified using fragmentation and ring decomposition methods, which capture the conjugated π-bonds crucial for organic electronic materials, and subsequently encoded as individual bits. In case studies of OPV HOMO energy prediction, OEFP achieved up to 13.7% lower MAE and 1.6% higher R² than a domain-mismatched structure-based fingerprint under random splits. Under scaffold-based splits designed to assess generalization, OEFP generalized substantially better, reducing MAE by 50–70% and increasing R² by over 150% relative to the same baseline; with count-based representations, it also performed comparably to ECFP baselines evaluated at a similar fingerprint length. SHAP analysis further enabled intuitive substructure-level interpretation, while OEFP’s chemically meaningful and synthetically relevant design supports rational molecular generation. Together, these results establish OEFP as an effective and interpretable molecular representation for machine learning applications in organic electronics.
Title: Privileged structure-based molecular fingerprints for organic electronic materials: towards intuitive machine learning interpretation. Journal of Cheminformatics 18 (2026). PDF: https://link.springer.com/content/pdf/10.1186/s13321-026-01169-7.pdf
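The MAE and R² comparisons quoted in the OEFP abstract are relative changes against a baseline fingerprint. The sketch below shows how such percentages could be computed; the HOMO energies and predictions are made-up illustrative numbers, not data from the paper, and the function names are hypothetical.

```python
from statistics import fmean


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted values."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_change_pct(baseline, candidate):
    """Signed percentage change of `candidate` relative to `baseline`."""
    return 100.0 * (candidate - baseline) / baseline


# Illustrative (invented) HOMO energies in eV and two sets of predictions.
y_true = [-5.0, -5.2, -5.4, -5.6]
baseline_pred = [-5.2, -5.0, -5.6, -5.4]   # hypothetical baseline fingerprint
candidate_pred = [-5.1, -5.1, -5.5, -5.5]  # hypothetical improved fingerprint

mae_change = relative_change_pct(mae(y_true, baseline_pred), mae(y_true, candidate_pred))
r2_change = relative_change_pct(r2(y_true, baseline_pred), r2(y_true, candidate_pred))
print(f"MAE change: {mae_change:.1f}%  R2 change: {r2_change:.1f}%")
```

A negative MAE change and a positive R² change both indicate improvement, which is why the abstract reports "lower MAE" and "higher R²" side by side. When the baseline R² is small, the relative R² change can exceed 100%.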
Pub Date: 2026-02-27
DOI: 10.1186/s13321-026-01160-2
Vyacheslav Grigorev, Nikita Serov, Timur Gimadiev, Assima Poyezzhayeva, Pavel Sidorov
Lipophilicity is a fundamental physicochemical property that significantly influences various aspects of drug behavior, such as solubility, permeability, metabolism, distribution, protein binding, and excretion. Consequently, accurate prediction of this property is critical for the successful discovery and development of new drug candidates. The classical metric for assessing lipophilicity is logP, defined as the partition coefficient between n-octanol and water at physiological pH 7.4. Recently, graph-based deep learning methods have gained considerable attention and demonstrated strong performance across diverse drug discovery tasks, from molecular property prediction to virtual screening. These models learn informative representations directly from molecular graphs in an end-to-end manner, without the need for handcrafted descriptors. In this work, we propose a logP prediction approach based on a fine-tuned pre-trained GraphormerMapper model, named GraphormerLogP. To evaluate its performance, the model was tested on two datasets: one is compiled by us from publicly available sources and contains 42 006 unique SMILES-logP pairs (named GLP); the second consists of 13 688 molecules and is used for benchmarking purposes. Our comparative analysis against state-of-the-art models (Random Forest, Chemprop, CheMeleon, StructGNN, and Attentive FP) demonstrates that GraphormerLogP consistently achieves competitive or superior predictive accuracy across both datasets, attaining mean absolute error values of 0.251 and 0.269, respectively. The GLP dataset is available in the GitHub repository: https://github.com/cimm-kzn/GraphormerLogP/tree/main/data
Scientific contribution: This paper presents two key scientific contributions. First, we have collected and carefully curated a large and diverse dataset of molecules with measured logP values, comprising over 42 000 compounds. Second, we propose a Graphormer-based model with a task-specific fine-tuning architecture for logP prediction, tailored to leverage representations learned from reaction data. This model demonstrates high performance in benchmark studies on both established literature data and the newly compiled dataset.
Title: Graph-based transformer to predict the octanol-water partition coefficient. Journal of Cheminformatics (2026).
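For readers less familiar with the quantity being predicted in the GraphormerLogP abstract, logP is conventionally the base-10 logarithm of the ratio of a solute's equilibrium concentrations in n-octanol and water. The sketch below is a minimal illustration of that definition only; it ignores ionization and pH effects, the concentrations are invented, and the function name is hypothetical.

```python
import math


def log_p(conc_octanol, conc_water):
    """log10 of the octanol/water concentration ratio (same units for both)."""
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


# A solute 100x more concentrated in octanol than in water has logP = 2.
print(log_p(100.0, 1.0))

# Equal partitioning between the two phases gives logP = 0.
print(log_p(3.5, 3.5))
```

Because logP is a logarithm, a mean absolute error of about 0.25 log units, as reported in the abstract, corresponds to a partition ratio that is off by a factor of roughly 10^0.25, or about 1.8.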
Pub Date: 2026-02-20
DOI: 10.1186/s13321-025-01135-9
Stefano Ribes, Ranxuan Zhang, Télio Cropsal, Anders Källberg, Christian Tyrchan, Eva Nittinger, Rocío Mercado
Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules composed of an E3 ligase ligand, a linker, and a warhead targeting a protein of interest. Despite their modular structure, accurately identifying and annotating these components in PROTACs is challenging and typically relies on manual curation and predefined substructure matching. To address this, we developed PROTAC-Splitter, a machine learning framework designed for automated annotation of PROTAC substructures. To address data scarcity, we generated and openly released a synthetic dataset containing approximately 1.3 million PROTAC structures with annotated ligand splits. Leveraging this dataset, we developed two complementary approaches for PROTAC substructure annotation: a Transformer-based sequence-to-sequence model and a graph-based XGBoost model. We evaluated both approaches on held-out public data and structurally novel PROTACs from AstraZeneca’s proprietary collection. The Transformer-based model achieved high exact-match accuracy (86%) on public data but dropped significantly (18%) on structurally novel internal PROTACs due to occasional hallucinations. In contrast, the XGBoost model can ensure chemical validity and perfect reassembly accuracy on both sets, with lower exact-match accuracy on open-data (42.2%) but comparable performance on the internal set (23%). To improve reliability, we implemented a wrapper function for the Transformer (Transformer-Δ), which corrects partial prediction errors, raising reassembly accuracy to 96% on public and 70% on internal datasets. Combining the strengths of both models, we propose a hybrid approach that reliably annotates PROTACs across diverse chemical spaces. PROTAC-Splitter provides a robust, scalable tool to facilitate automated PROTAC analysis and is available open-source at https://github.com/ribesstefano/PROTAC-Splitter
Title: PROTAC-Splitter: a machine learning framework for automated identification of PROTAC substructures. Journal of Cheminformatics 18 (2026). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12924545/pdf/
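The PROTAC-Splitter abstract reports two different metrics, exact-match accuracy and reassembly accuracy, and the gap between them is easier to see on a toy example. The sketch below is not the authors' implementation: molecules are stood in for by plain strings, and "reassembly" is assumed to be simple concatenation, whereas a real pipeline would join SMILES fragments at attachment points and compare canonical structures. All names are hypothetical.

```python
from typing import NamedTuple


class Split(NamedTuple):
    """A predicted or reference three-way split of one molecule."""
    warhead: str
    linker: str
    e3_ligand: str


def reassemble(split):
    """Toy stand-in for chemical reassembly: concatenate the three parts."""
    return split.warhead + split.linker + split.e3_ligand


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to the reference split."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def reassembly_accuracy(predictions, molecules):
    """Fraction of predictions whose parts rebuild the input molecule."""
    hits = sum(reassemble(p) == m for p, m in zip(predictions, molecules))
    return hits / len(molecules)


molecules = ["AAABBCCC", "XXYYZZ"]
references = [Split("AAA", "BB", "CCC"), Split("XX", "YY", "ZZ")]
# The second prediction cuts the linker in the wrong place but loses no atoms.
predictions = [Split("AAA", "BB", "CCC"), Split("XXY", "Y", "ZZ")]

print(exact_match_accuracy(predictions, references))
print(reassembly_accuracy(predictions, molecules))
```

In this toy case a misplaced cut lowers exact-match accuracy while leaving reassembly accuracy untouched, which illustrates how a model can score well on one metric and poorly on the other.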
Pub Date: 2026-02-19
DOI: 10.1186/s13321-026-01166-w
Jianmin Li, Rongling Gu, Shijie Du, Lu Xu
Accurate representation of three-dimensional (3D) molecular structures is essential for quantitative structure-activity relationship (QSAR) modeling; however, it remains unclear whether increasing the level of theory used for quantum-chemical geometry optimization yields a practically meaningful benefit for classical conformation-dependent 3D descriptors (Dragon 3D) and the resulting QSAR performance. Here, we benchmark eight commonly used quantum-chemical (QM) geometry-optimization protocols, from minimal-basis Hartree-Fock (HF/STO-3G) to def2-based hybrid density functional theory (DFT) and the composite method r²SCAN-3c, across three anticancer activity datasets and ten machine-learning classifiers. Descriptor-level analyses (relative deviation, rank correlation, and chemical-space similarity) reveal systematic method dependence in descriptor magnitudes: high-accuracy def2-based DFT protocols produce highly consistent descriptor spaces, whereas some intermediate/low-level settings introduce larger variability, although molecular rankings remain largely robust (Spearman ρ > 0.95). In contrast, downstream QSAR performance is only weakly affected by the QM level. Across paired dataset × model blocks (n = 30), mean balanced accuracies cluster tightly (0.852-0.871). B3LYP/3-21G achieves the highest overall mean balanced accuracy (BA) (0.8709; 95% CI 0.8565-0.8840), while the lowest mean is observed for HF/STO-3G (0.8518; 95% CI 0.8371-0.8661); def2-based B3LYP methods are numerically slightly lower (~0.855-0.856). A repeated-measures omnibus test indicates a statistically detectable method effect (Friedman p = 0.006) but with a small effect size (Kendall's W = 0.094), and post-hoc Wilcoxon tests with Holm correction identify only one robust pairwise difference (B3LYP/3-21G vs. HF/STO-3G, Holm-adjusted p = 0.025). Thus, the observed performance shifts are marginal in magnitude (≤1-2%) compared with the 10-100× differences in computational cost. To support pragmatic method selection, we propose a two-tier Absolute Efficiency Ratio (AER) framework integrating predictive performance with efficiency and methodological considerations. Overall, these results indicate a non-linear and practically weak relationship between QM geometry-optimization level, classical 3D descriptor fidelity, and QSAR performance, suggesting that QM-level upgrades mainly reshape descriptor values without yielding commensurate or actionable gains in predictive accuracy.
Title: How quantum-chemical geometry optimization level affects classical 3D descriptors and QSAR performance: a comparative study. Journal of Cheminformatics (2026).
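The statistics named in the QM geometry-optimization abstract (Friedman test, Kendall's W, Holm correction) are connected by simple formulas. The sketch below uses the standard relation W = χ² / (n(k − 1)) for n blocks and k methods, assuming no tie correction, plus a plain Holm step-down adjustment. The p-values in the example are invented for illustration and the function names are hypothetical; this is not the authors' analysis code.

```python
def kendalls_w(friedman_chi2, n_blocks, k_methods):
    """Kendall's W from a Friedman chi-square statistic (no tie correction)."""
    return friedman_chi2 / (n_blocks * (k_methods - 1))


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# With n = 30 blocks and k = 8 protocols, a chi-square of about 19.7
# corresponds to W of about 0.094, i.e. weak agreement across blocks.
print(round(kendalls_w(19.74, n_blocks=30, k_methods=8), 3))

# Invented raw p-values for three pairwise comparisons.
print(holm_adjust([0.01, 0.04, 0.03]))
```

W ranges from 0 (no consistent ranking of methods across blocks) to 1 (identical ranking in every block), so a value near 0.09 is compatible with a detectable but practically small method effect.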
Pub Date: 2026-02-19
DOI: 10.1186/s13321-025-01128-8
Manel Gil-Sorribes, Alexis Molina
Accurately predicting interactions between drugs is critical for pharmaceutical research and clinical safety. The literature keeps moving toward increasingly complex architectures, yet gains on standard benchmarks are often small. We use a deliberately simple setup that keeps the classifier fixed and swaps only the molecular representation. We compare ECFP4 Morgan fingerprints (MFPs), pretrained graph convolutional networks (GCNs), and MoLFormer embeddings on common DrugBank DDI splits and on an FDA drug-drug affinity benchmark. This design lets us isolate the effect of representation and of split design with minimal confounders. Across leak-proof DrugBank splits and the FDA task, MFPs with a shallow head match or surpass much more complex graph and knowledge graph systems while using far fewer parameters. On the Unseen DDI split, MFPs reach AUROC 99.4 and AUPR 98.4, slightly ahead of MoLFormer and the prior state of the art. When we impose a strict scaffold out-of-distribution split, a pretrained GCN leads with AUROC 73.99, which pinpoints when extra capacity helps. On standard splits the absolute gains from added complexity are small. We therefore argue that progress will come mainly from better curated datasets and from rigorous out-of-distribution evaluation, rather than from ever more elaborate models.
Title: Addressing model overcomplexity in drug-drug interaction prediction with molecular fingerprints. Journal of Cheminformatics (2026).
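The scaffold out-of-distribution split mentioned in the drug-drug interaction abstract rests on one rule: all molecules sharing a scaffold go to the same side of the split. The sketch below shows that grouping logic for single molecules, assuming scaffold labels are already available (for example, Bemis-Murcko scaffolds computed elsewhere). The scaffold labels are placeholders, the function is hypothetical, and handling of drug pairs that mix seen and unseen scaffolds is not covered.

```python
import random
from collections import defaultdict


def scaffold_split(items, scaffold_of, test_fraction=0.2, seed=0):
    """Split items so that no scaffold appears in both train and test."""
    groups = defaultdict(list)
    for item in items:
        groups[scaffold_of(item)].append(item)

    keys = sorted(groups)              # fixed order so the seed is reproducible
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(items)
    train, test = [], []
    for key in keys:
        # Fill the test set with whole scaffold groups until it is large enough.
        bucket = test if len(test) < target else train
        bucket.extend(groups[key])
    return train, test


# Placeholder molecules as (identifier, scaffold label) pairs.
molecules = [
    ("m1", "scafA"), ("m2", "scafA"), ("m3", "scafB"),
    ("m4", "scafC"), ("m5", "scafC"), ("m6", "scafD"),
]
train, test = scaffold_split(molecules, scaffold_of=lambda m: m[1], test_fraction=0.3)
print("train:", train)
print("test:", test)
```

Because whole groups are moved at once, the realized test fraction only approximates the requested one. That trade-off is usually accepted in exchange for a test set with no scaffold overlap.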
Pub Date: 2026-02-14
DOI: 10.1186/s13321-026-01161-1
Roxane Axel Jacob, Leo Gaskin, Thomas Seidel, Ya Chen, Angelica Mazzolari, Johannes Kirchmair
Predicting likely sites of metabolism (SOMs), i.e., the atoms in a molecule where metabolic reactions are initiated, is an important component of the computational development pipeline for pharmaceuticals, agrochemicals, and cosmetics. Among SOM prediction tools, FAME3, introduced in 2019, is one of only a few non-commercial models capable of predicting both Phase 1 and Phase 2 SOMs for a wide range of xenobiotics. However, its original implementation posed challenges in maintainability, scalability, and interoperability, which hindered broader adoption. To overcome these limitations, we developed FAME3R, an enhanced version of FAME3 designed to improve computational efficiency and facilitate integration with contemporary cheminformatics workflows. FAME3R introduces several new features, including a novel reliability assessment method based on Shannon entropy and the option to select among various featurization strategies. The tool is available as an open-source Python package, offering both a Python API and a CLI for flexible usage. Additionally, trained FAME3R models can be accessed via a GUI and a REST API hosted on the NERDD web platform.
Title: FAME3R: an efficient, practical and reliable open-source tool for predicting phase 1 and phase 2 sites of metabolism. Journal of Cheminformatics (2026). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13011438/pdf/
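The FAME3R abstract mentions a reliability assessment based on Shannon entropy but does not spell out the formula. As a generic illustration only, the sketch below scores a per-atom predicted probability by one minus its binary Shannon entropy, so confident predictions near 0 or 1 score high and predictions near 0.5 score low. This mapping and the function names are assumptions and may differ from what FAME3R actually implements.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli distribution with parameter p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p in (0.0, 1.0):
        return 0.0  # limit of p*log2(p) as p -> 0 is 0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def reliability(p):
    """Assumed reliability score: 1 for certain predictions, 0 at p = 0.5."""
    return 1.0 - binary_entropy(p)


# Hypothetical per-atom probabilities of being a site of metabolism.
for prob in (0.02, 0.5, 0.9):
    print(f"p={prob:.2f}  reliability={reliability(prob):.3f}")
```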
Pub Date: 2026-02-12
DOI: 10.1186/s13321-026-01165-x
Ines Smit, Melissa F. Adasme, Emma Manners, Sybilla Corbett, Nicolas Bosc, Hoang-My-Anh Do, Andrew R. Leach, Noel M. O’Boyle, Barbara Zdrazil
As the volume and diversity of bioactivity data in ChEMBL continues to grow, ensuring that assay metadata is standardized, interoperable, and machine-readable is critical for effective use in cheminformatics and ML applications. In this work, we present recent efforts to enhance the quality and granularity of bioassay annotations in ChEMBL through a combination of manual and semi-manual curation and AI-driven approaches. We introduce a “perfect assay description” template to guide consistent annotation and demonstrate how natural language processing techniques and multi-class classification can be used to automatically extract key assay parameters and assign broad assay categories for legacy data. We report on the development, validation, and application of a spaCy-based NER model that identifies experimental methods with high precision and recall, as well as a complementary classification model that refines ASSAY_TYPE categorization beyond the existing schema. In addition, we describe improvements to metadata extraction for ADME endpoints, organism and protein variant annotations, and ontology linking using tools such as text2term. Together, these enhancements significantly advance the FAIRness of ChEMBL’s bioassay data, enabling more robust downstream analyses and more precise compound-target activity modeling.
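The ChEMBL abstract reports that its NER model identifies experimental methods "with high precision and recall". For entity extraction, these metrics are commonly computed over exact span matches. The sketch below shows that calculation for spans given as (start, end, label) tuples. The label name METHOD, the example sentence, and the predicted spans are invented for illustration and are not taken from ChEMBL or from the authors' model.

```python
def precision_recall(predicted, gold):
    """Entity-level precision and recall using exact (start, end, label) matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def span_of(text, phrase, label):
    """Locate `phrase` in `text` and return its character span with a label."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Invented assay description with two method mentions.
text = "Inhibition measured by fluorescence polarization and confirmed by LC-MS/MS"
gold = [
    span_of(text, "fluorescence polarization", "METHOD"),
    span_of(text, "LC-MS/MS", "METHOD"),
]
# A hypothetical model output: one correct span and one spurious span.
predicted = [
    span_of(text, "fluorescence polarization", "METHOD"),
    span_of(text, "Inhibition", "METHOD"),
]

print(precision_recall(predicted, gold))
```

Exact-match scoring is strict: a span that is off by one character counts as both a false positive and a false negative, so partial-overlap schemes would give higher numbers on the same output.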