In the drug discovery process, where experiments can be costly and time-consuming, computational models that predict drug-target interactions are valuable tools to accelerate the development of new therapeutic agents. Estimating the uncertainty inherent in these neural network predictions provides valuable information that facilitates optimal decision-making when risk assessment is crucial. However, such models can be poorly calibrated, which results in unreliable uncertainty estimates that do not reflect the true predictive uncertainty. In this study, we compare different metrics, including accuracy and calibration scores, used for model hyperparameter tuning to investigate which model selection strategy achieves well-calibrated models. Furthermore, we propose to use a computationally efficient Bayesian uncertainty estimation method named HMC Bayesian Last Layer (HBLL), which generates Hamiltonian Monte Carlo (HMC) trajectories to obtain samples of the parameters of a Bayesian logistic regression fitted to the hidden layer of the baseline neural network. We report that this approach improves model calibration and matches the performance of common uncertainty quantification methods by combining the benefits of uncertainty estimation and probability calibration methods. Finally, we show that combining a post hoc calibration method with well-performing uncertainty quantification approaches can boost model accuracy and calibration.
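As a minimal illustration of the Bayesian-last-layer idea (a sketch under our own assumptions, not the authors' implementation), the snippet below fits a Bayesian logistic regression to frozen penultimate-layer features and samples its parameters with Pyro's HMC kernel; the priors, step size, and placeholder data are assumptions.

```python
# Minimal sketch (assumption, not the paper's code): Bayesian logistic regression
# over frozen hidden-layer features, with parameters sampled by Hamiltonian Monte Carlo.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, HMC

def bayesian_last_layer(features, labels=None):
    # features: (N, D) activations of the baseline network's last hidden layer
    d = features.shape[1]
    w = pyro.sample("w", dist.Normal(torch.zeros(d), torch.ones(d)).to_event(1))
    b = pyro.sample("b", dist.Normal(0.0, 1.0))
    logits = features @ w + b
    with pyro.plate("data", features.shape[0]):
        pyro.sample("obs", dist.Bernoulli(logits=logits), obs=labels)

# hidden_feats and y would come from a forward pass of the frozen base model;
# random placeholders are used here to keep the sketch self-contained.
hidden_feats = torch.randn(128, 32)
y = torch.randint(0, 2, (128,)).float()

kernel = HMC(bayesian_last_layer, step_size=0.01, num_steps=25)
mcmc = MCMC(kernel, num_samples=500, warmup_steps=200)
mcmc.run(hidden_feats, y)
posterior = mcmc.get_samples()  # weight samples for posterior-predictive averaging
```

Averaging the predicted class probabilities over the posterior samples then yields the calibrated predictive distribution that the abstract refers to.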
Molecular optimization is a crucial step in drug development, involving structural modifications to improve the desired properties of drug candidates. Although many deep-learning-based molecular optimization algorithms have been proposed and may perform well on benchmarks, they usually do not pay sufficient attention to the synthesizability of molecules, resulting in optimized compounds that are difficult to synthesize. To address this issue, we first developed a general pipeline capable of constructing a functional reaction template library specific to any property for which a predictive model can be built. Based on these functional templates, we introduced Syn-MolOpt, a synthesis planning-oriented molecular optimization method. During optimization, functional reaction templates steer the process towards specific properties by effectively transforming relevant structural fragments. In four diverse tasks, including two toxicity-related (GSK3β-Mutagenicity and GSK3β-hERG) and two metabolism-related (GSK3β-CYP3A4 and GSK3β-CYP2C19) multi-property molecular optimizations, Syn-MolOpt outperformed three benchmark models (Modof, HierG2G, and SynNet), highlighting its efficacy and adaptability. Additionally, visualization of the synthetic routes for molecules optimized by Syn-MolOpt confirms the effectiveness of functional reaction templates in molecular optimization. Notably, Syn-MolOpt’s robust performance in scenarios with limited scoring accuracy demonstrates its potential for real-world molecular optimization applications. By considering both optimization and synthesizability, Syn-MolOpt promises to be a valuable tool in molecular optimization.
Scientific contribution: Syn-MolOpt takes into account both molecular optimization and synthesis, allowing for the design of functional reaction template libraries specific to the properties to be optimized, and providing reference synthesis routes for the optimized compounds while optimizing the targeted properties. Syn-MolOpt’s universal workflow makes it suitable for various types of molecular optimization tasks.
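For illustration only, the sketch below shows how a single reaction template expressed as reaction SMARTS transforms a structural fragment with RDKit; the template shown (aromatic nitro reduction to an aniline) is a generic example we chose ourselves, not one drawn from Syn-MolOpt's property-specific libraries.

```python
# Illustrative sketch (assumption): applying one reaction template with RDKit.
# Syn-MolOpt's actual functional templates are property-specific and mined by
# its own pipeline; this generic template only demonstrates the mechanism.
from rdkit import Chem
from rdkit.Chem import AllChem

# Generic template: reduce an aromatic nitro group to an aniline
template = AllChem.ReactionFromSmarts("[c:1][N+](=O)[O-]>>[c:1][NH2]")

mol = Chem.MolFromSmiles("O=[N+]([O-])c1ccc(Cl)cc1")  # 4-chloronitrobenzene
products = template.RunReactants((mol,))

for prod_set in products:
    for prod in prod_set:
        Chem.SanitizeMol(prod)
        print(Chem.MolToSmiles(prod))  # -> Nc1ccc(Cl)cc1
```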
Computer-aided drug design has the potential to significantly reduce the astronomical costs of drug development, and molecular docking plays a prominent role in this process. Molecular docking is an in silico technique that predicts the bound 3D conformations of two molecules, a necessary step for other structure-based methods. Here, we describe version 1.3 of the open-source molecular docking software Gnina. This release updates the underlying deep learning framework to PyTorch, resulting in more computationally efficient docking and paving the way for seamless integration of other deep learning methods into the docking pipeline. We retrain our CNN scoring functions on the updated CrossDocked2020 v1.3 dataset and introduce knowledge-distilled CNN scoring functions to facilitate high-throughput virtual screening with Gnina. Furthermore, we add functionality for covalent docking, where an atom of the ligand is covalently bound to an atom of the receptor. This update expands the scope of docking with Gnina and further positions Gnina as a user-friendly, open-source molecular docking framework. Gnina is available at https://github.com/gnina/gnina.
Scientific contributions: GNINA 1.3 is an open-source molecular docking tool with enhanced support for covalent docking and updated deep learning models for more effective docking and screening.
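As a usage illustration, a standard (non-covalent) docking run can be driven from Python as sketched below. The flags shown follow the Vina/Smina-style interface that Gnina inherits; file names are placeholders, and the covalent-docking options added in 1.3 are documented in the repository rather than reproduced here.

```python
# Sketch of a standard Gnina docking call driven from Python (flag names follow
# the Vina/Smina-style CLI; file names are placeholders).
import subprocess

subprocess.run(
    [
        "gnina",
        "-r", "receptor.pdb",           # prepared receptor structure
        "-l", "ligand.sdf",             # ligand(s) to dock
        "--autobox_ligand", "ref.sdf",  # define the search box from a reference ligand
        "--cnn_scoring", "rescore",     # rescore poses with the CNN scoring function
        "-o", "docked.sdf",             # output poses
    ],
    check=True,
)
```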
We evaluate the impact of pretraining Graph Transformer architectures on atom-level quantum-mechanical features for the modeling of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of drug-like compounds. We compare this pretraining strategy with two others: one based on molecular quantum properties (specifically the HOMO-LUMO gap) and one using a self-supervised atom masking technique. After fine-tuning on Therapeutic Data Commons ADMET datasets, we evaluate the performance improvement of the different models, observing that models pretrained on atomic quantum-mechanical properties generally produce better results. We then analyze the latent representations and observe that the supervised strategies preserve the pretraining information after fine-tuning and that different pretrainings produce different trends in latent expressivity across layers. Furthermore, we find that models pretrained on atomic quantum-mechanical properties capture more low-frequency Laplacian eigenmodes of the input graph via the attention weights and produce better representations of atomic environments within the molecule. Applying the analysis to a much larger non-public dataset for microsomal clearance illustrates the generalizability of the studied indicators. In this case, model performance is consistent with the representation analysis and highlights, especially for masking pretraining and atom-level quantum property pretraining, how model types with similar performance on public benchmarks can perform differently on large-scale pharmaceutical data.
Scientific contribution
We systematically compared three data types/methodologies for pretraining a molecular Graphormer with the purpose of modeling ADMET properties as downstream tasks. The learned representations of the differently pretrained models were analyzed in addition to the comparison of downstream task performance typically reported in similar works. Such examination methodologies, including a newly introduced analysis of Graphormer’s Attention Rollout Matrix, can guide pretraining strategy selection, as corroborated by a performance evaluation on a larger internal dataset.
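To make the attention-based analysis concrete, a minimal sketch of the attention rollout computation is given below; we assume the standard Abnar & Zuidema formulation (head averaging, residual correction, and layer-wise multiplication), with the per-layer attention tensors taken from the pretrained Graphormer.

```python
# Minimal sketch of attention rollout (assumed standard formulation):
# average heads, add the residual connection, renormalize, and multiply
# attention matrices across layers.
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer tensors of shape (heads, N, N)."""
    n = attentions[0].shape[-1]
    rollout = torch.eye(n)
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=0)                 # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(n)        # account for the residual stream
        attn = attn / attn.sum(dim=-1, keepdim=True)  # renormalize rows
        rollout = attn @ rollout                      # propagate through layers
    return rollout  # (N, N): influence of each input atom on each output token
```

The resulting rollout matrix can then be compared against the low-frequency Laplacian eigenmodes of the molecular graph, as discussed in the abstract above.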
Retrosynthesis consists of recursively breaking down a target molecule to produce a synthesis route composed of readily accessible building blocks. In recent years, computer-aided synthesis planning methods have allowed a greater exploration of potential synthesis routes, combining state-of-the-art machine-learning methods with chemical knowledge. However, these methods are generally developed to produce individual routes from a single product to a set of proposed building blocks and are not designed to leverage potential shared paths between targets. These methods do not necessarily encompass real-world use cases in medicinal chemistry, where one seeks to synthesize sets of target compounds in a library mode, looking for maximal convergence into a shared retrosynthetic path going via advanced key intermediate compounds. Using a graph-based processing pipeline, we explore Johnson & Johnson Electronic Laboratory Notebooks (J&J ELN) and publicly available datasets to identify complex routes with multiple target molecules sharing common intermediates, producing convergent synthesis routes. We find that over 70% of all reactions are involved in convergent synthesis, covering over 80% of all projects in the case of J&J ELN data.
Scientific contribution
We introduce a novel planning approach to develop convergent synthesis routes, which can search multiple products and intermediates simultaneously guided by state-of-the-art machine learning single-step retrosynthesis models, enhancing the overall efficiency and practical applicability of retrosynthetic planning. We evaluate the multi-step synthesis planning approach using the extracted convergent routes and observe that solvability is generally high across those routes, being able to identify a convergent route for over 80% of the test routes and showing an individual compound solvability of over 90%. We find that by using a convergent search approach, we can synthesize almost 30% more compounds simultaneously for J&J ELN as compared to using an individual search, while providing an increased use of common intermediates.
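A toy sketch of the graph-based idea follows (our own illustration, not the paper's pipeline): represent reactions as edges of a directed graph and look for intermediates that appear in the retrosynthetic ancestry of more than one target.

```python
# Toy sketch (assumption, not the paper's pipeline): find intermediates shared
# by the synthesis graphs of several targets using networkx.
import networkx as nx

# Directed reaction graph: edge u -> v means "u is consumed to make v"
g = nx.DiGraph()
g.add_edges_from([
    ("A", "I1"), ("B", "I1"),   # building blocks A, B give key intermediate I1
    ("I1", "T1"), ("C", "T1"),  # I1 is used for target T1 ...
    ("I1", "T2"), ("D", "T2"),  # ... and reused for target T2 (convergence)
])

targets = ["T1", "T2"]
ancestry = {t: nx.ancestors(g, t) for t in targets}

shared = {
    node
    for i, t in enumerate(targets)
    for node in ancestry[t]
    if any(node in ancestry[u] for u in targets[i + 1:])
}
print(shared)  # {'I1', 'A', 'B'} -- candidates for common intermediates
```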
In this study, we propose a neural network-based approach to analyze IR spectra and detect the presence of functional groups. Our neural network architecture is based on the concept of learning split representations. We demonstrate that our method achieves favorable validation performance using the NIST dataset. Furthermore, by incorporating additional data from the open-access research data repository Chemotion, we show that our model improves the classification performance for nitriles and amides.
Scientific contribution: Our method exclusively uses IR data as input for a neural network, making its performance, unlike other well-performing models, independent of additional data types obtained from analytical measurements. Furthermore, our proposed method leverages a deep learning model that outperforms previous approaches, achieving F1 scores above 0.7 to identify 17 functional groups. By incorporating real-world data from various laboratories, we demonstrate how open-access, specialized research data repositories can serve as yet unexplored, valuable benchmark datasets for future machine learning research.
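For orientation, the task can be framed as multi-label classification over functional groups; the sketch below is a generic baseline under our own assumptions (spectral binning, layer sizes), not the split-representation architecture described above.

```python
# Generic multi-label baseline (assumption; not the paper's split-representation
# architecture): map a binned IR spectrum to 17 functional-group logits.
import torch
import torch.nn as nn

N_BINS = 1800    # assumed number of spectral bins after interpolation
N_GROUPS = 17    # functional groups to detect

model = nn.Sequential(
    nn.Linear(N_BINS, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, N_GROUPS),      # one logit per functional group
)
loss_fn = nn.BCEWithLogitsLoss()   # independent presence/absence per group

spectra = torch.rand(8, N_BINS)    # placeholder batch of spectra
labels = torch.randint(0, 2, (8, N_GROUPS)).float()
loss = loss_fn(model(spectra), labels)
loss.backward()
```

Per-group F1 scores, as reported in the contribution statement, are then computed by thresholding the sigmoid of each logit.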
Machine learning is quickly becoming integral to drug discovery pipelines, particularly quantitative structure-activity relationship (QSAR) and absorption, distribution, metabolism, and excretion (ADME) tasks. Graph Convolutional Network (GCN) models have proven especially promising due to their inherent ability to model molecular structures using graph-based representations. However, maximizing the potential of such models in practice is challenging, as companies prioritize data privacy and security over collaboration initiatives to improve model performance and robustness. kMoL is an open-source machine learning library with integrated federated learning capabilities developed to address such challenges. Its key features include state-of-the-art model architectures, Bayesian optimization, explainability, and federated learning mechanisms. It demonstrates extensive customization possibilities, advanced security features, straightforward implementation of user-specific models, and high adaptability to custom datasets without additional programming requirements. kMoL is evaluated through locally trained benchmark settings and distributed federated learning experiments using various datasets to assess the features and flexibility of the library, as well as the ability to facilitate fast and practical experimentation. Additionally, results of these experiments provide further insights into the performance trade-offs associated with federated learning strategies, presenting valuable guidance for deploying machine learning models in a privacy-preserving manner within drug discovery pipelines. kMoL is available on GitHub at https://github.com/elix-tech/kmol.
Scientific contribution: The primary scientific contribution of this research project is the introduction and evaluation of kMoL, an open-source machine learning library with integrated federated learning capabilities. By demonstrating advanced customization and security capabilities without additional programming requirements, kMoL represents an accessible yet secure open-source platform for collaborative drug discovery projects. Additionally, the experiment results provide further insights into the performance trade-offs associated with federated learning strategies, presenting valuable guidance for deploying machine learning models in a privacy-preserving manner within drug discovery pipelines.
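For readers unfamiliar with the federated setup, a bare-bones federated averaging (FedAvg) round is sketched below; this is a generic illustration of the training scheme, not kMoL's API, and all names are our own.

```python
# Bare-bones FedAvg round (generic illustration, not kMoL's API): each client
# trains locally on its private data, then the server averages the parameters.
# Assumes all parameters and buffers of the model are floating point.
import copy
import torch

def local_update(model, loader, epochs=1, lr=1e-3):
    model = copy.deepcopy(model)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            opt.step()
    return model.state_dict()

def federated_average(global_model, client_loaders):
    # One communication round: local training on each client, then averaging.
    states = [local_update(global_model, dl) for dl in client_loaders]
    avg = {k: torch.stack([s[k] for s in states]).mean(dim=0) for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```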
With the cost/yield-ratio of drug development becoming increasingly unfavourable, recent work has explored machine learning to accelerate early stages of the development process. Given the current success of deep generative models across domains, we here investigated their application to the property-based proposal of new small molecules for drug development. Specifically, we trained a latent diffusion model—DrugDiff—paired with predictor guidance to generate novel compounds with a variety of desired molecular properties. The architecture was designed to be highly flexible and easily adaptable to future scenarios. Our experiments showed successful generation of unique, diverse and novel small molecules with targeted properties. The code is available at https://github.com/MarieOestreich/DrugDiff.
This work expands the use of generative modelling in the field of drug development from previously introduced models for proteins and RNA to the application to small molecules presented here. Since small molecules make up the majority of drugs but are difficult to model due to their elaborate chemical rules, this work tackles a new level of difficulty compared to sequence-based molecule generation for proteins and RNA. Additionally, the demonstrated framework is highly flexible, allowing considered molecular properties to be added or removed without retraining the model, which makes it highly adaptable to diverse research settings, and it shows compelling performance for a wide variety of targeted molecular properties.
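A schematic of the predictor-guidance step is sketched below as our own simplification (DrugDiff's actual sampler may differ): at each reverse-diffusion step, the gradient of a property predictor nudges the latent toward the desired property value. The `denoiser` and `predictor` callables are hypothetical stand-ins for the trained components.

```python
# Simplified sketch of predictor guidance in latent space (assumption; the
# actual DrugDiff sampler may differ). `denoiser` and `predictor` are
# hypothetical differentiable modules standing in for the trained components.
import torch

def guided_step(denoiser, predictor, z, t, target, scale=1.0):
    z = z.detach().requires_grad_(True)
    z_denoised = denoiser(z, t)                 # one reverse-diffusion step
    prop = predictor(z_denoised)                # predicted molecular property
    loss = ((prop - target) ** 2).sum()         # distance to the desired value
    grad = torch.autograd.grad(loss, z)[0]
    return (z_denoised - scale * grad).detach() # guided latent for the next step
```

Because the guidance signal comes from external predictors rather than the diffusion model itself, properties can be added or removed without retraining the generator, which is the flexibility highlighted above.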
In recent years, the integration of machine learning techniques into chemical reaction product prediction has opened new avenues for understanding and predicting the behaviour of chemical substances. The necessity for such predictive methods stems from the growing regulatory and social awareness of the environmental consequences associated with the persistence and accumulation of chemical residues. Traditional biodegradation prediction methods rely on expert knowledge to perform predictions. However, creating this expert knowledge is becoming increasingly prohibitive due to the complexity and diversity of newer datasets, leaving existing methods unable to perform predictions on these datasets. We formulate the product prediction problem as a sequence-to-sequence generation task and take inspiration from natural language processing and other reaction prediction tasks. In doing so, we reduce the need for the expensive manual creation of expert-based rules.
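To ground the sequence-to-sequence framing, a minimal sketch is given below under our own assumptions (vocabulary size, tokenization, model dimensions): reactant SMILES are translated into product SMILES with a standard encoder-decoder Transformer.

```python
# Minimal sketch (assumption): framing product prediction as SMILES-to-SMILES
# translation with a standard encoder-decoder Transformer in PyTorch.
import torch
import torch.nn as nn

VOCAB = 64       # assumed size of the SMILES token vocabulary
D_MODEL = 256

class ReactionSeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens: tokenized reactant SMILES; tgt_tokens: shifted product SMILES
        h = self.transformer(self.embed(src_tokens), self.embed(tgt_tokens))
        return self.out(h)  # next-token logits over the SMILES vocabulary

model = ReactionSeq2Seq()
logits = model(torch.randint(0, VOCAB, (2, 40)), torch.randint(0, VOCAB, (2, 38)))
```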