
Journal of Cheminformatics: Latest Articles

Crossover operators for molecular graphs with an application to virtual drug screening
IF 8.6, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY. Pub Date: 2025-06-17. DOI: 10.1186/s13321-025-00958-w
Nico Domschke, Bruno J. Schmidt, Thomas Gatter, Richard Golnik, Paul Eisenhuth, Fabian Liessmann, Jens Meiler, Peter F. Stadler
Genetic algorithms, which rely in particular on recombining parts of previous solutions, are a powerful method for solving optimization problems with complex cost functions over vast search spaces. Crossover operators play a crucial role in this context. Here, we describe a large class of these operators designed for searching over spaces of graphs. These operators are based on introducing small cuts into graphs and rejoining the resulting induced subgraphs of two parents. This form of cut-and-join crossover can be restricted in a consistent way to preserve local properties such as vertex degrees (valency) or bond orders, as well as global properties such as graph-theoretic planarity. In contrast to crossover on strings, cut-and-join crossover on graphs is powerful enough to ergodically explore chemical space even in the absence of mutation operators. Extensive benchmarking shows that the offspring of molecular graphs are again plausible molecules with high probability, while at the same time crossover drastically increases the diversity compared to initial molecule libraries. Moreover, desirable properties such as favorable indices of synthesizability are preserved with sufficient frequency that candidate offspring can be filtered efficiently for such properties. As an application, we utilized the cut-and-join crossover in REvoLd, a GA-based system for computer-aided drug design (CADD). In optimization runs searching for ligands binding to four different target proteins, we consistently found candidate molecules with binding constants exceeding those of the best known binders as well as of candidates found in make-on-demand libraries. Scientific contribution: We define cut-and-join crossover operators on a variety of graph classes including molecular graphs. This constitutes a mathematically simple and well-characterized approach to recombination of molecules that performed very well in real-life CADD tasks.
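The cut-and-join idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain undirected simple graphs stored as adjacency dictionaries, ignores atom types and bond orders, and the function names (`induced`, `dangling`, `cut_and_join`) are hypothetical. It only shows how re-pairing cut edges between two induced subgraphs keeps every vertex degree unchanged.

```python
import random


def induced(adj, keep):
    """Induced subgraph of `adj` (dict: vertex -> set of neighbours) on `keep`."""
    return {v: {u for u in adj[v] if u in keep} for v in keep}


def dangling(adj, keep):
    """One entry per cut edge: the endpoint that lies inside `keep`."""
    return [v for v in sorted(keep) for u in adj[v] if u not in keep]


def cut_and_join(adj_a, keep_a, adj_b, keep_b, rng):
    """Join the kept parts of two parents by re-pairing their cut edges.

    Returns None when the two cuts differ in size or a pairing would need a
    double edge, since vertex degrees could then not be preserved.
    """
    stubs_a, stubs_b = dangling(adj_a, keep_a), dangling(adj_b, keep_b)
    if len(stubs_a) != len(stubs_b):
        return None
    # Prefix vertex names with the parent so the two fragments stay disjoint.
    child = {("a", v): {("a", u) for u in nb}
             for v, nb in induced(adj_a, keep_a).items()}
    child.update({("b", v): {("b", u) for u in nb}
                  for v, nb in induced(adj_b, keep_b).items()})
    rng.shuffle(stubs_b)
    for va, vb in zip(stubs_a, stubs_b):
        if ("b", vb) in child[("a", va)]:
            return None
        child[("a", va)].add(("b", vb))
        child[("b", vb)].add(("a", va))
    return child


# Parent A: chain 0-1-2-3. Parent B: triangle 0-1-2 with a tail 2-3.
parent_a = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
parent_b = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
child = cut_and_join(parent_a, {0, 1}, parent_b, {0, 1, 2}, random.Random(0))
```

In this toy case each parent is cut at a single edge, so the child consists of the chain fragment attached to the triangle, and every retained vertex has the same degree it had in its parent.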
Citations: 0
Advancements in thermochemical predictions: a multi-output thermodynamics-informed neural network approach
IF 8.6, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY. Pub Date: 2025-06-16. DOI: 10.1186/s13321-025-01033-0
Raheel Hammad, Sownyak Mondal
The Gibbs free energy of an inorganic material represents its maximum reversible work potential under constant temperature and pressure. Its calculation is crucial for understanding material stability, phase transitions, and chemical reactions, thus guiding optimization for diverse applications like catalysis and energy storage. In this study, we have developed a Physics-Informed Neural Network model that leverages the Gibbs free energy equation. The overall loss function is adjusted to allow the model to simultaneously predict all three thermodynamic quantities (Gibbs free energy, total energy, and entropy), thus transforming it into a multi-output model. In recent literature, there is a growing emphasis on evaluating machine learning models under challenging conditions, such as small datasets and out-of-distribution predictions. Reflecting this trend, we have rigorously benchmarked our model across these scenarios, demonstrating its robustness and adaptability. Compared to the next-best model, our model shows a 43% improvement in the normal scenario and an even larger one in the out-of-distribution regime. Scientific contribution: This study introduces the application of a Physics-Informed Neural Network to simultaneously compute multiple thermodynamic properties, including Gibbs free energy, total energy, and entropy. By integrating the Gibbs free energy equation into the loss function, the model achieves superior accuracy in low-data regimes and enhances robustness in out-of-distribution scenarios.
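To make the idea of a thermodynamics-informed, multi-output loss concrete, here is a minimal sketch. It assumes the physical constraint takes the form G = E - T*S, with E standing for the total energy term; the abstract does not give the exact equation or weighting, so the function name `thermo_informed_loss`, the `weight` parameter, and the toy numbers are all illustrative assumptions rather than the authors' formulation.

```python
def thermo_informed_loss(pred, target, temperature, weight=1.0):
    """Data loss on G, E and S plus a penalty for violating G = E - T*S.

    `pred` and `target` are dicts with keys "G", "E", "S", each a list of
    floats aligned with the list `temperature`.
    """
    n = len(temperature)

    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / n

    # Supervised term: one mean squared error per predicted quantity.
    data = sum(mse(pred[k], target[k]) for k in ("G", "E", "S"))
    # Physics term: the three predictions should satisfy the assumed relation.
    residual = [g - (e - t * s)
                for g, e, s, t in zip(pred["G"], pred["E"], pred["S"], temperature)]
    physics = sum(r * r for r in residual) / n
    return data + weight * physics


# Toy values (made up) that satisfy G = E - T*S exactly.
temperature = [300.0, 600.0]
target = {"G": [-85.0, -160.0], "E": [-10.0, -10.0], "S": [0.25, 0.25]}
perfect = thermo_informed_loss(target, target, temperature)
# Shift one predicted G by 1.0: both the data and the physics term react.
shifted = {"G": [-84.0, -160.0], "E": [-10.0, -10.0], "S": [0.25, 0.25]}
perturbed = thermo_informed_loss(shifted, target, temperature)
```

The design point is that an inconsistent prediction is penalized twice: once for missing its label and once for breaking the relation between the three outputs, which is what couples the outputs during training.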
Citations: 0
NanoBinder: a machine learning assisted nanobody binding prediction tool using Rosetta energy scores
IF 8.6, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY. Pub Date: 2025-06-16. DOI: 10.1186/s13321-025-01040-1
Palistha Shrestha, Chandana S. Talwar, Jeevan Kandel, Kwang-Hyun Park, Kil To Chong, Eui-Jeon Woo, Hilal Tayara
Nanobodies offer significant therapeutic potential due to their small size, stability, and versatility. Although advancements in computational protein design have made designing de novo nanobodies increasingly feasible, there are limited tools specifically tailored for this purpose. Rosetta, with its specialized protocols, is a prominent tool for nanobody design but is limited by a high false-negative rate, necessitating extensive high-throughput screening. This results in increased costs, time, and labor due to the need for large-scale experimentation and detailed structural analysis. To address current challenges in nanobody design, we introduce NanoBinder, an interpretable machine learning model that predicts nanobody-antigen binding using Rosetta energy scores. NanoBinder utilizes a Random Forest model trained on experimentally validated complexes and can be seamlessly integrated into the Rosetta software. It employs SHAP summary plots for interpretability, which helps identify key features influencing binding interactions. Experimentally validated on forty-nine diverse nanobodies, NanoBinder accurately predicts non-binders and shows reasonable performance in identifying binders. This approach significantly enhances predictive accuracy, reduces the need for extensive experimental assays, and accelerates nanobody development, thereby offering a powerful tool to mitigate the costs, time, and labor associated with high-throughput screening. Scientific contribution: This study introduces NanoBinder, a machine learning framework for predicting nanobody-antigen binding using Rosetta-derived energy features. Through rigorous experimental validation across diverse nanobody sets, NanoBinder enhances nanobody screening workflows by reducing false positives and minimizing reliance on extensive wet-lab assays. The approach bridges the gap between physics-based modeling and data-driven prediction in nanobody design.
Citations: 0
UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines*
IF 8.6, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY. Pub Date: 2025-06-10. DOI: 10.1186/s13321-025-01039-8
Qianrong Guo, Saiveth Hernandez-Hernandez, Pedro J. Ballester
Virtual Screening (VS) of large compound libraries using Artificial Intelligence (AI) models is a highly effective approach for early drug discovery. Data splitting is crucial for benchmarking the performance of such AI models. Traditional random data splits often result in structurally similar molecules in both training and test sets, which conflicts with the reality of VS libraries that typically contain structurally diverse compounds. To tackle this challenge, scaffold split, which groups molecules by shared core structure, and Butina clustering, which clusters molecules by chemotypes, have long been used. However, we show that these methods still introduce high similarities between training and test sets, leading to overestimated model performance. Our study examined four representative AI models across 60 NCI-60 datasets, each comprising approximately 33,000–54,000 molecules tested on different cancer cell lines. Each dataset was split in four ways: random, scaffold, Butina clustering and the more realistic Uniform Manifold Approximation and Projection (UMAP) clustering. Using Linear Regression, Random Forest, Transformer-CNN, and GEM, we trained a total of 8400 models and evaluated them under the four splitting methods. These comprehensive results show that the UMAP split provides more challenging and realistic benchmarks for model evaluation, followed by Butina splits, then scaffold splits, with random splits close behind. Consequently, we recommend using UMAP splits instead of overly optimistic Butina splits and especially scaffold splits for molecular property prediction, including VS. Lastly, we illustrate how misaligned ROC AUC is with VS goals, despite its common use. The code and datasets for reproducibility are available at https://github.com/Rong830/UMAP_split_for_VS and archived at https://zenodo.org/records/14736486.
Scientific contribution: This work advances the field by introducing UMAP clustering as a robust splitting method for molecular datasets, improving over traditional methods like Butina clustering and especially scaffold splits. It offers a new evaluation framework to benchmark AI models under more realistic conditions, fostering progress in molecular property prediction. The findings also show how inappropriate the use of ROC AUC for virtual screening (VS) continues to be, despite its popularity, emphasizing the need for context-specific evaluation metrics.
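The common mechanism behind scaffold, Butina, and UMAP-based splits is that whole groups of molecules go to either the training or the test set. The sketch below shows only that group-assignment step. It assumes cluster labels have already been computed elsewhere (for example by clustering a low-dimensional embedding of molecular fingerprints); the function name `cluster_split`, its parameters, and the toy labels are hypothetical, and the authors' actual pipeline is the one in their linked repository.

```python
import random
from collections import defaultdict


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so that no cluster spans both.

    `cluster_ids[i]` is the cluster label of molecule i. Returns two sorted
    lists of molecule indices (train, test).
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    clusters = sorted(members)
    random.Random(seed).shuffle(clusters)
    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in clusters:
        # Fill the test set cluster by cluster until it reaches the target size.
        (test if len(test) < target else train).extend(members[cid])
    return sorted(train), sorted(test)


# Ten molecules in five clusters of two (toy labels).
labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
train_idx, test_idx = cluster_split(labels, test_fraction=0.2, seed=0)
```

Because assignment happens at the cluster level, the test set can overshoot the requested fraction by up to one cluster; with very uneven cluster sizes a size-aware assignment would be a reasonable refinement.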
Citations: 0
A 3D generation framework using diffusion model and reinforcement learning to generate multi-target compounds with desired properties
IF 8.6, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY. Pub Date: 2025-06-04. DOI: 10.1186/s13321-025-01035-y
Yongna Yuan, Xiaohang Pan, Xiaohong Li, Ruisheng Zhang, Wei Su
Deep generative models provide a powerful solution for the de novo design of molecules. However, the majority of existing methods only generate molecules for a single target. Generating molecules with biological activities against multiple specific targets and desired properties remains an extremely difficult challenge. In this study, we propose a novel 3D molecule generation framework based on reinforcement learning and diffusion model to generate molecules with predefined properties for given multiple targets. The proposed framework, MDRL, uses a diffusion model to understand the 3D chemical structure of molecules and employs Kolmogorov-Arnold Networks instead of Multilayer Perceptron to enhance model performance. Through reinforcement learning, the framework is able to generate molecules that simultaneously target two targets and further optimizes multiple molecular properties. Experimental results show that our model exhibits comparable performance to various state-of-the-art molecular generation models, and MDRL can effectively navigate chemical space to design polypharmacological compounds and control multiple molecular properties. In multiple case studies, we verify that the generated molecules can simultaneously target two targets through molecular docking and assess the model’s ability to control multiple molecular properties. The results in this study highlight the advantages and practicalities of our model in generating polypharmacological compounds with desired properties. This study introduces MDRL, a 3D molecular generation framework integrating diffusion models and reinforcement learning for joint optimization of multi-target binding and molecular properties. MDRL shows improvements over existing methods in controlling drug-relevant properties and enhancing multi-target affinity. 
Experimental results demonstrate that MDRL efficiently generates drug-like compounds with robust polypharmacological profiles, offering a novel strategy for multi-target drug design.
Citations: 0
RLSuccSite: succinylation sites prediction based on reinforcement learning dynamic with balanced reward mechanism and three-peaks enhanced method for physicochemical property scores
IF 8.6, Zone 2 (Chemistry), Q1 CHEMISTRY, MULTIDISCIPLINARY. Pub Date: 2025-06-02. DOI: 10.1186/s13321-025-01034-z
Lun Zhu, Qingchao Zhang, Sen Yang
Recent progress in computational biology has driven the development of machine learning models for predicting protein post-translational modification sites. However, challenges such as data imbalance and limited sequence-context representation continue to hinder prediction accuracy, particularly for less frequent modifications like succinylation. In this study, we propose RLSuccSite, a reinforcement learning-based framework specifically designed to predict succinylation sites by addressing the class imbalance issue via a dynamic with balanced reward mechanism. To enhance sequence feature representation, this study also introduces the Three-Peaks Enhanced Method for Physicochemical Property Scores (TPEM-PPS), a physicochemical property-driven feature extraction method that incorporates position-aware scoring to reflect amino acid contributions more effectively. The code and data of RLSuccSite can be obtained from https://github.com/Zhangqingchao-Ch/RLSuccSite.git. Scientific contribution: This study applies reinforcement learning to protein succinylation site prediction, introducing a dynamic with balanced reward mechanism that effectively addresses dataset imbalance. Additionally, this study proposes a novel Three-Peaks Enhanced Method for Physicochemical Scoring, which captures residue contributions with higher precision than traditional feature extraction techniques.
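One simple way to picture a class-balanced reward is to scale the reward for each sample inversely to the frequency of its true class. The sketch below shows a static version of that idea only. The abstract does not specify how the reward is computed or what makes it dynamic, so `balanced_rewards`, `reward`, and the toy label distribution are assumptions for illustration, not the RLSuccSite mechanism.

```python
from collections import Counter


def balanced_rewards(labels):
    """Per-class reward magnitude, inversely proportional to class frequency."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {c: total / (n_classes * k) for c, k in counts.items()}


def reward(action, label, weights):
    """Positive reward for a correct prediction, equal penalty otherwise."""
    return weights[label] if action == label else -weights[label]


# Toy data set: 10 positive sites (label 1) and 90 negatives (label 0).
labels = [1] * 10 + [0] * 90
weights = balanced_rewards(labels)
# An agent that always predicts the majority class gains nothing overall.
majority_return = sum(reward(0, y, weights) for y in labels)
```

With these weights a rare positive is worth nine times a negative, so the trivial always-negative policy earns a return of about zero instead of looking 90% correct.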
Citations: 0
Representation of chemistry transport models simulations using knowledge graphs
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-05-31 DOI: 10.1186/s13321-025-01025-0
Eduardo Illueca Fernández, Antonio Jesús Jara Valera, Jesualdo Tomás Fernández Breis
Persistent air pollution poses a serious threat to human health and is one of the action points that policy makers should monitor under Directive 2008/50/EC. While deploying a massive network of hyperlocal sensors could provide extensive monitoring, this approach cannot generate geospatially continuous data and presents several logistical challenges. Thus, developing accurate and trustworthy expert systems based on chemistry transport models is a key strategy for environmental protection. However, chemistry transport models suffer from a significant lack of standardization, and their formats are not interoperable between different systems, which limits their use by different stakeholders. In this context, semantic technologies provide methods and standards for scientific data and make information readable for expert systems. Therefore, this paper proposes a novel methodology for the ontology-driven transformation of simulations from CHIMERE, a chemistry transport model, which allows the generation of knowledge graphs representing air quality information. It enables the transformation of netCDF files into RDF triples for short-term air quality forecasting. Concretely, we utilize the Semantic Web Integration Tool (SWIT) framework for mapping individuals using an ontology as a template. A new ontology for CHIMERE has been defined in this work, reusing concepts from other state-of-the-art standards. Our approach demonstrates that RDF files can be created from netCDF in linear computational time, allowing scalability for expert systems. In addition, the ontology complies with the OQuaRE quality metrics and can be extended in the future to other chemistry transport models. The contributions are the development of the first ontology for a chemistry transport model and the FAIRification of physical models through the generation of knowledge graphs from netCDF files. The proposed ontology is published at PURL (https://purl.org/chimere-ontology) and the knowledge graph generated for a 72-h simulation can be accessed in the following repository: https://doi.org/10.5281/zenodo.13981544.
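To make the netCDF-to-RDF step concrete, here is a minimal sketch that turns a small gridded variable into N-Triples lines, visiting each cell once so the cost grows linearly with the number of cells. It is not the SWIT mapping or the CHIMERE ontology: the `http://example.org/chimere-demo/` namespace, the predicate names, and the in-memory nested list standing in for a netCDF variable are all placeholders chosen for illustration.

```python
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def grid_to_ntriples(variable, values, base="http://example.org/chimere-demo/"):
    """Yield two N-Triples lines per grid cell of a 2-D variable.

    `values` is a nested list standing in for a netCDF variable; `base` and
    the predicate names are hypothetical, not taken from the real ontology.
    """
    for i, row in enumerate(values):
        for j, v in enumerate(row):
            cell = f"<{base}cell/{variable}/{i}/{j}>"
            # Which pollutant or quantity this cell describes.
            yield f'{cell} <{base}variable> "{variable}" .'
            # The simulated value, typed as an xsd:double literal.
            yield f'{cell} <{base}value> "{v}"^^<{XSD_DOUBLE}> .'


# Toy 2x2 ozone field; a real run would read this from a netCDF file.
triples = list(grid_to_ntriples("O3", [[41.2, 39.8], [40.1, 38.5]]))
```

A production pipeline would read the arrays with a netCDF library and take class and property IRIs from the published ontology rather than from string templates.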
Citations: 0
Higher education in chemoinformatics: achievements and challenges
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-05-31 DOI: 10.1186/s13321-025-01036-x
Alexandre Varnek, Gilles Marcou, Dragos Horvath
While chemoinformatics is a well-established scientific field, its integration into university curricula is rarely discussed. In this work, we share our experience in developing a chemoinformatics curriculum at the University of Strasbourg and highlight the main challenges in higher education for this discipline.
Citations: 0
Equivariant diffusion for structure-based de novo ligand generation with latent-conditioning
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-05-31 DOI: 10.1186/s13321-025-01028-x
Tuan Le, Julian Cremer, Djork-Arné Clevert, Kristof T. Schütt
We introduce PoLiGenX, a novel generative model for de novo ligand design that employs latent-conditioned, target-aware equivariant diffusion. Our approach conditions the ligand generation process on reference molecules located within a specific protein pocket. By doing so, PoLiGenX generates shape-similar ligands that are adapted to the target pocket, enabling effective applications in target-aware hit expansion and hit optimization. Our experimental results underscore the efficacy of PoLiGenX in advancing ligand design. Notably, docking analyses reveal that the ligands generated by PoLiGenX show enhanced binding affinities relative to their reference molecules while retaining a similar molecular shape, and that they adopt better poses with lower strain energies and fewer steric clashes. Furthermore, the model promotes substantial chemical diversity, facilitating the exploration of broader and more varied chemical spaces. Importantly, the generated ligands were assessed for drug-likeness using Lipinski's rule of five, demonstrating superior adherence to drug-likeness criteria compared to the reference dataset. This work represents a step forward in the controlled and precise generation of therapeutically relevant de novo ligands tailored for specific protein targets, contributing to progress in computational drug discovery and ligand design. We present a latent-conditioning method within diffusion models to enable the controllable generation of ligands in structure-based drug design that are similar to a reference ligand. We show that the ligands generated via latent-conditioning achieve favorable poses with fewer steric clashes and lower strain energies than those from diffusion models that condition on the protein pocket alone. We demonstrate that ligand generation can be further constrained using an importance sampling algorithm with external surrogate models that account for molecular properties such as synthetic accessibility.
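For readers unfamiliar with the drug-likeness filter mentioned above, Lipinski's rule of five flags a molecule when its molecular weight exceeds 500 Da, its logP exceeds 5, it has more than 5 hydrogen-bond donors, or it has more than 10 hydrogen-bond acceptors. The sketch below counts violations from precomputed descriptors. The `Descriptors` class, the function names, and the common "at most one violation" threshold are choices made for this example rather than details from the paper, and the descriptor values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float  # Da
    logp: float        # octanol-water partition coefficient
    h_donors: int      # hydrogen-bond donors
    h_acceptors: int   # hydrogen-bond acceptors


def lipinski_violations(d):
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ])


def passes_rule_of_five(d, max_violations=1):
    """Common convention: tolerate at most one violation."""
    return lipinski_violations(d) <= max_violations


# Illustrative values: a small, roughly aspirin-like molecule and a large one.
small = Descriptors(mol_weight=180.16, logp=1.2, h_donors=1, h_acceptors=4)
large = Descriptors(mol_weight=650.0, logp=6.1, h_donors=2, h_acceptors=12)
```

In practice the descriptors would come from a cheminformatics toolkit; they are hard-coded here to keep the example self-contained.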
Citations: 0
Semi-supervised prediction of protein fitness for data-driven protein engineering
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-05-31 DOI: 10.1186/s13321-025-01029-w
Alicia Olivares-Gil, José A. Barbero-Aparicio, Juan J. Rodríguez, José F. Díez-Pastor, César García-Osorio, Mehdi D. Davari
Protein fitness prediction plays a crucial role in the advancement of protein engineering. However, the combinatorial complexity of the protein sequence space and the limited availability of assay-labelled data hinder the efficient optimization of protein properties. Data-driven strategies utilizing machine learning methods have emerged as a promising solution, yet their dependence on labelled training datasets poses a significant obstacle. To overcome this challenge, we explore various ways of introducing the latent information present in evolutionarily related sequences (homologous sequences) into the training process. To do so, we establish several strategies based on semi-supervised learning (unsupervised pre-processing and wrapper methods) and perform a comprehensive comparison using 19 datasets containing protein-fitness pairs. Our findings reveal that using the information present in the homologous sequences can improve the performance of the models, especially when the number of available labelled sequences is very low. Specifically, the combination of a sequence encoding method based on Direct Coupling Analysis (DCA) with MERGE (a hybrid regression framework that combines evolutionary information with supervised learning) and an SVM regressor outperforms other encodings (PAM250, UniRep, eUniRep) and other semi-supervised wrapper methods (Tri-Training Regressor, Co-Training Regressor). In summary, the demonstrated performance gains of this strategy mark a substantial step towards more robust and reliable predictive models for protein engineering tasks. This advancement has the potential to streamline the design and optimisation of proteins for diverse applications in biotechnology and therapeutics. We explore several semi-supervised learning strategies capable of incorporating the unlabelled sequences homologous to the protein of interest into the training process. Among them, we present two new methods to exploit the information in the homologous sequences: i) a new generalised version of MERGE capable of employing any regressor as a base estimator; ii) the Tri-Training Regressor method, an adaptation of the Tri-Training method for regression problems. We find that the information inherent in the homologous sequences can improve the predictive capacity of models when labelled sequences are scarce, especially when using the DCA encoding together with MERGE and an SVM regressor.
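The abstract names a Tri-Training Regressor but does not describe its internals. The sketch below shows only the general tri-training idea adapted to regression, under simplifying assumptions: three 1-D least-squares models are fitted on different subsets of the labelled data (a deterministic stand-in for bootstrap sampling), and an unlabelled point is pseudo-labelled for one model when the other two agree within a tolerance `tol`. All names and the agreement rule are hypothetical, not the authors' method.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept on 1-D inputs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


def tri_training_round(labelled, unlabelled, tol=0.5):
    """One round: fit three models, pseudo-label by pairwise agreement, refit."""
    # Model i is fitted without every third labelled point (offset i),
    # which stands in for the bootstrap resampling of classic tri-training.
    models = []
    for i in range(3):
        subset = [p for n, p in enumerate(labelled) if n % 3 != i]
        models.append(fit_line(*zip(*subset)))

    refined = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        extra = []
        for x in unlabelled:
            pj, pk = predict(models[j], x), predict(models[k], x)
            # Accept the pseudo-label only if the two peer models agree.
            if abs(pj - pk) <= tol:
                extra.append((x, (pj + pk) / 2))
        refined.append(fit_line(*zip(*(labelled + extra))))
    return refined


# Toy data lying on y = 2x, plus three unlabelled inputs.
labelled = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
unlabelled = [0.5, 1.5, 2.5]
models = tri_training_round(labelled, unlabelled)
```

In the protein setting the inputs would be sequence encodings and the unlabelled pool would be homologous sequences; a scalar input is used here purely to keep the example short.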
Citations: 0