Molecular Informatics: Latest Articles (Page 4)

Predicting the Price of Molecules Using Their Predicted Synthetic Pathways.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-02-01 DOI: 10.1002/minf.202400039

Massina Abderrahmane, Hamza Tajmouati, Vinicius Barros Ribeiro da Silva, Quentin Perron

Currently, numerous metrics allow chemists and computational chemists to refine and filter libraries of virtual molecules in order to prioritize their synthesis. Some of the most commonly used metrics and models are QSAR models, docking scores, diverse druggability metrics, and synthetic feasibility scores to name only a few. To our knowledge, among the known metrics, a function which estimates the price of a novel virtual molecule and which takes into account the availability and price of starting materials has not been considered before in literature. Being able to make such a prediction could improve and accelerate the decision-making process related to the cost-of-goods. Taking advantage of recent advances in the field of Computer Aided Synthetic Planning (CASP), we decided to investigate if the predicted retrosynthetic pathways of a given molecule and the prices of its associated starting materials could be good features to predict the price of that compound. In this work, we present a deep learning model, RetroPriceNet, that predicts the price of molecules using their predicted synthetic pathways. On a holdout test set, the model achieves better performance than the state-of-the-art model. The developed approach takes into account the synthetic feasibility of molecules and the availability and prices of the starting materials.
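The idea of turning a predicted route and its starting-material prices into model inputs can be sketched as below. This is an illustration only: the nested-dictionary route layout, the `route_features` helper, the feature names, and the prices are all hypothetical and are not taken from RetroPriceNet, whose actual inputs the abstract does not describe.

```python
from typing import Any, Dict, List

# Hypothetical route layout: every node has a name; purchasable starting
# materials carry a "price", intermediates carry a list of "precursors".
Route = Dict[str, Any]


def route_features(route: Route) -> Dict[str, float]:
    """Summarize a predicted retrosynthetic tree as simple numeric features."""
    leaf_prices: List[float] = []
    max_depth = 0

    def walk(node: Route, depth: int) -> None:
        nonlocal max_depth
        children = node.get("precursors", [])
        if not children:  # a leaf is a purchasable starting material
            leaf_prices.append(float(node["price"]))
            max_depth = max(max_depth, depth)
            return
        for child in children:
            walk(child, depth + 1)

    walk(route, 0)
    return {
        "n_starting_materials": float(len(leaf_prices)),
        "total_starting_price": sum(leaf_prices),
        "max_starting_price": max(leaf_prices),
        "longest_linear_sequence": float(max_depth),
    }


# Made-up route: two steps, three starting materials.
example_route = {
    "name": "target",
    "precursors": [
        {"name": "intermediate", "precursors": [
            {"name": "sm_a", "price": 12.0},
            {"name": "sm_b", "price": 3.5},
        ]},
        {"name": "sm_c", "price": 40.0},
    ],
}
features = route_features(example_route)
```

Features of this kind could feed any regressor; the abstract states only that a deep learning model is used.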


Citations: 0

Prediction of the Appropriate Temperature and Pressure for Polymer Dissolution Using Machine Learning Models.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-02-01 DOI: 10.1002/minf.202400193

Dorsa Dadashi, Marjan Kaedi, Parsa Dadashi, Suprakas Sinha Ray

The widespread use of polymer solutions in the chemical industry poses a significant challenge in determining optimal dissolution conditions. Traditionally, researchers have relied on experimental methods to estimate the processing parameters needed to dissolve polymers, often requiring numerous iterations of testing different temperatures and pressures. This approach is both costly and time-consuming. In this study, for the first time, we present a machine learning-based approach to predict the minimum temperature and pressure required for polymer dissolution, correlating molecular weight and chemical structure of both the polymer and solvent and its weight percent. Using a dataset compiled from existing literature, which includes key factors influencing polymer dissolution, we also extracted chemical bond information from the molecular structures of polymer-solvent systems. Six different machine learning algorithms, including linear regression, k-nearest neighbors, regression trees, random forests, multilayer perceptron neural networks, and support vector regression, were employed to develop predictive models. Among these, the Random Forest model achieved the highest accuracy, with R² values of 0.931 and 0.942 for temperature and pressure predictions, respectively. This novel approach eliminates the need for repetitive experimental testing, offering a more efficient pathway to determining dissolution conditions.
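For readers unfamiliar with the reported metric, the coefficient of determination can be computed as in the short sketch below. The function name `r_squared` and the temperature values are illustrative assumptions; they are not data from the study.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up dissolution temperatures (degrees C), for illustration only.
measured = [100.0, 120.0, 140.0, 160.0]
predicted = [105.0, 118.0, 138.0, 163.0]
score = r_squared(measured, predicted)  # 1 - 42/2000 = 0.979
```

A value near 1 means the predictions explain most of the variance in the measured values, which is the sense in which 0.931 and 0.942 are reported above.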


Citations: 0

KNIME Workflows for Chemoinformatic Characterization of Chemical Databases.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-02-01 DOI: 10.1002/minf.202400337

Carlos D Ramírez-Márquez, José L Medina-Franco

In chemoinformatics, chemical databases are of great importance because their main objective is to store and organize the chemical structures of molecules and their properties, from basic information such as chemical structure to more complex data such as molecular fingerprints, other calculated or experimental descriptors, and biological activity. However, these data can only be used in projects that aim to identify novel therapeutic molecules, or in other fields, if they are correctly characterized and analyzed. In this Application Note, we compiled five workflows within the open-source data analytics and visualization platform KNIME that can be implemented for the chemoinformatic characterization of databases. To illustrate the application of the workflows, we used BIOFACQUIM, a compound database of natural products isolated and characterized in Mexico [1].


Citations: 0

Exploration of the Global Minimum and Conical Intersection with Bayesian Optimization.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-02-01 DOI: 10.1002/minf.202400041

Riho Somaki, Taichi Inagaki, Miho Hatanaka

Conventional molecular geometry searches on a potential energy surface (PES) utilize energy gradients from quantum chemical calculations. However, replacing energy calculations with noisy quantum computer measurements generates errors in the energies, which makes geometry optimization using the energy gradient difficult. One gradient-free optimization method that can potentially solve this problem is Bayesian optimization (BO). To use BO in geometry search, an acquisition function (AF), which involves an objective variable, must be defined suitably. In this study, we propose a strategy for geometry searches using BO and examine the appropriate AFs to explore two critical structures: the global minimum (GM) on the singlet ground state (S₀) and the most stable conical intersection (CI) point between S₀ and the singlet excited state. We applied our strategy to two molecules and located the GM and the most stable CI geometries with high accuracy for both molecules. We also succeeded in the geometry searches even when artificial random noises were added to the energies to simulate geometry optimization using noisy quantum computer measurements.
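The role of an acquisition function can be illustrated with expected improvement (EI), a widely used choice for minimization. The abstract does not say which acquisition functions the authors examined, so EI here is an assumption made for illustration, and the geometry labels and predicted energies are invented.

```python
import math
from typing import Dict, Tuple


def expected_improvement(mean: float, std: float, best: float) -> float:
    """Expected improvement over `best` for minimization, given a Gaussian
    surrogate prediction with the stated mean and standard deviation."""
    if std <= 0.0:
        return max(best - mean, 0.0)
    z = (best - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mean) * cdf + std * pdf


# Hypothetical surrogate predictions (mean energy, uncertainty) for three
# candidate geometries; the lowest energy evaluated so far is -1.1.
candidates: Dict[str, Tuple[float, float]] = {
    "geom_a": (-1.0, 0.05),
    "geom_b": (-0.9, 0.50),
    "geom_c": (-1.2, 0.00),
}
best_so_far = -1.1
scores = {name: expected_improvement(m, s, best_so_far)
          for name, (m, s) in candidates.items()}
next_geometry = max(scores, key=scores.get)
```

No energy gradient appears anywhere in this selection step, which is why such a scheme tolerates noisy energy evaluations better than gradient-based optimization. In this toy case the uncertain `geom_b` is chosen over the certain but only slightly better `geom_c`.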


Full-text PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11781018/pdf/

Citations: 0

Ultra-Large Virtual Screening: Definition, Recent Advances, and Challenges in Drug Design.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-01-01 Epub Date: 2024-12-05 DOI: 10.1002/minf.202400305

Gabriel Corrêa Veríssimo, Rafaela Salgado Ferreira, Vinícius Gonçalves Maltarollo

Virtual screening (VS) in drug design employs computational methodologies to systematically rank molecules from a virtual compound library based on predicted features related to their biological activities or chemical properties. The recent expansion in commercially accessible compound libraries and the advancements in artificial intelligence (AI) and computational power - including enhanced central processing units (CPUs), graphics processing units (GPUs), high-performance computing (HPC), and cloud computing - have significantly expanded our capacity to screen libraries containing over 10⁹ molecules. Herein, we review the concept of ultra-large virtual screening (ULVS), focusing on the various algorithms and methodologies employed for virtual screening at this scale. In this context, we present the software utilized, applications, and results of different approaches, such as brute force docking, reaction-based docking approaches, machine learning (ML) strategies applied to docking or other VS methods, and similarity/pharmacophore search-based techniques. These examples represent a paradigm shift in the drug discovery process, demonstrating not only the feasibility of billion-scale compound screening but also their potential to identify hit candidates and increase the structural diversity of novel compounds with biological activities.
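One practical consequence of screening at this scale is that scored molecules cannot all be held in memory, so hits are usually selected from a stream. The sketch below shows that pattern with the standard library only. The `toy_library` generator, its scoring formula, and the convention that lower scores are better are assumptions for illustration; they do not describe any tool covered by the review.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def top_k_hits(scored: Iterable[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Keep the k best-scoring molecules from a stream (lower score = better).

    heapq.nsmallest holds only k items at a time, so memory use does not
    grow with the size of the library.
    """
    return heapq.nsmallest(k, scored, key=lambda item: item[1])


def toy_library(n: int) -> Iterator[Tuple[str, float]]:
    """Stand-in for a docking run: lazily yields (identifier, score) pairs."""
    for i in range(n):
        # Deterministic pseudo-score in [-11.9, -2.0]; real scores would
        # come from a docking program or a machine learning model.
        yield f"mol_{i}", -2.0 - ((i * 37) % 100) / 10.0


hits = top_k_hits(toy_library(1000), k=3)
```

The same function works unchanged whether the stream has a thousand entries or billions, since only the current best `k` are retained.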


Citations: 0

Simple User-Friendly Reaction Format.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-01-01 DOI: 10.1002/minf.202400361

David F Nippa, Alex T Müller, Kenneth Atz, David B Konrad, Uwe Grether, Rainer E Martin, Gisbert Schneider

Utilizing the growing wealth of chemical reaction data can boost synthesis planning and increase success rates. Yet, the effectiveness of machine learning tools for retrosynthesis planning and forward reaction prediction relies on accessible, well-curated data presented in a structured format. Although some public and licensed reaction databases exist, they often lack essential information about reaction conditions. To address this issue and promote the principles of findable, accessible, interoperable, and reusable (FAIR) data reporting and sharing, we introduce the Simple User-Friendly Reaction Format (SURF). SURF standardizes the documentation of reaction data through a structured tabular format, requiring only a basic understanding of spreadsheets. This format enables chemists to record the synthesis of molecules in a format that is understandable by both humans and machines, which facilitates seamless sharing and integration directly into machine learning pipelines. SURF files are designed to be interoperable, easily imported into relational databases, and convertible into other formats. This complements existing initiatives like the Open Reaction Database (ORD) and Unified Data Model (UDM). At Roche, SURF plays a crucial role in democratizing FAIR reaction data sharing and expediting the chemical synthesis process.
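To show why a flat table is easy to bring into a machine learning pipeline, the sketch below parses a small tab-delimited reaction table with Python's standard `csv` module. The column names, the tab delimiter, and the two reaction rows are placeholders invented for this example; they are not the official SURF headers, which should be taken from the SURF publication itself.

```python
import csv
import io
from typing import Any, Dict, List

# Placeholder column names for illustration; NOT the official SURF headers.
table_text = (
    "reaction_id\tstarting_material_smiles\tproduct_smiles\ttemperature_c\tyield_percent\n"
    "rxn-001\tCCO\tCC=O\t25\t81.5\n"
    "rxn-002\tCC(=O)O.CO\tCC(=O)OC\t60\t\n"
)


def load_reactions(text: str) -> List[Dict[str, Any]]:
    """Parse a tab-delimited reaction table into typed dictionaries."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "reaction_id": raw["reaction_id"],
            # Dot-separated SMILES are split into individual components.
            "starting_material_smiles": raw["starting_material_smiles"].split("."),
            "product_smiles": raw["product_smiles"],
            "temperature_c": float(raw["temperature_c"]),
            # An empty cell means the yield was not recorded.
            "yield_percent": float(raw["yield_percent"]) if raw["yield_percent"] else None,
        })
    return rows


reactions = load_reactions(table_text)
```

Because each row is one reaction and each column one field, the same file can be opened in a spreadsheet by a chemist and read by code without a custom parser.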


Full-text PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11755691/pdf/

Citations: 0

Navigating a 1E+60 Chemical Space of Peptide/Peptoid Oligomers.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-01-01 Epub Date: 2024-10-10 DOI: 10.1002/minf.202400186

Markus Orsi, Jean-Louis Reymond

Herein we report a virtual library of 1E+60 members, a common estimate for the total size of the drug-like chemical space. The library is obtained from 100 commercially available peptide and peptoid building blocks assembled into linear or cyclic oligomers of up to 30 units, forming molecules within the size range of peptide drugs and potentially accessible by solid-phase synthesis. We demonstrate ligand-based virtual screening (LBVS) using the peptide design genetic algorithm (PDGA), which evolves a population of 50 members to resemble a given target molecule using molecular fingerprint similarity as fitness function. Target molecules are reached in less than 10,000 generations. Like in many journeys, the value of the chemical space journey using PDGA lies not in reaching the target but in the journey itself, here by encountering non-obvious analogs. We also show that PDGA can be used to generate median molecules and analogs of non-peptide target molecules.
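The search principle described above, a small population evolved toward a target with fingerprint similarity as the fitness function, can be sketched with a toy genetic algorithm. This is not PDGA: the ten-letter alphabet, the pair-based "fingerprint", the mutation operator, and the elitist selection scheme are all simplifying assumptions chosen to keep the example short.

```python
import random
from typing import List, Set, Tuple

BLOCKS = "ACDEFGHIKL"          # 10 toy building blocks (the real library uses 100)
LINEAR_30MERS = 100 ** 30      # 100 blocks at 30 positions: exactly 10**60 sequences


def fingerprint(seq: str) -> Set[str]:
    """Toy fingerprint: single blocks plus adjacent pairs found in a sequence."""
    return set(seq) | {seq[i:i + 2] for i in range(len(seq) - 1)}


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mutate(seq: str, rng: random.Random) -> str:
    """Replace one randomly chosen position with a random building block."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(BLOCKS) + seq[pos + 1:]


def evolve(target: str, pop_size: int = 50, generations: int = 200,
           seed: int = 0) -> Tuple[str, List[float]]:
    """Evolve random sequences toward `target`; returns the best sequence
    and the best fitness recorded at each generation."""
    rng = random.Random(seed)
    target_fp = fingerprint(target)

    def fitness(seq: str) -> float:
        return tanimoto(fingerprint(seq), target_fp)

    population = ["".join(rng.choice(BLOCKS) for _ in target) for _ in range(pop_size)]
    history: List[float] = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]   # elitist: the top half is kept
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness), history


best, history = evolve("ACDKLGHE")
```

Because the top half always survives, the best fitness never decreases from one generation to the next. The intermediate sequences visited along the way play the role of the non-obvious analogs mentioned in the abstract.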


Citations: 0

Active learning approaches in molecule pKi prediction.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-01-01 Epub Date: 2024-08-06 DOI: 10.1002/minf.202400154

I M Kashafutdinova, A Poyezzhayeva, T Gimadiev, T Madzhidov

During the early stages of drug design, identifying compounds with suitable bioactivities is crucial. Given the vast array of potential drug databases, it is feasible to assay only a limited subset of candidates. The optimal method for selecting the candidates, aiming to minimize the overall number of assays, involves an active learning (AL) approach. In this work, we benchmarked a range of AL strategies with two main objectives: (1) to identify a strategy that ensures high model performance and (2) to select molecules with desired properties using minimal assays. To evaluate the different AL strategies, we employed a simulated AL workflow based on "virtual" experiments. These experiments leveraged ChEMBL datasets, which come with known biological activity values for the molecules. Furthermore, for classification tasks, we proposed a hybrid selection strategy that unifies the exploration and exploitation AL strategies into a single acquisition function, defined by parameters n and c. We have also shown that the popular minimal margin and maximal variance selection approaches for exploration correspond to minimization of the hybrid acquisition function with n=1 and n=2, respectively. The balance between the exploration and exploitation strategies can be adjusted using a coefficient (c), making the optimal strategy selection straightforward. The primary strength of the hybrid selection method lies in its adaptability; it offers the flexibility to adjust the criteria for molecule selection based on the specific task by modifying the value of the contribution coefficient. Our analysis revealed that, in regression tasks, AL strategies did not succeed at ensuring high model performance; however, they were successful in selecting molecules with desired properties using a minimal number of tests. In analogous experiments in classification tasks, the exploration strategy and the hybrid selection function with a constant c<1 (for n=1) and c≤0.2 (for n=2) were effective in achieving the goal of constructing a high-performance predictive model using minimal data. When searching for molecules with desired properties, exploitation and the hybrid function with c≥1 (n=1) and c≥0.7 (n=2) were efficient, identifying molecules in fewer iterations than the random selection method. Notably, when the hybrid function was set to an intermediate coefficient value (c=0.7), it successfully addressed both tasks simultaneously.
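The difference between exploration and exploitation in a classification setting can be made concrete with a small selection routine. The abstract does not give the form of the authors' hybrid acquisition function, so the linear blend below, with its weight `w`, is a hypothetical stand-in and should not be read as the function defined by n and c. The pool probabilities are invented.

```python
from typing import List, Sequence


def exploration_score(p: float) -> float:
    """Minimal-margin uncertainty: 1.0 at p = 0.5, falling to 0.0 at p = 0 or 1."""
    return 1.0 - 2.0 * abs(p - 0.5)


def exploitation_score(p: float) -> float:
    """Greedy preference for molecules predicted to be active."""
    return p


def blended_score(p: float, w: float) -> float:
    """Hypothetical blend: w = 0 is pure exploration, w = 1 is pure exploitation."""
    return (1.0 - w) * exploration_score(p) + w * exploitation_score(p)


def select_batch(probs: Sequence[float], batch_size: int, w: float) -> List[int]:
    """Return the indices of the pool molecules to assay next."""
    ranked = sorted(range(len(probs)),
                    key=lambda i: blended_score(probs[i], w), reverse=True)
    return ranked[:batch_size]


# Predicted probability of activity for five unlabeled pool molecules.
pool_probs = [0.10, 0.45, 0.90, 0.60, 0.52]
explore_pick = select_batch(pool_probs, batch_size=2, w=0.0)
exploit_pick = select_batch(pool_probs, batch_size=2, w=1.0)
```

Pure exploration picks the molecules the model is least sure about (indices 4 and 1), which tends to improve the model, while pure exploitation picks the most confident predicted actives (indices 2 and 3), which tends to find hits sooner. A single tunable weight between the two mirrors the adjustable balance the abstract describes.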


Citations: 0

A Topology-Enhanced Multi-Viewed Contrastive Approach for Molecular Graph Representation Learning and Classification.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-01-01 DOI: 10.1002/minf.202400252

Phu Pham

Graph representation learning has recently become an active research topic. Graph embeddings have diverse applications across fields such as information and social network analysis, bioinformatics and cheminformatics, natural language processing (NLP), and recommendation systems. Among the deep learning (DL) architectures used in graph representation learning, graph neural networks (GNNs) have emerged as the dominant and highly effective framework. Recent GNN-based methods have demonstrated state-of-the-art performance on complex supervised and unsupervised tasks at both the node and graph levels. To enhance multi-view and structured graph representations, contrastive learning-based techniques have been developed, known as graph contrastive learning (GCL) models. These GCL approaches use unsupervised contrastive methods to capture multi-view graph representations by comparing node and graph embeddings, yielding significant improvements in both graph-level representations and task-specific applications, such as molecular embedding and classification. However, because most GCL techniques focus on the explicit graph structure through GNN-based encoders, they often overlook critical topological insights that could be provided through topological data analysis (TDA). Given promising research indicating that topological features can greatly benefit various graph learning tasks, we propose a novel topology-enhanced, multi-view graph contrastive learning model called TMGCL. TMGCL is designed to capture and use both comprehensive multi-scale topological and global structural information from graphs. This enhanced representation capability positions TMGCL to directly support a range of applications, such as molecular classification, with improved accuracy and robustness. Extensive experiments on two real-world datasets demonstrate that TMGCL is effective and outperforms state-of-the-art GNN/GCL-based baselines.
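
The abstract does not state the exact training objective used by TMGCL. As a generic illustration of what "comparing embeddings" across two views means in contrastive learning, the following is a minimal sketch of an InfoNCE-style loss in plain Python. The function names, the temperature value, and the toy vectors are assumptions made for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss over a batch.

    Row i of view_a and row i of view_b are treated as a positive pair
    (two views of the same graph); every other row of view_b is a negative.
    The temperature of 0.5 is an arbitrary illustrative choice.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (hypothetical): two graphs, two views each.
structural_view = [[1.0, 0.0], [0.0, 1.0]]
aligned_view = [[1.0, 0.0], [0.0, 1.0]]
shuffled_view = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(structural_view, aligned_view)
shuffled_loss = info_nce(structural_view, shuffled_view)
```

The loss is small when each embedding is closest to its own counterpart in the other view and grows when the pairing is broken, which is the signal a contrastive encoder is trained on.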



Citations: 0

The Chemical Space Spanned by Manually Curated Datasets of Natural and Synthetic Compounds with Activities against SARS-CoV-2.

IF 2.8 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-01-01 Epub Date: 2024-11-23 DOI: 10.1002/minf.202400293

Jude Y Betow, Gemma Turon, Clovis S Metuge, Simeon Akame, Vanessa A Shu, Oyere T Ebob, Miquel Duran-Frigola, Fidele Ntie-Kang

Diseases caused by viruses are challenging to contain, as their outbreak and spread can be very sudden and are compounded by rapid mutations, making the development of drugs and vaccines a continued endeavour that requires fast discovery and preparedness. Targeting viral infections with small molecules remains one of the treatment options to reduce transmission and the disease burden. A lesson learned from the recent coronavirus disease (COVID-19) is to collect ready-to-screen small molecule libraries in preparation for the next viral outbreak, and potentially find a clinical candidate before it becomes a pandemic. Public availability of diverse compound libraries, well annotated in terms of chemical structures and scaffolds, modes of action, and bioactivities, is therefore crucial to ensure the participation of academic laboratories in these screening efforts, especially in resource-limited settings where synthesis, testing and computing capacity are scarce. Here, we demonstrate a low-resource approach to populate the chemical space of naturally occurring and synthetic small molecules that have shown in vitro and/or in vivo activities against the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its target proteins. We have manually curated two datasets of small molecules (naturally occurring and synthetically derived) by reading and collecting the published literature. Information from the literature reveals that a majority of the reported SARS-CoV-2 compounds act by inhibiting the main protease, while 25% of the compounds currently have no known target. Scaffold analysis and principal component analysis revealed that the most common scaffolds in the datasets are quite distinct. We then expanded the initial manually curated dataset of over 1200 compounds via an ultra-large-scale 2D and 3D similarity search, obtaining an expanded collection of over 150,000 purchasable compounds. The spanned chemical space extends significantly beyond that of a commercially available coronavirus library of more than 20,000 small molecules and constitutes a good starting collection for virtual screening campaigns, given its manageable size and proximity to hand-curated compounds.
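
The abstract does not say which fingerprints, similarity measure, or cutoff were used for the 2D similarity search. As a rough illustration of the general idea only, the sketch below expands a small curated set against a purchasable library using the Tanimoto coefficient on sets of fingerprint bits. The function names, the 0.7 threshold, and the toy bit sets are hypothetical; a real workflow would compute fingerprints with a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        return 0.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)

def similarity_search(queries, library, threshold=0.7):
    """Return library IDs whose best similarity to any query meets the threshold.

    queries and library map compound IDs to sets of fingerprint bit indices.
    The threshold of 0.7 is an illustrative assumption.
    """
    hits = {}
    for lib_id, lib_bits in library.items():
        best = max((tanimoto(q, lib_bits) for q in queries.values()), default=0.0)
        if best >= threshold:
            hits[lib_id] = best
    return hits

# Toy data (hypothetical bit sets, not real fingerprints).
curated = {"hit_1": {1, 4, 7, 9}}
purchasable = {"cmpd_A": {1, 4, 7, 9, 12}, "cmpd_B": {2, 3, 5}}

expanded = similarity_search(curated, purchasable)
```

Here `cmpd_A` shares four of five bits with the curated hit and is retained, while `cmpd_B` shares none and is dropped; scaling this to an ultra-large library is mostly a matter of indexing and batching rather than a change in the comparison itself.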



Citations: 0