
Journal of Chemical Information and Modeling: Latest Articles

MD-LAIs Software: Computing Whole-Sequence and Amino Acid-Level "Embeddings" for Peptides and Proteins.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-09 Epub Date: 2024-11-18 DOI: 10.1021/acs.jcim.3c01189 Pages: 8665-8672
Ernesto Contreras-Torres, Yovani Marrero-Ponce

Several computational tools have been developed to calculate sequence-based molecular descriptors (MDs) for peptides and proteins. However, these tools have certain limitations: 1) They generally lack capabilities for curating input data. 2) Their outputs often exhibit significant overlap. 3) There is limited availability of MDs at the amino acid (aa) level. 4) They lack flexibility in computing specific MDs. To address these issues, we developed MD-LAIs (Molecular Descriptors from Local Amino acid Invariants), Java-based software designed to compute both whole-sequence and aa-level MDs for peptides and proteins. These MDs are generated by applying aggregation operators (AOs) to macromolecular vectors containing the chemical-physical and structural properties of aas. The set of AOs includes both nonclassical (e.g., Minkowski norms) and classical AOs (e.g., Radial Distribution Function). Classical AOs capture neighborhood structural information at different k levels, while nonclassical AOs are applied using a sliding window to generalize the aa-level output. A weighting system based on fuzzy membership functions is also included to account for the contributions of individual aas. MD-LAIs features: 1) a module for data curation tasks, 2) a feature selection module, 3) projects of highly relevant MDs, and 4) low-dimensional lists of informative global and aa-level MDs. Overall, we expect that MD-LAIs will be a valuable tool for encoding protein or peptide sequences. The software is freely available as a stand-alone system on GitHub (https://github.com/Grupo-Medicina-Molecular-y-Traslacional/MD_LAIS).
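
The aggregation idea in this abstract can be illustrated with a minimal Python sketch. It is not MD-LAIs code (the software itself is Java-based), and everything named here is an assumption made for illustration: the Kyte-Doolittle hydropathy scale stands in for the "chemical-physical properties" of amino acids, and the function names, window size, and edge handling are hypothetical. The sketch shows a Minkowski norm used as a whole-sequence aggregation operator, and the same norm applied over a sliding window to give one value per residue.

```python
from typing import Dict, List, Sequence

# Kyte-Doolittle hydropathy values, used here only as an example property scale.
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_vector(sequence: str, scale: Dict[str, float]) -> List[float]:
    """Map each residue to its property value (one entry per amino acid)."""
    return [scale[aa] for aa in sequence]


def minkowski_norm(values: Sequence[float], p: float = 2.0) -> float:
    """Aggregate a vector into one number: (sum |v|^p)^(1/p)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def sliding_window_norms(values: Sequence[float], window: int = 3,
                         p: float = 2.0) -> List[float]:
    """Apply the norm to a window centred on each position.

    Windows are truncated at the sequence ends (an assumption; other
    edge treatments are possible). Returns one value per residue.
    """
    half = window // 2
    result = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        result.append(minkowski_norm(values[lo:hi], p))
    return result


if __name__ == "__main__":
    vec = property_vector("ACDKLV", KYTE_DOOLITTLE)
    print("whole-sequence descriptor:", minkowski_norm(vec, p=2.0))
    print("residue-level descriptors:", sliding_window_norms(vec, window=3))
```

Varying `p` (for example 1, 2, 3) yields a small family of descriptors from a single property vector, which is one plausible reading of how a set of aggregation operators multiplies the number of available descriptors.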

Citations: 0
CACHE Challenge #1: Docking with GNINA Is All You Need.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-09 DOI: 10.1021/acs.jcim.4c01429
Ian Dunn, Somayeh Pirhadi, Yao Wang, Smmrithi Ravindran, Carter Concepcion, David Ryan Koes

We describe our winning submission to the first Critical Assessment of Computational Hit-Finding Experiments (CACHE) challenge. In this challenge, 23 participants employed a diverse array of structure-based methods to identify hits to a target with no known ligands. We utilized two methods, pharmacophore search and molecular docking, to identify our initial hit list and compounds for the hit expansion phase. Unlike many other participants, we limited ourselves to using docking scores in identifying and ranking hits. Our resulting best hit series tied for first place when evaluated by a panel of expert judges. Here, we report our top-performing open-source workflow and results.

Citations: 0
Ordinal Confidence Level Assignments for Regression Model Predictions.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-09 DOI: 10.1021/acs.jcim.4c01755
Steven Kearnes, Patrick Riley

We present a simple method for assigning accurate confidence levels to molecular property predictions from regression models. These confidence levels are easy to interpret and useful for making decisions in drug discovery programs. We demonstrate their performance using time-split validation with assay data from the Relay Therapeutics internal database.

Citations: 0
Benchmarking Cross-Docking Strategies in Kinase Drug Discovery.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-09 Epub Date: 2024-11-18 DOI: 10.1021/acs.jcim.4c00905 Pages: 8848-8858
David A Schaller, Clara D Christ, John D Chodera, Andrea Volkamer

In recent years, machine learning has transformed many aspects of the drug discovery process, including small molecule design, for which the prediction of bioactivity is an integral part. Leveraging structural information about the interactions between a small molecule and its protein target has great potential for downstream machine learning scoring approaches but is fundamentally limited by the accuracy with which protein-ligand complex structures can be predicted in a reliable and automated fashion. With the goal of finding practical approaches to generating useful kinase-inhibitor complex geometries for such scoring approaches, we present a kinase-centric docking benchmark that evaluates different classes of docking and pose selection strategies by how well they recapitulate experimentally observed binding modes in a realistic cross-docking scenario. The assembled benchmark data set focuses on the well-studied protein kinase family and comprises a subset of 589 protein structures cocrystallized with 423 ATP-competitive ligands. We find that the docking methods biased by the cocrystallized ligand, utilizing shape overlap with or without maximum common substructure matching, are more successful in recovering binding poses than standard physics-based docking alone. Also, docking into multiple structures significantly increases the chance of generating a low root-mean-square deviation (RMSD) docking pose. Docking with an approach that combines all three methods (Posit) into structures with the most similar cocrystallized ligands according to the maximum common substructure (MCS) proved to be the most efficient way to reproduce binding poses, achieving a success rate of 70.4% across all included systems. The studied docking and pose selection strategies, which utilize the OpenEye Toolkits, were implemented as pipelines in the KinoML framework, allowing automated and reliable protein-ligand complex generation for future downstream machine learning tasks. Although focused on protein kinases, we believe that the general findings can also be transferred to other protein families.
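
How a pose-recovery success rate of this kind is typically computed can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's code: the 2.0 Å cutoff is a commonly used convention that the abstract does not state, the function names are hypothetical, and the RMSD here assumes identical atom ordering in a shared coordinate frame with no symmetry correction.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(pose: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """Heavy-atom RMSD between a docked pose and the crystal pose.

    Assumes both lists hold the same atoms in the same order and the same
    coordinate frame (no superposition, no symmetry handling).
    """
    if len(pose) != len(reference) or not pose:
        raise ValueError("pose and reference must be non-empty and equal in length")
    squared = sum(
        (a - b) ** 2
        for atom_pose, atom_ref in zip(pose, reference)
        for a, b in zip(atom_pose, atom_ref)
    )
    return math.sqrt(squared / len(pose))


def success_rate(rmsds_per_system: Sequence[Sequence[float]],
                 cutoff: float = 2.0) -> float:
    """Fraction of systems with at least one pose at or below the cutoff.

    Each inner sequence holds the RMSDs of all poses generated for one
    ligand, e.g. one pose per receptor structure it was docked into.
    The 2.0 angstrom default is an assumed convention.
    """
    if not rmsds_per_system:
        raise ValueError("no systems given")
    hits = sum(1 for values in rmsds_per_system if values and min(values) <= cutoff)
    return hits / len(rmsds_per_system)


if __name__ == "__main__":
    crystal = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
    docked = [(0.2, 0.1, 0.0), (1.4, 0.3, 0.1)]
    print("RMSD:", rmsd(docked, crystal))
    # Toy numbers only: three ligands, docked into two, one, and one structures.
    print("success rate:", success_rate([[3.1, 1.2], [4.0], [0.5]]))
```

Because `success_rate` takes the best pose per system, adding more receptor structures per ligand can only keep or raise the hit count, which matches the abstract's observation that docking into multiple structures improves the odds of a low-RMSD pose. In practice a separate pose selection step has to pick that pose without knowing the answer.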

Citations: 0
Kinetics-Based State Definitions for Discrete Binding Conformations of T4 L99A in MD via Markov State Modeling.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-09 Epub Date: 2024-11-26 DOI: 10.1021/acs.jcim.4c01364 Pages: 8870-8879
Chris Zhang, Meghan Osato, David L Mobley

As a model system, the binding pocket of the L99A mutant of T4 lysozyme has been the subject of numerous computational free energy studies. However, previous studies have failed to fully sample and account for the observed changes in the binding pocket of T4 L99A upon binding of a congeneric ligand series, limiting the accuracy of results. In this work, we resolve in MD the closed, intermediate, and open states of T4 L99A previously reported experimentally, and we establish definitions for these states based on the dynamics of the system. From this analysis, we arrive at two primary conclusions. First, assignment of simulation trajectories into discrete states should not be done simply based on RMSD to crystal structures, as this can result in misassignment of states. Second, the different metastable conformations studied here need to be carefully treated, as we estimate the time scales for conformational interconversion to be on the order of 10^2 to 10^3 ns, far longer than time scales for typical binding calculations. We conclude with a discussion on the need to develop enhanced sampling methods to generally account for significant changes in protein conformation due to relatively small ligand perturbations.
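
The link between a Markov state model and an interconversion time scale can be shown with a small Python sketch. It is a deliberately simplified illustration, not the authors' workflow: it uses two states instead of the three (closed, intermediate, open) in the abstract, a plain row-normalized count matrix without reversibility constraints, and hypothetical function names. For a two-state row-stochastic matrix the second eigenvalue is 1 - T[0][1] - T[1][0], and the implied time scale at lag time tau is t = -tau / ln(lambda_2).

```python
import math
from typing import List, Sequence


def count_transitions(dtraj: Sequence[int], n_states: int, lag: int) -> List[List[int]]:
    """Count i -> j transitions separated by `lag` frames in a discrete trajectory."""
    counts = [[0] * n_states for _ in range(n_states)]
    for t in range(len(dtraj) - lag):
        counts[dtraj[t]][dtraj[t + lag]] += 1
    return counts


def row_normalize(counts: Sequence[Sequence[int]]) -> List[List[float]]:
    """Turn a count matrix into a row-stochastic transition matrix."""
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("a state has no outgoing transitions at this lag")
        matrix.append([c / total for c in row])
    return matrix


def two_state_implied_timescale(T: Sequence[Sequence[float]], lag_time_ns: float) -> float:
    """Implied time scale (ns) from the second eigenvalue of a 2x2 transition matrix."""
    second_eigenvalue = 1.0 - T[0][1] - T[1][0]
    if not 0.0 < second_eigenvalue < 1.0:
        raise ValueError("second eigenvalue must lie in (0, 1) for a finite time scale")
    return -lag_time_ns / math.log(second_eigenvalue)


if __name__ == "__main__":
    # Toy state assignments (0 = closed, 1 = open), one frame per nanosecond.
    dtraj = [0, 0, 0, 1, 1, 1, 0, 0]
    T = row_normalize(count_transitions(dtraj, n_states=2, lag=1))
    print("transition matrix:", T)
    print("implied time scale (ns):", two_state_implied_timescale(T, lag_time_ns=1.0))
```

The formula also shows why such slow motions are hard to sample: a time scale of hundreds of nanoseconds at a 1 ns lag corresponds to a second eigenvalue very close to 1, meaning only a handful of transitions occur in a trajectory of typical length.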

Citations: 0
ProAffinity-GNN: A Novel Approach to Structure-Based Protein-Protein Binding Affinity Prediction via a Curated Data Set and Graph Neural Networks.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-09 Epub Date: 2024-11-18 DOI: 10.1021/acs.jcim.4c01850 Pages: 8796-8808
Zhiyuan Zhou, Yueming Yin, Hao Han, Yiping Jia, Jun Hong Koh, Adams Wai-Kin Kong, Yuguang Mu

Protein-protein interactions (PPIs) are crucial for understanding biological processes and disease mechanisms, contributing significantly to advances in protein engineering and drug discovery. The accurate determination of binding affinities, essential for decoding PPIs, faces challenges due to the substantial time and financial costs involved in experimental and theoretical methods. This situation underscores the urgent need for more effective and precise methodologies for predicting binding affinity. Despite the abundance of research on PPI modeling, the field of quantitative binding affinity prediction remains underexplored, mainly due to a lack of comprehensive data. This study seeks to address these needs by manually curating pairwise interaction labels on available 3D structures of protein complexes with experimentally determined binding affinities, creating the largest data set for structure-based pairwise protein interaction with binding affinity to date. Subsequently, we introduce ProAffinity-GNN, a novel deep learning framework that uses a protein language model and a graph neural network (GNN) to improve the accuracy of structure-based protein-protein binding affinity prediction. The evaluation results across several benchmark test sets and an additional case study demonstrate that ProAffinity-GNN not only outperforms existing models in terms of accuracy but also shows strong generalization capabilities.

Citations: 0
Molecular Design for Cardiac Cell Differentiation Using a Small Data Set and Decorated Shape Features.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-09 Epub Date: 2024-11-25 DOI: 10.1021/acs.jcim.4c01353 Pages: 8824-8837
Fatemeh Etezadi, Shunichi Ito, Kosuke Yasui, Rodi Kado Abdalkader, Itsunari Minami, Motonari Uesugi, Namasivayam Ganesh Pandian, Haruko Nakano, Atsushi Nakano, Daniel M Packwood

The discovery of small organic compounds for inducing stem cell differentiation is a time- and resource-intensive process. While data science could, in principle, streamline the discovery of these compounds, novel approaches are required due to the difficulty of acquiring training data from large numbers of example compounds. In this paper, we present the design of a new compound for inducing cardiomyocyte differentiation using simple regression models trained with a data set containing only 80 examples. We introduce decorated shape descriptors, an information-rich molecular feature representation that integrates both molecular shape and hydrophilicity information. These models demonstrate improved performance compared to ones using standard molecular descriptors based on shape alone. Model overtraining is diagnosed using a new type of sensitivity analysis. Our new compound is designed using a conservative molecular design strategy, and its effectiveness is confirmed through expression profiles of cardiomyocyte-related marker genes using real-time polymerase chain reaction experiments on human iPS cell lines. This work demonstrates a viable data-driven strategy for designing new compounds for stem cell differentiation protocols and will be useful in situations where training data is limited.

Citations: 0
Transparent Machine Learning Model to Understand Drug Permeability through the Blood-Brain Barrier.
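IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-09 Epub Date: 2024-11-18 DOI: 10.1021/acs.jcim.4c01217 Pages: 8718-8728 Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11632763/pdf/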
Hengjian Jia, Gabriele C Sosso

The blood-brain barrier (BBB) selectively regulates the passage of chemical compounds into and out of the central nervous system (CNS). As such, understanding the permeability of drug molecules through the BBB is key to treating neurological diseases and evaluating the response of the CNS to medical treatments. Within the last two decades, a diverse portfolio of machine learning (ML) models has been regularly utilized as a tool to predict, and, to a much lesser extent, understand, several functional properties of medicinal drugs, including their propensity to pass through the BBB. However, the most numerically accurate models to date lack transparency, as they typically rely on complex blends of different descriptors (or features or fingerprints), many of which are not necessarily interpretable in a straightforward fashion. In fact, the "black-box" nature of these models has prevented us from pinpointing any specific design rule to craft the next generation of pharmaceuticals that need to pass (or not) through the BBB. In this work, we have developed an ML model that leverages an uncomplicated, transparent set of descriptors to predict the permeability of drug molecules through the BBB. In addition to its simplicity, our model achieves accuracy comparable to that of state-of-the-art models. Moreover, we use a naive Bayes model as an analytical tool to provide further insights into the structure-function relation that underpins the capacity of a given drug molecule to pass through the BBB. Although our results are computational rather than experimental, we have identified several molecular fragments and functional groups that may significantly impact a drug's likelihood of permeating the BBB. This work provides a unique angle to the BBB problem and lays the foundations for future work aimed at leveraging additional transparent descriptors, potentially obtained via bespoke molecular dynamics simulations.
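
Why a naive Bayes model can serve as an analytical tool is easiest to see in code: each feature contributes an independent, additive log-odds term that can be read off directly. The Python sketch below is a generic Bernoulli naive Bayes over binary "fragment present" flags, written under assumptions. The fragment names, the four toy molecules, and their labels are invented for illustration and are not data or findings from the paper, and the paper's actual descriptors and model variant may differ.

```python
import math
from typing import Dict, List, Sequence


def fit_bernoulli_nb(X: Sequence[Sequence[int]], y: Sequence[int],
                     alpha: float = 1.0) -> Dict[str, Dict[int, object]]:
    """Fit per-class fragment frequencies with Laplace smoothing.

    X holds 0/1 flags (fragment absent/present); y holds 0/1 labels
    (here 1 = permeable). Both classes must appear in y.
    """
    n_features = len(X[0])
    model: Dict[str, Dict[int, object]] = {"prior": {}, "p": {}}
    for label in (0, 1):
        rows = [x for x, target in zip(X, y) if target == label]
        if not rows:
            raise ValueError("both classes must be present in the training data")
        model["prior"][label] = len(rows) / len(X)
        model["p"][label] = [
            (sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return model


def fragment_log_odds(model: Dict[str, Dict[int, object]]) -> List[float]:
    """Per-fragment evidence: log P(present | class 1) / P(present | class 0)."""
    return [math.log(p1 / p0) for p1, p0 in zip(model["p"][1], model["p"][0])]


def predict_log_odds(model: Dict[str, Dict[int, object]], x: Sequence[int]) -> float:
    """Total log-odds of class 1 for one molecule; positive favours class 1."""
    score = math.log(model["prior"][1] / model["prior"][0])
    for j, present in enumerate(x):
        p1, p0 = model["p"][1][j], model["p"][0][j]
        score += math.log(p1 if present else 1 - p1)
        score -= math.log(p0 if present else 1 - p0)
    return score


if __name__ == "__main__":
    fragments = ["fragment_A", "fragment_B"]  # hypothetical names
    X = [[1, 0], [1, 1], [0, 1], [0, 1]]      # invented toy data
    y = [0, 0, 1, 1]
    model = fit_bernoulli_nb(X, y)
    for name, value in zip(fragments, fragment_log_odds(model)):
        print(name, round(value, 3))
    print("score for [0, 1]:", predict_log_odds(model, [0, 1]))
```

Sorting fragments by `fragment_log_odds` gives a ranked list of features associated with each class. That ranking reflects association in the training set under the independence assumption, not a causal design rule, which is consistent with the abstract's cautious "may significantly impact" wording.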

Citations: 0
Chemoenzymatic Synthesis Planning Guided by Reaction Type Score.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-08 DOI: 10.1021/acs.jcim.4c01525
Hongxiang Li, Xuan Liu, Guangde Jiang, Huimin Zhao

Thanks to the growing interest in computer-aided synthesis planning (CASP), a wide variety of retrosynthesis and retrobiosynthesis tools have been developed in the past decades. However, synthesis planning tools for multistep chemoenzymatic reactions are still rare despite the widespread use of enzymatic reactions in chemical synthesis. Herein, we report a reaction type score (RTscore)-guided chemoenzymatic synthesis planning (RTS-CESP) strategy. Briefly, the RTscore is trained using a text-based convolutional neural network (TextCNN) to distinguish synthesis reactions from decomposition reactions and evaluate synthesis efficiency. Once multiple chemical synthesis routes are generated by a retrosynthesis tool for a target molecule, RTscore is used to rank them and find the step(s) that can be replaced by enzymatic reactions to improve synthesis efficiency. As proof of concept, RTS-CESP was applied to 10 molecules with known chemoenzymatic synthesis routes in the literature and was able to predict all of them with six being the top-ranked routes. Moreover, RTS-CESP was employed for 1000 molecules in the boutique database and was able to predict the chemoenzymatic synthesis routes for 554 molecules, outperforming ASKCOS, a state-of-the-art chemoenzymatic synthesis planning tool. Finally, RTS-CESP was used to design a new chemoenzymatic synthesis route for the FDA-approved drug Alclofenac, which was shorter than the literature-reported route and has been experimentally validated.

Citations: 0
From SMILES to Enhanced Molecular Property Prediction: A Unified Multimodal Framework with Predicted 3D Conformers and Contrastive Learning Techniques.
IF 5.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MEDICINAL Pub Date: 2024-12-06 DOI: 10.1021/acs.jcim.4c01240
Long D Nguyen, Quang H Nguyen, Quang H Trinh, Binh P Nguyen

We present a novel molecular property prediction framework that requires only the SMILES format as input but is designed to be multimodal by incorporating predicted 3D conformer representations. Our model captures comprehensive molecular features by leveraging both the sequential character structure of SMILES and the three-dimensional spatial structure of conformers. The framework employs contrastive learning techniques, utilizing InfoNCE loss to align SMILES and conformer embeddings, along with task-specific loss functions, such as ConR for regression and SupCon for classification. To address data imbalance, a common challenge in drug discovery, we incorporate feature distribution smoothing (FDS). We evaluated the framework through multiple case studies, including SARS-CoV-2 drug docking score prediction, molecular property prediction using MoleculeNet data sets, and kinase inhibitor prediction for JAK-1, JAK-2, and MAPK-14 using custom data sets curated from PubChem. The results consistently outperformed state-of-the-art methods, with ConR and FDS significantly improving regression tasks and SupCon enhancing classification performance. These findings highlight the flexibility and robustness of our multimodal model, demonstrating its effectiveness across diverse molecular property prediction tasks, with promising applications in drug discovery and molecular analysis.
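
For readers unfamiliar with the InfoNCE loss named in this abstract, the following is a minimal sketch of the general technique, not the authors' implementation. It assumes each molecule in a batch has one SMILES embedding and one conformer embedding, treats the matching pair as the positive and every other conformer in the batch as a negative, and uses cosine similarity with a temperature. The function names, the default temperature of 0.1, the one-directional (SMILES to conformer) form, and the toy vectors are all illustrative assumptions.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def info_nce(smiles_emb, conformer_emb, temperature=0.1):
    """Mean InfoNCE loss over a batch (SMILES -> conformer direction only).

    Row i of smiles_emb and row i of conformer_emb are assumed to describe
    the same molecule (the positive pair); all other rows act as negatives.
    """
    if not smiles_emb or len(smiles_emb) != len(conformer_emb):
        raise ValueError("both batches must be non-empty and the same size")

    anchors = [l2_normalize(v) for v in smiles_emb]
    candidates = [l2_normalize(v) for v in conformer_emb]

    total = 0.0
    for i, anchor in enumerate(anchors):
        # Temperature-scaled cosine similarity to every conformer in the batch.
        logits = [
            sum(a * c for a, c in zip(anchor, cand)) / temperature
            for cand in candidates
        ]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching conformer.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical toy embeddings: two molecules, two dimensions.
smiles_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_conformers = [[1.0, 0.0], [0.0, 1.0]]
swapped_conformers = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(smiles_batch, aligned_conformers)
swapped_loss = info_nce(smiles_batch, swapped_conformers)
```

In this toy case the loss is small when each SMILES embedding points the same way as its own conformer embedding and large when the pairs are swapped, which is the behavior that pulls the two modalities into a shared space during training.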

Citations: 0