Pub Date: 2024-12-09 | Epub Date: 2024-11-18 | DOI: 10.1021/acs.jcim.3c01189
Ernesto Contreras-Torres, Yovani Marrero-Ponce
Several computational tools have been developed to calculate sequence-based molecular descriptors (MDs) for peptides and proteins. However, these tools have certain limitations: 1) They generally lack capabilities for curating input data. 2) Their outputs often exhibit significant overlap. 3) There is limited availability of MDs at the amino acid (aa) level. 4) They lack flexibility in computing specific MDs. To address these issues, we developed MD-LAIs (Molecular Descriptors from Local Amino acid Invariants), Java-based software designed to compute both whole-sequence and aa-level MDs for peptides and proteins. These MDs are generated by applying aggregation operators (AOs) to macromolecular vectors containing the chemical-physical and structural properties of aas. The set of AOs includes both nonclassical (e.g., Minkowski norms) and classical AOs (e.g., Radial Distribution Function). Classical AOs capture neighborhood structural information at different k levels, while nonclassical AOs are applied using a sliding window to generalize the aa-level output. A weighting system based on fuzzy membership functions is also included to account for the contributions of individual aas. MD-LAIs features: 1) a module for data curation tasks, 2) a feature selection module, 3) projects of highly relevant MDs, and 4) low-dimensional lists of informative global and aa-level MDs. Overall, we expect that MD-LAIs will be a valuable tool for encoding protein or peptide sequences. The software is freely available as a stand-alone system on GitHub (https://github.com/Grupo-Medicina-Molecular-y-Traslacional/MD_LAIS).
Title: MD-LAIs Software: Computing Whole-Sequence and Amino Acid-Level "Embeddings" for Peptides and Proteins. | Journal: Journal of Chemical Information and Modeling | Pages: 8665-8672
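The aggregation step described in the MD-LAIs abstract above (an aggregation operator applied to a vector of amino acid properties, optionally through a sliding window to obtain aa-level values) can be illustrated with a minimal Python sketch. This is not the MD-LAIs implementation, which is Java-based; the function names, the choice of the Kyte-Doolittle hydropathy scale, the window handling at the sequence ends, and the example peptide are all assumptions made for illustration.

```python
from typing import Dict, List

# Kyte-Doolittle hydropathy index, used here only as an example property scale
# (an assumption; the abstract does not say which properties MD-LAIs uses).
KYTE_DOOLITTLE: Dict[str, float] = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_vector(sequence: str, scale: Dict[str, float]) -> List[float]:
    """Map each residue to its property value (KeyError on unknown residues)."""
    return [scale[aa] for aa in sequence.upper()]


def minkowski_norm(values: List[float], p: float = 2.0) -> float:
    """Minkowski (L_p) norm of a vector: (sum |v|^p)^(1/p)."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def sliding_window_norms(values: List[float], window: int = 3, p: float = 2.0) -> List[float]:
    """One norm per residue, computed over a centered window.

    Windows are truncated at the sequence ends (a hypothetical choice).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(minkowski_norm(values[lo:hi], p))
    return out


if __name__ == "__main__":
    vec = property_vector("GIGKFLHSAK", KYTE_DOOLITTLE)  # arbitrary example peptide
    print("whole-sequence descriptor:", minkowski_norm(vec, p=2.0))
    print("aa-level descriptors:", sliding_window_norms(vec, window=3, p=2.0))
```

Applying the norm to the full vector yields one whole-sequence value, while the windowed variant yields one value per residue, which mirrors the global versus aa-level distinction drawn in the abstract.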
Pub Date: 2024-12-09 | DOI: 10.1021/acs.jcim.4c01429
Ian Dunn, Somayeh Pirhadi, Yao Wang, Smmrithi Ravindran, Carter Concepcion, David Ryan Koes
We describe our winning submission to the first Critical Assessment of Computational Hit-Finding Experiments (CACHE) challenge. In this challenge, 23 participants employed a diverse array of structure-based methods to identify hits to a target with no known ligands. We utilized two methods, pharmacophore search and molecular docking, to identify our initial hit list and compounds for the hit expansion phase. Unlike many other participants, we limited ourselves to using docking scores in identifying and ranking hits. Our resulting best hit series tied for first place when evaluated by a panel of expert judges. Here, we report our top-performing open-source workflow and results.
Title: CACHE Challenge #1: Docking with GNINA Is All You Need. | Journal: Journal of Chemical Information and Modeling
Pub Date: 2024-12-09 | DOI: 10.1021/acs.jcim.4c01755
Steven Kearnes, Patrick Riley
We present a simple method for assigning accurate confidence levels to molecular property predictions from regression models. These confidence levels are easy to interpret and useful for making decisions in drug discovery programs. We demonstrate their performance using time-split validation with assay data from the Relay Therapeutics internal database.
Title: Ordinal Confidence Level Assignments for Regression Model Predictions. | Journal: Journal of Chemical Information and Modeling
Pub Date: 2024-12-09 | Epub Date: 2024-11-18 | DOI: 10.1021/acs.jcim.4c00905
David A Schaller, Clara D Christ, John D Chodera, Andrea Volkamer
In recent years, machine learning has transformed many aspects of the drug discovery process, including small molecule design, for which the prediction of bioactivity is an integral part. Leveraging structural information about the interactions between a small molecule and its protein target has great potential for downstream machine learning scoring approaches but is fundamentally limited by the accuracy with which protein-ligand complex structures can be predicted in a reliable and automated fashion. With the goal of finding practical approaches to generating useful kinase-inhibitor complex geometries for downstream machine learning scoring approaches, we present a kinase-centric docking benchmark that evaluates different classes of docking and pose selection strategies by how well they recapitulate experimentally observed binding modes in a realistic cross-docking scenario. The assembled benchmark data set focuses on the well-studied protein kinase family and comprises a subset of 589 protein structures cocrystallized with 423 ATP-competitive ligands. We find that the docking methods biased by the cocrystallized ligand, utilizing shape overlap with or without maximum common substructure matching, are more successful in recovering binding poses than standard physics-based docking alone. Also, docking into multiple structures significantly increases the chance of generating a low root-mean-square deviation (RMSD) docking pose. Docking utilizing an approach that combines all three methods (Posit) into structures with the most similar cocrystallized ligands according to the maximum common substructure (MCS) proved to be the most efficient way to reproduce binding poses, achieving a success rate of 70.4% across all included systems. The studied docking and pose selection strategies, which utilize the OpenEye Toolkits, were implemented into pipelines of the KinoML framework, allowing automated and reliable protein-ligand complex generation for future downstream machine learning tasks. Although focused on protein kinases, we believe that the general findings can also be transferred to other protein families.
Title: Benchmarking Cross-Docking Strategies in Kinase Drug Discovery. | Journal: Journal of Chemical Information and Modeling | Pages: 8848-8858
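The two quantities the cross-docking abstract above relies on, pose RMSD and success rate, can be sketched in a few lines of Python. This is an illustration only, not the KinoML or OpenEye code: the function names are hypothetical, the RMSD is a plain in-place RMSD over matched atoms (no superposition, no symmetry correction), and the 2.0 Å success cutoff is a commonly used convention assumed here because the abstract does not state its threshold.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(pose: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """In-place RMSD between two equally ordered sets of atom coordinates."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(pose, reference)
    )
    return math.sqrt(squared / len(pose))


def best_rmsd_over_structures(rmsds_per_structure: Sequence[float]) -> float:
    """Lowest RMSD obtained when the same ligand is docked into several structures."""
    return min(rmsds_per_structure)


def success_rate(best_rmsds: Sequence[float], cutoff: float = 2.0) -> float:
    """Fraction of systems whose pose RMSD is at or below the cutoff (assumed 2.0 Å)."""
    if not best_rmsds:
        raise ValueError("no systems given")
    return sum(1 for r in best_rmsds if r <= cutoff) / len(best_rmsds)


if __name__ == "__main__":
    # Hypothetical RMSD values (Å) for three ligands, each docked into several structures.
    per_ligand: List[List[float]] = [[3.1, 1.4, 2.8], [4.0, 3.5], [1.9, 2.2, 0.8]]
    best = [best_rmsd_over_structures(r) for r in per_ligand]
    print("best RMSD per ligand:", best)
    print("success rate:", success_rate(best))
```

Taking the minimum over structures shows why docking into multiple structures can only keep or lower the best RMSD for a ligand, which is consistent with the abstract's observation that it raises the chance of a low-RMSD pose.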
Pub Date: 2024-12-09 | Epub Date: 2024-11-26 | DOI: 10.1021/acs.jcim.4c01364
Chris Zhang, Meghan Osato, David L Mobley
As a model system, the binding pocket of the L99A mutant of T4 lysozyme has been the subject of numerous computational free energy studies. However, previous studies have failed to fully sample and account for the observed changes in the binding pocket of T4 L99A upon binding of a congeneric ligand series, limiting the accuracy of results. In this work, we resolve in MD the closed, intermediate, and open states of T4 L99A previously reported experimentally and establish definitions for these states based on the dynamics of the system. From this analysis, we arrive at two primary conclusions. First, assignment of simulation trajectories into discrete states should not be done simply based on RMSD to crystal structures, as this can result in misassignment of states. Second, the different metastable conformations studied here need to be carefully treated, as we estimate the time scales for conformational interconversion to be on the order of 10^2 to 10^3 ns, far longer than the time scales of typical binding calculations. We conclude with a discussion on the need to develop enhanced sampling methods to generally account for significant changes in protein conformation due to relatively small ligand perturbations.
Title: Kinetics-Based State Definitions for Discrete Binding Conformations of T4 L99A in MD via Markov State Modeling. | Journal: Journal of Chemical Information and Modeling | Pages: 8870-8879
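The interconversion time scales mentioned in the T4 L99A abstract above are the kind of quantity a Markov state model yields through its implied timescales, t = -τ / ln(λ), where τ is the lag time and λ is a non-unit eigenvalue of the transition matrix. The sketch below is a minimal two-state illustration of that standard relation, not the authors' model: the function name, the lag time, and the transition probabilities are hypothetical values chosen only so that the result falls in the 10^2 ns range.

```python
import math
from typing import Sequence


def two_state_implied_timescale(transition: Sequence[Sequence[float]], lag_ns: float) -> float:
    """Implied timescale of a 2x2 row-stochastic transition matrix.

    For a two-state chain the eigenvalues are 1 and trace - 1, so the single
    relaxation timescale is -lag / ln(trace - 1).
    """
    if len(transition) != 2 or any(len(row) != 2 for row in transition):
        raise ValueError("expected a 2x2 matrix")
    for row in transition:
        if abs(sum(row) - 1.0) > 1e-9 or min(row) < 0.0:
            raise ValueError("rows must be non-negative and sum to 1")
    second_eigenvalue = transition[0][0] + transition[1][1] - 1.0
    if not 0.0 < second_eigenvalue < 1.0:
        raise ValueError("timescale undefined for this eigenvalue")
    return -lag_ns / math.log(second_eigenvalue)


if __name__ == "__main__":
    # Hypothetical closed <-> open transition probabilities at a 1 ns lag time.
    T = [[0.996, 0.004],
         [0.006, 0.994]]
    print("implied timescale (ns):", two_state_implied_timescale(T, lag_ns=1.0))
```

With these invented numbers the second eigenvalue is 0.99, giving a timescale of roughly 99.5 ns. A binding calculation that samples only a few nanoseconds per window would therefore rarely see such a transition, which is the practical concern the abstract raises.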
Pub Date: 2024-12-09 | Epub Date: 2024-11-18 | DOI: 10.1021/acs.jcim.4c01850
Zhiyuan Zhou, Yueming Yin, Hao Han, Yiping Jia, Jun Hong Koh, Adams Wai-Kin Kong, Yuguang Mu
Protein-protein interactions (PPIs) are crucial for understanding biological processes and disease mechanisms, contributing significantly to advances in protein engineering and drug discovery. The accurate determination of binding affinities, essential for decoding PPIs, faces challenges due to the substantial time and financial costs involved in experimental and theoretical methods. This situation underscores the urgent need for more effective and precise methodologies for predicting binding affinity. Despite the abundance of research on PPI modeling, the field of quantitative binding affinity prediction remains underexplored, mainly due to a lack of comprehensive data. This study seeks to address these needs by manually curating pairwise interaction labels on available 3D structures of protein complexes with experimentally determined binding affinities, creating the largest data set to date for structure-based pairwise protein interactions with binding affinities. Subsequently, we introduce ProAffinity-GNN, a novel deep learning framework that uses a protein language model and a graph neural network (GNN) to improve the accuracy of structure-based protein-protein binding affinity prediction. The evaluation results across several benchmark test sets and an additional case study demonstrate that ProAffinity-GNN not only outperforms existing models in terms of accuracy but also shows strong generalization capabilities.
Title: ProAffinity-GNN: A Novel Approach to Structure-Based Protein-Protein Binding Affinity Prediction via a Curated Data Set and Graph Neural Networks. | Journal: Journal of Chemical Information and Modeling | Pages: 8796-8808
The discovery of small organic compounds for inducing stem cell differentiation is a time- and resource-intensive process. While data science could, in principle, streamline the discovery of these compounds, novel approaches are required due to the difficulty of acquiring training data from large numbers of example compounds. In this paper, we present the design of a new compound for inducing cardiomyocyte differentiation using simple regression models trained with a data set containing only 80 examples. We introduce decorated shape descriptors, an information-rich molecular feature representation that integrates both molecular shape and hydrophilicity information. These models demonstrate improved performance compared to ones using standard molecular descriptors based on shape alone. Model overtraining is diagnosed using a new type of sensitivity analysis. Our new compound is designed using a conservative molecular design strategy, and its effectiveness is confirmed through expression profiles of cardiomyocyte-related marker genes using real-time polymerase chain reaction experiments on human iPS cell lines. This work demonstrates a viable data-driven strategy for designing new compounds for stem cell differentiation protocols and will be useful in situations where training data is limited.
Title: Molecular Design for Cardiac Cell Differentiation Using a Small Data Set and Decorated Shape Features. | Authors: Fatemeh Etezadi, Shunichi Ito, Kosuke Yasui, Rodi Kado Abdalkader, Itsunari Minami, Motonari Uesugi, Namasivayam Ganesh Pandian, Haruko Nakano, Atsushi Nakano, Daniel M Packwood | Pub Date: 2024-12-09 | DOI: 10.1021/acs.jcim.4c01353 | Journal: Journal of Chemical Information and Modeling | Pages: 8824-8837
Pub Date: 2024-12-09 | Epub Date: 2024-11-18 | DOI: 10.1021/acs.jcim.4c01217
Hengjian Jia, Gabriele C Sosso
The blood-brain barrier (BBB) selectively regulates the passage of chemical compounds into and out of the central nervous system (CNS). As such, understanding the permeability of drug molecules through the BBB is key to treating neurological diseases and evaluating the response of the CNS to medical treatments. Within the last two decades, a diverse portfolio of machine learning (ML) models has been regularly utilized as a tool to predict, and, to a much lesser extent, understand, several functional properties of medicinal drugs, including their propensity to pass through the BBB. However, the most numerically accurate models to date lack transparency, as they typically rely on complex blends of different descriptors (or features or fingerprints), many of which are not necessarily interpretable in a straightforward fashion. In fact, the "black-box" nature of these models has prevented us from pinpointing any specific design rule to craft the next generation of pharmaceuticals that need to pass (or not) through the BBB. In this work, we have developed an ML model that leverages an uncomplicated, transparent set of descriptors to predict the permeability of drug molecules through the BBB. In addition to its simplicity, our model achieves accuracy comparable to that of state-of-the-art models. Moreover, we use a naive Bayes model as an analytical tool to provide further insights into the structure-function relation that underpins the capacity of a given drug molecule to pass through the BBB. Although our results are computational rather than experimental, we have identified several molecular fragments and functional groups that may significantly impact a drug's likelihood of permeating the BBB. This work provides a unique angle to the BBB problem and lays the foundations for future work aimed at leveraging additional transparent descriptors, potentially obtained via bespoke molecular dynamics simulations.
Title: Transparent Machine Learning Model to Understand Drug Permeability through the Blood-Brain Barrier. | Journal: Journal of Chemical Information and Modeling | Pages: 8718-8728 | Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11632763/pdf/
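The use of a naive Bayes model as an analytical tool, as described in the BBB abstract above, can be illustrated with a small Bernoulli naive Bayes written from scratch. This is a generic sketch of the technique, not the authors' model or descriptors: the function names are hypothetical, the fragment labels frag_A to frag_C are placeholders, and the toy rows are invented for illustration and do not represent data or findings from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple

Params = Dict[int, Tuple[float, List[float]]]


def fit_bernoulli_nb(X: Sequence[Sequence[int]], y: Sequence[int], alpha: float = 1.0) -> Params:
    """Fit class priors and per-feature presence probabilities with Laplace smoothing."""
    n_features = len(X[0])
    params: Params = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        if not rows:
            raise ValueError("each class needs at least one example")
        prior = len(rows) / len(X)
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        params[label] = (prior, probs)
    return params


def log_odds_per_feature(params: Params) -> List[float]:
    """log P(feature present | class 1) / P(feature present | class 0), per feature."""
    _, p0 = params[0]
    _, p1 = params[1]
    return [math.log(a / b) for a, b in zip(p1, p0)]


def predict(params: Params, x: Sequence[int]) -> int:
    """Return the class with the highest log posterior for a binary feature vector."""
    scores = {}
    for label, (prior, probs) in params.items():
        score = math.log(prior)
        for present, p in zip(x, probs):
            score += math.log(p if present else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    features = ["frag_A", "frag_B", "frag_C"]  # placeholder fragment names
    # Invented toy data: 1 = fragment present; label 1 = permeable, 0 = not permeable.
    X = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 1, 1], [0, 0, 1], [1, 0, 0]]
    y = [0, 0, 1, 1, 1, 0]
    model = fit_bernoulli_nb(X, y)
    for name, lo in zip(features, log_odds_per_feature(model)):
        print(f"{name}: log-odds {lo:+.2f}")
    print("prediction for [0, 1, 1]:", predict(model, [0, 1, 1]))
```

Because each feature contributes an independent term to the log posterior, the per-feature log-odds can be read directly as the direction and strength of a fragment's association with a class, which is what makes this family of models easy to inspect.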
Pub Date: 2024-12-08 | DOI: 10.1021/acs.jcim.4c01525
Hongxiang Li, Xuan Liu, Guangde Jiang, Huimin Zhao
Thanks to the growing interest in computer-aided synthesis planning (CASP), a wide variety of retrosynthesis and retrobiosynthesis tools have been developed in the past decades. However, synthesis planning tools for multistep chemoenzymatic reactions are still rare despite the widespread use of enzymatic reactions in chemical synthesis. Herein, we report a reaction type score (RTscore)-guided chemoenzymatic synthesis planning (RTS-CESP) strategy. Briefly, the RTscore is trained using a text-based convolutional neural network (TextCNN) to distinguish synthesis reactions from decomposition reactions and evaluate synthesis efficiency. Once multiple chemical synthesis routes are generated by a retrosynthesis tool for a target molecule, RTscore is used to rank them and find the step(s) that can be replaced by enzymatic reactions to improve synthesis efficiency. As proof of concept, RTS-CESP was applied to 10 molecules with known chemoenzymatic synthesis routes in the literature and was able to predict all of them with six being the top-ranked routes. Moreover, RTS-CESP was employed for 1000 molecules in the boutique database and was able to predict the chemoenzymatic synthesis routes for 554 molecules, outperforming ASKCOS, a state-of-the-art chemoenzymatic synthesis planning tool. Finally, RTS-CESP was used to design a new chemoenzymatic synthesis route for the FDA-approved drug Alclofenac, which was shorter than the literature-reported route and has been experimentally validated.
Title: Chemoenzymatic Synthesis Planning Guided by Reaction Type Score. Journal: Journal of Chemical Information and Modeling.
Pub Date : 2024-12-06 DOI: 10.1021/acs.jcim.4c01240
Long D Nguyen, Quang H Nguyen, Quang H Trinh, Binh P Nguyen
We present a novel molecular property prediction framework that requires only the SMILES format as input but is designed to be multimodal by incorporating predicted 3D conformer representations. Our model captures comprehensive molecular features by leveraging both the sequential character structure of SMILES and the three-dimensional spatial structure of conformers. The framework employs contrastive learning techniques, utilizing InfoNCE loss to align SMILES and conformer embeddings, along with task-specific loss functions, such as ConR for regression and SupCon for classification. To address data imbalance, a common challenge in drug discovery, we incorporate feature distribution smoothing (FDS). We evaluated the framework through multiple case studies, including SARS-CoV-2 drug docking score prediction, molecular property prediction using MoleculeNet data sets, and kinase inhibitor prediction for JAK-1, JAK-2, and MAPK-14 using custom data sets curated from PubChem. The framework consistently outperformed state-of-the-art methods, with ConR and FDS significantly improving regression tasks and SupCon enhancing classification performance. These findings highlight the flexibility and robustness of our multimodal model, demonstrating its effectiveness across diverse molecular property prediction tasks, with promising applications in drug discovery and molecular analysis.
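To make the alignment objective concrete, here is a minimal pure-Python sketch of an InfoNCE loss between paired SMILES and conformer embeddings. It is an illustration under stated assumptions, not the authors' code: it uses cosine similarity, a temperature of 0.1, and only the SMILES-to-conformer direction, none of which the abstract specifies. The function name `info_nce` and the toy embeddings are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(smiles_emb, conformer_emb, temperature=0.1):
    """Mean InfoNCE loss in the SMILES -> conformer direction.

    Row i of both lists is assumed to describe the same molecule, so
    conformer i is the positive for SMILES i and every other conformer
    in the batch acts as a negative.
    """
    s = [l2_normalize(v) for v in smiles_emb]
    c = [l2_normalize(v) for v in conformer_emb]
    total = 0.0
    for i, anchor in enumerate(s):
        logits = [dot(anchor, other) / temperature for other in c]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(s)

# Toy 2-D embeddings: "aligned" pairs each SMILES with a similar conformer,
# "swapped" pairs it with the other molecule's conformer.
smiles_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(smiles_emb, aligned)
loss_swapped = info_nce(smiles_emb, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each SMILES embedding is closest to its own conformer embedding and grows when the pairing is wrong, which is the pressure that pulls the two modalities into a shared space.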
Title: From SMILES to Enhanced Molecular Property Prediction: A Unified Multimodal Framework with Predicted 3D Conformers and Contrastive Learning Techniques. Journal: Journal of Chemical Information and Modeling.