Pub Date: 2024-12-09; Epub Date: 2024-11-24; DOI: 10.1021/acs.jcim.4c01583
Lucía Morán-González, Jørn Eirik Betten, Hannes Kneiding, David Balcells
Graphs are one of the most natural and powerful representations available for molecules; natural because they have an intuitive correspondence to skeletal formulas, the language used by chemists worldwide, and powerful, because they are highly expressive both globally (molecular topology) and locally (atom and bond properties). Graph kernels are used to transform molecular graphs into fixed-length vectors, which, based on their capacity of measuring similarity, can be used as fingerprints for machine learning (ML). To date, graph kernels have mostly focused on the atomic nodes of the graph. In this work, we developed a graph kernel based on atom-atom, bond-bond, and bond-atom (AABBA) autocorrelations. The resulting vector representations were tested on regression ML tasks on a data set of transition metal complexes; a benchmark motivated by the higher complexity of these compounds relative to organic molecules. In particular, we tested different flavors of the AABBA kernel in the prediction of the energy barriers and bond distances of the Vaska's complex data set (Friederich et al., Chem. Sci., 2020, 11, 4584). For a variety of ML models, including neural networks, gradient boosting machines, and Gaussian processes, we showed that AABBA outperforms the baseline including only atom-atom autocorrelations. Dimensionality reduction studies also showed that the bond-bond and bond-atom autocorrelations yield many of the most relevant features. We believe that the AABBA graph kernel can accelerate the exploration of large chemical spaces and inspire novel molecular representations in which both atomic and bond properties play an important role.
Title: AABBA Graph Kernel: Atom-Atom, Bond-Bond, and Bond-Atom Autocorrelations for Machine Learning. Journal of Chemical Information and Modeling, pp. 8756-8769. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11632777/pdf/
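The autocorrelation idea behind the AABBA entry above can be illustrated with a minimal sketch. It assumes the common product form of graph autocorrelation (sum of property products over pairs at a given topological depth) and builds bond-bond terms on a line graph in which bonds sharing an atom are adjacent. The abstract does not give the authors' exact definitions, so the function names, the choice of properties, and the omission of the bond-atom term are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def shortest_path_lengths(adjacency):
    """All-pairs shortest path lengths (in edges) via BFS from each node."""
    dist = {}
    for source in adjacency:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        dist[source] = seen
    return dist


def autocorrelation(adjacency, prop, max_depth):
    """For each depth d, sum prop[i] * prop[j] over unordered node pairs
    separated by exactly d edges (depth 0 pairs each node with itself)."""
    dist = shortest_path_lengths(adjacency)
    vector = [0.0] * (max_depth + 1)
    nodes = list(adjacency)
    for i in nodes:
        vector[0] += prop[i] * prop[i]
    for i, j in combinations(nodes, 2):
        d = dist[i].get(j)  # None if i and j are disconnected
        if d is not None and d <= max_depth:
            vector[d] += prop[i] * prop[j]
    return vector


def line_graph(bonds):
    """Bonds become nodes; two bonds are adjacent when they share an atom."""
    adjacency = {bond: set() for bond in bonds}
    for a, b in combinations(bonds, 2):
        if set(a) & set(b):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Toy graph for CO2 (O=C=O); the properties used here are hypothetical choices.
atom_adjacency = {0: {1}, 1: {0, 2}, 2: {1}}
atomic_number = {0: 8, 1: 6, 2: 8}
bonds = [(0, 1), (1, 2)]
bond_order = {(0, 1): 2, (1, 2): 2}

aa = autocorrelation(atom_adjacency, atomic_number, max_depth=2)
bb = autocorrelation(line_graph(bonds), bond_order, max_depth=1)
fingerprint = aa + bb  # fixed-length vector for a downstream regressor
print(fingerprint)
```

Because the vector length depends only on the chosen depths and not on molecule size, molecules of different sizes map to vectors of the same length, which is what makes such a representation usable as an ML fingerprint.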
Detecting doping agents in sports poses a significant challenge due to the continuous emergence of new prohibited substances and methods. Traditional detection methods primarily rely on targeted analysis, which is often labor-intensive and is susceptible to errors. In response, machine learning offers a transformative approach to enhancing doping screening and detection. With its powerful data analysis capabilities, machine learning enables the rapid identification of patterns and features in complex compound data, increasing both the efficiency and the accuracy of detection. Moreover, when integrated with nontargeted metabolomics, machine learning can predict unknown metabolites, aiding the discovery of long-lasting biomarkers of doping. It also excels in classifying novel compounds, thereby reducing false-negative rates. As instrumental analysis and machine learning technologies continue to advance, the development of rapid, scalable, and highly efficient doping detection methods becomes increasingly feasible, supporting the pursuit of fairness and integrity in sports competitions.
Title: The Application of Machine Learning in Doping Detection. Authors: Qingqing Yang, Wennuo Xu, Xiaodong Sun, Qin Chen, Bing Niu. Journal of Chemical Information and Modeling, pp. 8673-8683. Pub Date: 2024-12-09; DOI: 10.1021/acs.jcim.4c01234
Pub Date: 2024-12-09; Epub Date: 2024-11-19; DOI: 10.1021/acs.jcim.4c01175
Li Liang, Yunxin Duan, Chen Zeng, Boheng Wan, Huifeng Yao, Haichun Liu, Tao Lu, Yanmin Zhang, Yadong Chen, Jun Shen
Protein-ligand binding affinity prediction is a crucial and challenging task in the field of drug discovery. However, traditional simulation-based computational approaches are often prohibitively time-consuming, limiting their practical utility. In this study, we introduce a novel deep learning method, CPIScore, which leverages the capabilities of Transformer and Graph Convolutional Networks (GCN) to enhance the prediction of protein-ligand binding affinity. CPIScore utilizes the Transformer architecture to capture comprehensive global contexts of protein and ligand sequences, while the GCN component effectively extracts local features from small molecular graphs. Our results demonstrate that CPIScore surpasses both traditional machine learning and other deep learning models in accuracy, achieving a Pearson's r of 0.74 on our test set. Furthermore, CPIScore has been validated across multiple targets, proving its ability to discern inhibitors from a diverse compound library with high enrichment rates. Notably, when applied to a generated focused library of compounds, CPIScore identified six candidate small-molecule inhibitors of ATR; when these were tested experimentally, four exhibited inhibitory activity below ten nanomolar. These results highlight CPIScore's potential to significantly streamline and enhance the efficiency of drug discovery processes.
Title: CPIScore: A Deep Learning Approach for Rapid Scoring and Interpretation of Protein-Ligand Binding Interactions. Journal of Chemical Information and Modeling, pp. 8809-8823.
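The CPIScore entry above reports accuracy as a Pearson's r between predicted and measured affinities. As a reminder of what that number measures, here is a minimal pure-Python sketch; the function name and the toy values are illustrative assumptions, not data from the paper.

```python
import math


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Covariance and variances, left unnormalized because the factors cancel.
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)


# Hypothetical predicted vs. measured affinities (arbitrary units).
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
measured = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(predicted, measured)
print(r)
```

Pearson's r captures only linear association, so a model can score a high r while still being systematically offset from the measured values.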
Pub Date: 2024-12-09; Epub Date: 2024-11-21; DOI: 10.1021/acs.jcim.4c01676
Myeonghun Lee, Taehyun Park, Kyoungmin Min
In this study, we introduced Matini-Net, a versatile framework for feature engineering and automated architecture design for materials informatics research using deep neural networks. Matini-Net provides the flexibility to design feature-based models, graph-based models, and combinations of the two, accommodating both single- and multimodal model architectures. For validation, we evaluated performance on five properties from the MatBench benchmarking dataset, targeting five types of regression architectures that can be designed using Matini-Net. When applied to each of the five material property datasets, the best model among the various architectures exhibited R² > 0.84. This highlights the usefulness and flexibility of Matini-Net for accelerating materials discovery. Specifically, this framework was developed so that researchers with limited experience in deep learning can easily apply it to their research through automated feature engineering, hyperparameter tuning, and network construction. Moreover, Matini-Net improves model interpretability by performing an importance analysis of the selected features. We believe that by employing Matini-Net, machine and deep learning can be applied more easily and effectively in various types of materials research.
Title: Matini-Net: Versatile Material Informatics Research Framework for Feature Engineering and Deep Neural Network Design. Journal of Chemical Information and Modeling, pp. 8770-8783.
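The Matini-Net entry above summarizes regression quality with the coefficient of determination. A minimal sketch of that metric follows, assuming the usual definition of one minus the ratio of residual to total sum of squares; the function name and the numbers are made up for illustration and are not MatBench data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical property values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
score = r_squared(y_true, y_pred)
print(score)
```

Unlike a correlation coefficient, this score penalizes systematic offsets, and it can become negative when a model predicts worse than the mean of the targets.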
Pub Date: 2024-12-09; DOI: 10.1021/acs.jcim.4c01035
Jinyong Park, Minhi Han, Kiwoong Lee, Sungnam Park
With the advancement of deep learning (DL) methods in chemistry and materials science, the interpretability of DL models has become a critical issue in elucidating quantitative (molecular) structure-property relationships. Although attention mechanisms have been generally employed to explain the importance of molecular substructures that contribute to molecular properties, their interpretability remains limited. In this work, we introduce a versatile segmentation method and develop an interpretable subgraph attention (ISA) network with positive and negative streams (ISA-PN) to enhance the understanding of molecular structure-property relationships. The predictive performance of the ISA models was validated using data sets for aqueous solubility, lipophilicity, and melting temperature, with a particular focus on evaluating interpretability for the aqueous solubility data set. The ISA-PN model enables the quantification of the contributions of molecular substructures through positive and negative attention scores. Comparative analyses of the ISA, ISA-PN, and GC-Net (group contribution network) models demonstrate that the ISA-PN model significantly improves interpretability while maintaining similar accuracy levels. This study highlights the efficacy of the ISA-PN model in providing meaningful insights into the contributions of molecular substructures to molecular properties, thereby enhancing the interpretability of DL models in chemical applications.
Title: Hierarchical Graph Attention Network with Positive and Negative Attentions for Improved Interpretability: ISA-PN. Journal of Chemical Information and Modeling.
Pub Date: 2024-12-09; Epub Date: 2024-11-12; DOI: 10.1021/acs.jcim.4c01380
Ammaar A Saeed, Margaret A Klureza, Doeke R Hekstra
Proteins are dynamic macromolecules. Knowledge of a protein's thermally accessible conformations is critical to determining important transitions and designing therapeutics. Accessible conformations are highly constrained by a protein's structure such that concerted structural changes due to external perturbations likely track intrinsic conformational transitions. These transitions can be thought of as paths through a conformational landscape. Crystallographic drug fragment screens are high-throughput perturbation experiments, in which thousands of crystals of a drug target are soaked with small-molecule drug precursors (fragments) and examined for fragment binding, mapping potential drug binding sites on the target protein. Here, we describe an open-source Python package, COnformational LAndscape Visualization (COLAV), to infer conformational landscapes from such large-scale crystallographic perturbation studies. We apply COLAV to drug fragment screens of two medically important systems: protein tyrosine phosphatase 1B (PTP1B), which regulates insulin signaling, and the SARS CoV-2 Main Protease (MPro). With enough fragment-bound structures, we find that such drug screens enable detailed mapping of proteins' conformational landscapes.
Title: Mapping Protein Conformational Landscapes from Crystallographic Drug Fragment Screens. Journal of Chemical Information and Modeling, pp. 8937-8951. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11633654/pdf/
Pub Date: 2024-12-09; Epub Date: 2024-11-19; DOI: 10.1021/acs.jcim.4c01223
Donya Ohadi, Kiran Kumar, Suchitra Ravula, Renee L DesJarlais, Mark J Seierstad, Amy Y Shih, Michael D Hack, Jamie M Schiffer
Free energy perturbation (FEP) methodologies have become commonplace methods for modeling potency in hit-to-lead and lead optimization stages of drug discovery. The conformational states of the initial poses of compounds for FEP+ calculations are often set up by alignment to a cocrystal structure ligand, but it is not clear if this method provides the best result for all proteins or all ligands. Not only are ligand conformational states potential variables in modeling compound potency in FEP but also the selection of crystallographic water molecules for inclusion in the FEP input structures can impact FEP models. Here, we report the results of FEP calculations using FEP+ from Schrödinger and starting from maximum common substructure alignment and docked poses generated with an array of docking methodologies. As a benchmark data set, we use monoacylglycerol lipase (MAGL), an important clinical drug target in cancer malignancy, neurological diseases, and metabolic disorders, and a set of 17 MAGL inhibitors. We found a large variation among FEP+ correlations to experimental IC50 values depending on the method used to generate the input pose and that the inclusion of ligand-based information in the docking process, with some methods, increases the correlation between FEP+ free energies and IC50 values. Upon analysis of the initial poses, we found that the differences in FEP+ correlations stemmed from rotation around a tertiary amide bond as well as translation of the compound toward the more hydrophobic side of the MAGL pocket. FEP+ estimation improved across all pose modeling methods when hydrogen bond constraint information was added. However, simple maximum common substructure alignment in the presence of all crystallographic water molecules outperformed all other methods in correlation between estimated and experimental IC50 values. 
Taken together, these findings suggest that pose selection and crystallographic water inclusion greatly impact how well FEP+ estimated IC50 values align with experimental IC50 values and that modelers should benchmark a few different pose generation methodologies and different water inclusion strategies for their hit-to-lead and lead optimization drug discovery projects.
Title: Input Pose is Key to Performance of Free Energy Perturbation: Benchmarking with Monoacylglycerol Lipase. Journal of Chemical Information and Modeling, pp. 8859-8869.
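The FEP entry above compares computed free energies with experimental IC50 values. One common way to put the two on the same scale is the approximation ΔG ≈ RT ln(IC50), which assumes that IC50 is a reasonable stand-in for the dissociation constant. The abstract does not state which conversion the authors used, so the sketch below is an assumption, and the function name and inputs are hypothetical.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ic50_to_free_energy(ic50_molar, temperature_k=298.15):
    """Approximate binding free energy in kcal/mol, assuming IC50 ~ Kd."""
    return GAS_CONSTANT_KCAL * temperature_k * math.log(ic50_molar)


# Hypothetical potencies: 10 nM and 100 nM, expressed in mol/L.
dg_10nm = ic50_to_free_energy(10e-9)
dg_100nm = ic50_to_free_energy(100e-9)
print(round(dg_10nm, 2))
print(round(dg_100nm - dg_10nm, 2))
```

Under this approximation a tenfold change in IC50 corresponds to roughly 1.36 kcal/mol at 298.15 K, which gives a sense of how small a free energy error is needed to rank compounds correctly.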
Pub Date: 2024-12-09; Epub Date: 2024-11-21; DOI: 10.1021/acs.jcim.4c01554
Mingqing Liu, Xuechun Meng, Yiyang Mao, Hongqi Li, Ji Liu
Identifying drug-target interactions (DTIs) is essential for drug discovery and development. Existing deep learning approaches to DTI prediction often employ powerful feature encoders to represent drugs and targets holistically, which usually cause significant redundancy and noise by neglecting the restricted binding regions. Furthermore, many previous DTI networks ignore or simplify the complex intermolecular interaction process involving diverse binding types, which significantly limits both predictive ability and interpretability. We propose ReduMixDTI, an end-to-end model that addresses feature redundancy and explicitly captures complex local interactions for DTI prediction. In this study, drug and target features are encoded by using graph neural networks and convolutional neural networks, respectively. These features are refined from channel and spatial perspectives to enhance the representations. The proposed attention mechanism explicitly models pairwise interactions between drug and target substructures, improving the model's understanding of binding processes. In extensive comparisons with seven state-of-the-art methods, ReduMixDTI demonstrates superior performance across three benchmark data sets and external test sets reflecting real-world scenarios. Additionally, we perform comprehensive ablation studies and visualize protein attention weights to enhance the interpretability. The results confirm that ReduMixDTI serves as a robust and interpretable model for reducing feature redundancy, contributing to advances in DTI prediction.
Title: ReduMixDTI: Prediction of Drug-Target Interaction with Feature Redundancy Reduction and Interpretable Attention Mechanism. Journal of Chemical Information and Modeling, pp. 8952-8962.
Pub Date: 2024-12-09, Epub Date: 2024-11-15, DOI: 10.1021/acs.jcim.4c01781
Hengwei Chen, Jürgen Bajorath
In medicinal chemistry, compound optimization relies on the generation of analogue series (AS) for exploring structure-activity relationships (SARs). Potency progression is a critical criterion for advancing AS. During optimization, a key question is which analogues to synthesize next. We introduce a new computational methodology, reported here for the first time, for extending AS with potent compounds that contain both core structure and substituent modifications at multiple sites. The approach combines a transformer chemical language model (CLM) with a SAR matrix (SARM) methodology that identifies and organizes structurally related AS. To this end, the SARM approach was expanded to cover multisite AS. Consensus series extracted from SARMs representing a potency gradient served as input for CLM training to extend test AS with potent analogues. Different model variants were derived and investigated. Both general and fine-tuned models correctly predicted known potent analogues at high positions in probability-based compound rankings and chemically diversified AS through core structure modifications of the generated candidate compounds and substituent replacements at multiple sites.
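The probability-based ranking mentioned above can be sketched in a few lines. The SMILES strings and log-probabilities below are made-up toy values, and `rank_candidates` and `rank_of` are hypothetical helper names; the sketch only shows how one might check where a known analogue lands in a ranking of generated candidates, not how the authors' models score compounds.

```python
def rank_candidates(log_probs):
    """Order candidate SMILES from most to least probable."""
    return sorted(log_probs, key=log_probs.get, reverse=True)


def rank_of(known, ranking):
    """Return the 1-based rank of a known compound, or None if it was not generated."""
    return ranking.index(known) + 1 if known in ranking else None


# Toy model output (assumed values): candidate SMILES mapped to log-probabilities.
candidates = {
    "c1ccccc1O": -2.1,
    "c1ccccc1N": -0.7,
    "c1ccncc1O": -1.3,
}

ranking = rank_candidates(candidates)
known_rank = rank_of("c1ccncc1O", ranking)  # second most probable candidate here
```

A known potent analogue appearing near the top of such a ranking is the kind of outcome the abstract describes as a correct prediction at a high position.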
Combining a Chemical Language Model and the Structure-Activity Relationship Matrix Formalism for Generative Design of Potent Compounds with Core Structure and Substituent Modifications. Hengwei Chen, Jürgen Bajorath. Journal of Chemical Information and Modeling, 2024, pp. 8784-8795. DOI: 10.1021/acs.jcim.4c01781
Pub Date: 2024-12-09, Epub Date: 2024-11-19, DOI: 10.1021/acs.jcim.4c01420
Jonathan W Zheng, Ivo Leito, William H Green
The acid dissociation constant (pKa), which quantifies the propensity for a solute to donate a proton to its solvent, is crucial for drug design and synthesis, environmental fate studies, chemical manufacturing, and many other fields. Unfortunately, the terminology used for describing acid-base phenomena is sometimes inconsistent, creating considerable potential for misinterpretation. In this work, we examine a systematic confusion underlying the definition of "acidic" and "basic" pKa values for zwitterionic compounds. Due to this confusion, some pKa data are misrepresented in data repositories, including the widely used and highly trusted ChEMBL database. Such datasets are frequently used to supply training data for pKa prediction models, and hence, confusion and errors in the data degrade model performance. Herein, we discuss the intricacies of this issue. We make suggestions for describing acid-base phenomena, training pKa prediction models, and stewarding pKa datasets, given the high potential for confusion and potentially high impact in downstream applications.
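A small worked example may help show why labels matter for zwitterions. The sketch below uses glycine with commonly cited textbook values (pKa1 about 2.34, pKa2 about 9.60) and a standard two-step acid dissociation model; the function name `species_fractions` is hypothetical, and the example is an illustration of the general chemistry rather than a summary of the article's analysis. Both numbers are acid dissociation constants: the lower one is associated mainly with the carboxylic acid group, while the higher one describes deprotonation of the ammonium group, that is, the conjugate acid of the amine.

```python
def species_fractions(ph, pka1, pka2):
    """Fractions of cation, net-neutral form, and anion for a two-step acid.

    Model: H2A+ <-> HA <-> A-, with macroscopic constants Ka1 and Ka2.
    """
    h = 10.0 ** (-ph)
    ka1 = 10.0 ** (-pka1)
    ka2 = 10.0 ** (-pka2)
    denom = h * h + ka1 * h + ka1 * ka2
    return {
        "cation": h * h / denom,
        "neutral": ka1 * h / denom,  # for glycine, predominantly the zwitterion
        "anion": ka1 * ka2 / denom,
    }


# Textbook values for glycine (assumed here for illustration).
GLYCINE_PKA1 = 2.34
GLYCINE_PKA2 = 9.60

fractions = species_fractions(7.0, GLYCINE_PKA1, GLYCINE_PKA2)

# For this simple model the isoelectric point is the mean of the two pKa values.
isoelectric_point = (GLYCINE_PKA1 + GLYCINE_PKA2) / 2
```

At pH 7 the net-neutral form dominates, so a dataset entry that swaps which value is attached to which group would assign the wrong charge state near physiological pH.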
Widespread Misinterpretation of pKa Terminology for Zwitterionic Compounds and Its Consequences. Jonathan W Zheng, Ivo Leito, William H Green. Journal of Chemical Information and Modeling, 2024, pp. 8838-8847. DOI: 10.1021/acs.jcim.4c01420