Pub Date : 2024-06-07 DOI: 10.1021/acs.jcim.4c00825
Roshni Bhatt, David Ryan Koes and Jacob D. Durrant*
We present a novel and interpretable approach for assessing small-molecule binding using context explanation networks. Given the specific structure of a protein/ligand complex, our CENsible scoring function uses a deep convolutional neural network to predict the contributions of precalculated terms to the overall binding affinity. We show that CENsible can effectively distinguish active vs inactive compounds for many systems. Its primary benefit over related machine-learning scoring functions, however, is that it retains interpretability, allowing researchers to identify the contribution of each precalculated term to the final affinity prediction, with implications for subsequent lead optimization.
Title: CENsible: Interpretable Insights into Small-Molecule Binding with Context Explanation Networks
Journal: Journal of Chemical Information and Modeling
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11200255/pdf/
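The interpretability described in the CENsible abstract can be illustrated with a minimal sketch. It assumes the final score is a sum of per-term contributions, each the product of a network-predicted weight and a precalculated term value. The term names, weights, and values below are hypothetical placeholders, not outputs of CENsible.

```python
def term_contributions(weights, terms):
    """Return per-term contributions and their sum (the predicted score).

    weights: dict mapping term name -> weight (assumed to come from the network)
    terms:   dict mapping term name -> precalculated term value
    """
    if weights.keys() != terms.keys():
        raise ValueError("weights and terms must cover the same term names")
    contributions = {name: weights[name] * terms[name] for name in terms}
    return contributions, sum(contributions.values())


# Hypothetical numbers, chosen only to show the bookkeeping.
example_weights = {"hydrogen_bond": 0.8, "hydrophobic": 0.3, "steric_clash": -1.2}
example_terms = {"hydrogen_bond": 2.0, "hydrophobic": 5.0, "steric_clash": 0.5}

contributions, score = term_contributions(example_weights, example_terms)

# Rank terms by the magnitude of their contribution to the prediction.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Because each contribution is stored separately, a reader can see which terms drive a given prediction, which is the property the abstract highlights for lead optimization.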
Pub Date : 2024-06-06 DOI: 10.1021/acs.jcim.4c00619
Muyun Lihan and Emad Tajkhorshid*
Cholesterol (CHL) plays an integral role in modulating the function and activity of various mammalian membrane proteins. Due to the slow dynamics of lipids, conventional computational studies of protein–CHL interactions rely on either long-time-scale atomistic simulations or coarse-grained approximations to sample the process. A highly mobile membrane mimetic (HMMM) has been developed to enhance lipid diffusion and has been used to facilitate the investigation of lipid interactions with peripheral membrane proteins and, with customized in silico solvents to replace phospholipid tails, with integral membrane proteins. Here, we report an updated HMMM model, henceforth called HMMM-CHL, that is able to include CHL, a nonphospholipid component of the membrane. To this end, we had to optimize the effect of the customized solvents on CHL behavior in the membrane. Furthermore, the new solvent is compatible with simulations using force-based switching protocols. HMMM-CHL integrates both improved CHL dynamics and accelerated lipid diffusion. To test the updated model, we applied it to the characterization of protein–CHL interactions in two membrane protein systems, the human β2-adrenergic receptor (β2AR) and the mitochondrial voltage-dependent anion channel 1 (VDAC-1). Our HMMM-CHL simulations successfully identified CHL binding sites and captured detailed CHL interactions in excellent agreement with experimental data as well as other simulation results, indicating the utility of the improved model in applications where enhanced sampling of protein–CHL interactions is desired.
Title: Improved Highly Mobile Membrane Mimetic Model for Investigating Protein–Cholesterol Interactions
Journal: Journal of Chemical Information and Modeling
Pub Date : 2024-06-06 DOI: 10.1021/acs.jcim.4c00378
Cailum M. K. Stienstra, Liam Hebert, Patrick Thomas, Alexander Haack, Jason Guo and W. Scott Hopkins*
Infrared (IR) spectroscopy is an important analytical tool in various chemical and forensic domains, and a great deal of effort has gone into developing in silico methods for predicting experimental spectra. A key challenge in this regard is generating highly accurate spectra quickly to enable real-time feedback between computation and experiment. Here, we employ Graphormer, a graph neural network (GNN) transformer, to predict IR spectra using only simplified molecular-input line-entry system (SMILES) strings. Our data set includes 53,528 high-quality spectra, measured in five different experimental media (i.e., phases), for molecules containing the elements H, C, N, O, F, Si, S, P, Cl, Br, and I. When using only atomic numbers for node encodings, Graphormer-IR achieved a mean test spectral information similarity (SISμ) value of 0.8449 ± 0.0012 (n = 5), which surpasses that of the current state-of-the-art model Chemprop-IR (SISμ = 0.8409 ± 0.0014, n = 5) with only 36% of the encoded information. Augmenting node embeddings with additional node-level descriptors in learned embeddings generated through a multilayer perceptron improves scores to SISμ = 0.8523 ± 0.0006, a total improvement of 19.7σ (t = 19). These improved scores show how Graphormer-IR excels in capturing long-range interactions like hydrogen bonding, anharmonic peak positions in experimental spectra, and stretching frequencies of uncommon functional groups. Scaling our architecture to 210 attention heads demonstrates specialist-like behavior for distinct IR frequencies that improves model performance. Our model utilizes novel architectures, including a global node for phase encoding, learned node feature embeddings, and a one-dimensional (1D) smoothing convolutional neural network (CNN). Graphormer-IR’s innovations underscore its value over traditional message-passing neural networks (MPNNs) due to its expressive embeddings and ability to capture long-range intramolecular relationships.
Title: Graphormer-IR: Graph Transformers Predict Experimental IR Spectra Using Highly Specialized Attention
Journal: Journal of Chemical Information and Modeling
Pub Date : 2024-06-06 DOI: 10.1021/acs.jcim.4c00634
Jie Li, Oufan Zhang, Kunyang Sun, Yingze Wang, Xingyi Guan, Dorian Bagni, Mojtaba Haghighatlari, Fiona L Kearns, Conor Parks, Rommie E Amaro, Teresa Head-Gordon
Determining the viability of a new drug molecule is a time- and resource-intensive task that makes computer-aided assessments a vital approach to rapid drug discovery. Here we develop a machine learning algorithm, iMiner, that generates novel inhibitor molecules for target proteins by combining deep reinforcement learning with real-time 3D molecular docking using AutoDock Vina, thereby simultaneously creating chemical novelty while constraining molecules for shape and molecular compatibility with target active sites. Moreover, through the use of various types of reward functions, we have introduced novelty in generative tasks for new molecules such as chemical similarity to a target ligand, molecules grown from known protein bound fragments, and creation of molecules that enforce interactions with target residues in the protein active site. The iMiner algorithm is embedded in a composite workflow that filters out Pan-assay interference compounds, Lipinski rule violations, uncommon structures in medicinal chemistry, and poor synthetic accessibility with options for cross-validation against other docking scoring functions and automation of a molecular dynamics simulation to measure pose stability. We also allow users to define a set of rules for the structures they would like to exclude during the training process and postfiltering steps. Because our approach relies only on the structure of the target protein, iMiner can be easily adapted for the future development of other inhibitors or small molecule therapeutics of any target protein.
Title: Mining for Potent Inhibitors through Artificial Intelligence and Physics: A Unified Methodology for Ligand Based and Structure Based Drug Design.
Journal: Journal of Chemical Information and Modeling
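One of the filters named in the iMiner abstract, the check for Lipinski rule violations, can be sketched as follows. This is a minimal illustration using the standard rule-of-five thresholds. It assumes the descriptors have already been computed by some cheminformatics toolkit. The `Descriptors` class and the `max_violations` parameter are hypothetical, and the abstract does not state how many violations iMiner tolerates.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""
    mol_weight: float        # molecular weight in Da
    logp: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_filter(d: Descriptors, max_violations: int = 0) -> bool:
    """Keep a molecule only if it has at most max_violations violations."""
    return lipinski_violations(d) <= max_violations


drug_like = Descriptors(mol_weight=350.0, logp=2.5, h_bond_donors=2, h_bond_acceptors=5)
too_large = Descriptors(mol_weight=650.0, logp=6.1, h_bond_donors=2, h_bond_acceptors=5)
```

In a composite workflow of the kind the abstract describes, a check like this would sit alongside the other filters (interference compounds, uncommon structures, synthetic accessibility) rather than replace them.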
Pub Date : 2024-06-05 DOI: 10.1021/acs.jcim.4c00765
Maximilian G. Schuh, Davide Boldini* and Stephan A. Sieber*
The precise prediction of molecular properties can greatly accelerate the development of new drugs. However, in silico molecular property prediction approaches have been limited so far to assays for which large amounts of data are available. In this study, we develop a new computational approach leveraging both the textual description of the assay of interest and the chemical structure of target compounds. By combining these two sources of information via self-supervised learning, our tool can provide accurate predictions for assays where no measurements are available. Remarkably, our approach achieves state-of-the-art performance on the FS-Mol benchmark for zero-shot prediction, outperforming a wide variety of deep learning approaches. Additionally, we demonstrate how our tool can be used for tailoring screening libraries for the assay of interest, showing promising performance in a retrospective case study on a high-throughput screening campaign. By accelerating the early identification of active molecules in drug discovery and development, this method has the potential to streamline the identification of novel therapeutics.
Title: Synergizing Chemical Structures and Bioassay Descriptions for Enhanced Molecular Property Prediction in Drug Discovery
Journal: Journal of Chemical Information and Modeling
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11200265/pdf/
Pub Date : 2024-06-05 DOI: 10.1021/acs.jcim.4c00737
Xiang Du, Xinliang Sun and Min Li*
Drug repositioning is a strategy of repurposing approved drugs for treating new indications, which can accelerate the drug discovery process, reduce development costs, and lower the safety risk. The advancement of biotechnology has significantly accelerated the speed and scale of biological data generation, offering significant potential for drug repositioning through biomedical knowledge graphs that integrate diverse entities and relations from various biomedical sources. To fully learn the semantic and topological structure information from the biological knowledge graph, we propose a knowledge graph convolutional network with a heuristic search, named KGCNH, which can effectively utilize the diversity of entities and relationships in biological knowledge graphs, as well as topological structure information, to predict the associations between drugs and diseases. Specifically, we design a relation-aware attention mechanism to compute the attention scores for each neighboring entity of a given entity under different relations. Because randomness in the initial attention scores can affect model performance, and to expand the search scope of the model, we design a heuristic search module based on Gumbel-Softmax, which uses attention scores as heuristic information and introduces randomness to help the model explore better embeddings of drugs and diseases. Following this module, we derive the relation weights, obtain the embeddings of drugs and diseases through neighborhood aggregation, and then predict drug–disease associations. Additionally, we employ feature-based augmented views to enhance model robustness and mitigate overfitting. We have implemented our method and conducted experiments on two data sets. The results demonstrate that KGCNH outperforms competing methods. In particular, case studies on lithium and quetiapine confirm that KGCNH can retrieve more actual drug–disease associations in the top prediction results.
Title: Knowledge Graph Convolutional Network with Heuristic Search for Drug Repositioning
Journal: Journal of Chemical Information and Modeling
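The Gumbel-Softmax step mentioned in the KGCNH abstract can be sketched generically. The code below is the textbook Gumbel-Softmax relaxation applied to a list of attention scores, treating them as unnormalized logits, which is an assumption. The abstract does not describe how KGCNH wires this into its network, and the function name and parameters here are illustrative.

```python
import math
import random


def gumbel_softmax(scores, temperature=1.0, rng=random):
    """Draw one relaxed sample: softmax((score_i + g_i) / temperature),
    where each g_i is independent Gumbel(0, 1) noise."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for s in scores:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((s + gumbel_noise) / temperature)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention scores for three neighboring entities.
rng = random.Random(0)
sampled_weights = gumbel_softmax([2.0, 1.0, 0.1], temperature=0.5, rng=rng)
```

Lower temperatures push the sample toward a one-hot choice, while higher temperatures keep it closer to uniform. The added noise lets repeated draws occasionally favor neighbors with lower scores, which matches the exploratory role the abstract describes.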
With recent large-scale applications and validations, the relative binding free energy (RBFE) calculated using alchemical free energy methods has been proven to be an accurate measure to probe the binding of small-molecule drug candidates. On the other hand, given the flexibility of peptides, it is of great interest to find out whether sufficient sampling can be achieved within the typical time scale of such calculations and whether a similar level of accuracy can be reached for peptide drugs. However, systematic evaluations of such calculations on protein–peptide systems have rarely been reported. Most reported studies of peptides were restricted to a limited number of data points or lacked experimental support. To demonstrate the applicability of the alchemical free energy method for protein–peptide systems in a typical real-world drug discovery project, we report an application of the thermodynamic integration (TI) method to the RBFE calculation of the ghrelin receptor and its peptide agonists. Along with the calculation, the synthesis and in vitro EC50 activity of relamorelin and 17 new peptide derivatives are also reported. A cost-effective criterion to determine the data collection time is proposed for peptides in the TI simulation. The average of three TI repeats yielded a mean absolute error of 0.98 kcal/mol and a Pearson’s correlation coefficient (R) of 0.77 against the experimental free energy derived from the in vitro EC50 activity, showing good repeatability of the proposed method and slightly better agreement than the results obtained from arbitrary time frames up to 20 ns. Although it is limited by having one target and a deduced binding pose, we hope that this study can add some insights into alchemical free energy calculation of protein–peptide systems, providing theoretical assistance to the development of peptide drugs.
Title: Peptide Drug Design Using Alchemical Free Energy Calculation: An Application and Validation on Agonists of Ghrelin Receptor
Pub Date : 2024-06-05 DOI: 10.1021/acs.jcim.4c00414
Qin Zeng, Guangpeng Meng, Bingyu Zhao, Haodian Lin, Yuqing Guan, Xiaobin Qin, Yu Yuan*, Yuanbo Li* and Qiantao Wang*
Journal: Journal of Chemical Information and Modeling
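The comparison described in the ghrelin receptor abstract, computed RBFE values against experimental free energies derived from EC50, can be sketched under stated assumptions. The code below assumes the common approximation ΔΔG = RT ln(EC50_new / EC50_ref), which treats EC50 as a proxy for the binding constant, and a temperature of 298.15 K. Neither the conversion nor the temperature is specified in the abstract, and the function names are illustrative.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872041  # kcal/(mol*K)


def relative_free_energy_from_ec50(ec50_ref, ec50_new, temperature_k=298.15):
    """ddG = RT * ln(EC50_new / EC50_ref), in kcal/mol.

    A negative value means the new peptide is more potent than the reference.
    Both EC50 values must be in the same units.
    """
    return GAS_CONSTANT_KCAL * temperature_k * math.log(ec50_new / ec50_ref)


def mean_absolute_error(predicted, experimental):
    """Average absolute difference between paired values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# A tenfold loss in potency corresponds to roughly +1.36 kcal/mol at 298.15 K.
tenfold_penalty = relative_free_energy_from_ec50(ec50_ref=1.0, ec50_new=10.0)
```

Under this conversion, the reported mean absolute error of 0.98 kcal/mol corresponds to somewhat less than a tenfold error in EC50.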
Pub Date : 2024-06-05 DOI: 10.1021/acs.jcim.4c00398
Kecheng Yang* and Hongmin Liu
Lysine-specific demethylase 1 (LSD1), a highly sophisticated epigenetic regulator, orchestrates a range of critical cellular processes, holding promising therapeutic potential for treating diverse diseases. However, clinical progress in targeting LSD1 has been very slow. After 20 years of research, only one small-molecule drug, BEA-17, targeting the degradation of LSD1 and CoREST has been approved by the U.S. Food and Drug Administration. The primary reason for this may be the lack of abundant structural data regarding its intricate functions. To gain a deeper understanding of its conformational dynamics and guide the drug design process, we conducted molecular dynamics simulations to explore the conformational states of LSD1 in the apo state and under the influence of the cofactors flavin adenine dinucleotide (FAD) and CoREST. Our results showed that, across all states, the substrate binding pocket exhibited high flexibility, whereas the FAD binding pocket remained more stable. These distinct dynamical properties are essential for LSD1’s ability to bind various substrates while maintaining efficient demethylation activity. Both pockets can be enlarged by merging with adjacent pockets, although only the substrate binding pocket can shrink into smaller pockets. These new pocket shapes can inform inhibitor design, particularly for selective FAD-competitive inhibitors of LSD1, given the presence of numerous FAD-dependent enzymes in the human body. More interestingly, in the absence of FAD binding, the united substrate and FAD binding pocket is partitioned by the conserved residue Tyr761, offering valuable insights for the design of inhibitors that disrupt the crucial steric role of Tyr761 and the redox role of FAD. Additionally, we identified pockets that positively or negatively correlate with the substrate and FAD binding pockets, which can be exploited for the design of allosteric or concurrent inhibitors. Our results reveal the intricate dynamical properties of LSD1 as well as multiple novel conformational states, which deepen our understanding of its sophisticated functions and aid in the rational design of new inhibitors.
Title: Mining the Dynamical Properties of Substrate and FAD Binding Pockets of LSD1: Hints for New Inhibitor Design Direction. Authors: Kecheng Yang* and Hongmin Liu. Journal of Chemical Information and Modeling, published 2024-06-05. DOI: 10.1021/acs.jcim.4c00398
Pub Date: 2024-06-05 DOI: 10.1021/acs.jcim.4c00625
Sita Sirisha Madugula, Pranav Pujar, Bharani Nammi, Shouyi Wang, Vindi M. Jayasinghe-Arachchige, Tyler Pham, Dominic Mashburn, Maria Artiles and Jin Liu*
The recent development of CRISPR-Cas technology holds promise for correcting gene-level defects in genetic diseases. The key element of the CRISPR-Cas system is the Cas protein, a nuclease that can edit the gene of interest with the assistance of a guide RNA. However, these Cas proteins suffer from inherent limitations such as large size, low cleavage efficiency, and off-target effects, hindering their widespread application as a gene editing tool. Therefore, there is a need to identify novel Cas proteins with improved editing properties, for which it is necessary to understand the underlying features governing the Cas families. In this study, we aim to elucidate the unique protein features associated with the Cas9 and Cas12 families and identify the features distinguishing each family from non-Cas proteins. Here, we built separate Random Forest (RF) binary classifiers to distinguish Cas12 proteins and Cas9 proteins from non-Cas proteins, using the complete protein feature spectrum (13,494 features) encoding various physicochemical, topological, constitutional, and coevolutionary information on Cas proteins. Furthermore, we built multiclass RF classifiers differentiating Cas9, Cas12, and non-Cas proteins. All the models were evaluated rigorously on the test and independent data sets. The Cas12 and Cas9 binary models achieved high overall accuracies of 92% and 95%, respectively, on their independent data sets, while the multiclass classifier achieved an F1 score close to 0.98. We observed that Quasi-Sequence-Order (QSO) descriptors like Schneider.lag and Composition descriptors like charge, volume, and polarizability are predominant in the Cas12 family. Conversely, Amino Acid Composition descriptors, especially Tripeptide Composition (TPC), predominate in the Cas9 family. Four of the top 10 descriptors identified in Cas9 classification are the tripeptides PWN, PYY, HHA, and DHI, which are seen to be conserved across all Cas9 proteins and located within different catalytically important domains of the Streptococcus pyogenes Cas9 (SpCas9) structure. Among these, DHI and HHA are well-known to be involved in the DNA cleavage activity of the SpCas9 protein. Mutation studies have highlighted the significance of the PWN tripeptide in PAM recognition and DNA cleavage activity of SpCas9, while Y450 from the PYY tripeptide plays a crucial role in reducing off-target effects and improving the specificity of SpCas9. Leveraging our machine learning (ML) pipeline, we identified numerous Cas9 and Cas12 family-specific features. These features offer valuable insights for future experimental and computational studies aimed at designing Cas systems with enhanced gene-editing properties. They also suggest plausible structural modifications that can effectively guide the development of Cas proteins with improved editing capabilities.
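To make the Tripeptide Composition (TPC) descriptor mentioned above more concrete, here is a minimal Python sketch of how such a feature vector could be computed from a protein sequence. It is an illustration only, not the authors' pipeline: the function name `tripeptide_composition`, the toy sequence, and the choice to skip windows containing non-standard residues (and to normalize by the number of valid windows) are all assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tripeptide_composition(sequence):
    """Return a dict mapping each of the 8000 tripeptides to its frequency.

    Assumption for this sketch: overlapping windows that contain a
    non-standard residue are skipped, and frequencies are normalized by
    the number of valid windows.
    """
    seq = sequence.upper()
    counts = {"".join(t): 0 for t in product(AMINO_ACIDS, repeat=3)}
    valid_windows = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            valid_windows += 1
    if valid_windows == 0:
        # Sequence too short or made only of non-standard residues.
        return {tri: 0.0 for tri in counts}
    return {tri: n / valid_windows for tri, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real Cas9 fragment.
    tpc = tripeptide_composition("PWNPWN")
    print(len(tpc), tpc["PWN"])
```

A vector like this (8000 values per protein) is the kind of input a random forest classifier could consume alongside other descriptor families; the feature importances of such a model are what would single out individual tripeptides such as PWN or PYY.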
Title: Identification of Family-Specific Features in Cas9 and Cas12 Proteins: A Machine Learning Approach Using Complete Protein Feature Spectrum. Journal of Chemical Information and Modeling.
Pub Date: 2024-06-03 DOI: 10.1021/acs.jcim.4c00600
Liping Sun, Jili Hu, Yinfeng Yang, Yongkang Wang, Zijian Wang, Yong Gao, Yiqi Nie, Can Liu, Hongxing Kan
The design of nanozymes with superior catalytic activities is a prerequisite for broadening their biomedical applications. Previous studies have devoted significant effort to theoretical calculations and experimental trials aimed at enhancing the catalytic activity of nanozymes. Machine learning (ML) provides a forward-looking aid in predicting nanozyme catalytic activity. However, this requires a significant amount of human effort for data collection. In addition, the prediction accuracy urgently needs to be improved. Herein, we demonstrate that ChatGPT can collaborate with humans to efficiently collect data. We establish four qualitative models (random forest (RF), decision tree (DT), AdaBoost random forest (AdaBoost-RF), and AdaBoost decision tree (AdaBoost-DT)) for predicting nanozyme catalytic types, such as peroxidase, oxidase, catalase, superoxide dismutase, and glutathione peroxidase. Furthermore, we use five quantitative models (random forest (RF), decision tree (DT), support vector regression (SVR), gradient boosting regression (GBR), and a fully connected deep neural network (DNN)) to predict nanozyme catalytic activities. We find that the GBR model demonstrates superior prediction performance for nanozyme catalytic activities (R² = 0.6476 for Km and R² = 0.95 for Kcat). Moreover, an open-access web resource, AI-ZYMES, with a ChatGPT-based nanozyme copilot is developed for predicting nanozyme catalytic types and activities and guiding the synthesis of nanozymes. The accuracy of the nanozyme copilot's responses reaches more than 90% through retrieval-augmented generation. This study provides a new potential application for ChatGPT in the field of nanozymes.
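For readers unfamiliar with the R² values quoted above, the short Python sketch below shows how a coefficient of determination is commonly computed from measured and predicted values. The abstract does not state which R² definition was used, so the standard form 1 - SS_res/SS_tot is an assumption here, and the function name `r_squared` and the numbers in the example are purely illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    y_true: measured values (e.g. kinetic constants from the literature)
    y_pred: values predicted by a regression model
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the measurements around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all measured values are equal")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up numbers, not data from the study.
    measured = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    print(round(r_squared(measured, predicted), 4))
```

Under this definition, a value of 1 means the predictions reproduce the measurements exactly, while 0 means the model does no better than always predicting the mean of the measurements.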
Title: ChatGPT Combining Machine Learning for the Prediction of Nanozyme Catalytic Types and Activities. Journal of Chemical Information and Modeling.