Effectiveness and Efficiency: Label-Aware Hierarchical Subgraph Learning for Protein-Protein Interaction
Yuanqing Zhou, Haitao Lin, Lianghua Xie, Yufei Huang, Lirong Wu, Stan Z Li, Wei Chen
Journal of Molecular Biology, published 2024-08-03 (Journal Article)
DOI: 10.1016/j.jmb.2024.168737 (https://doi.org/10.1016/j.jmb.2024.168737)
Abstract
Protein-protein interactions (PPIs) are central to understanding biological activity, and to drug discovery and disease diagnosis. Deep learning methods, including graph neural networks (GNNs), are widely used for PPI prediction, but their performance often declines in real-world settings. Our analysis identifies the topological shortcut as one of the key causes: when PPIs are modeled as a graph with proteins as nodes and interactions as typed edges, prevailing models tend to learn patterns of node degree rather than intrinsic sequence-structure profiles. In addition, the rapid growth of PPI data drives up computational cost and strains computing devices, making these methods infeasible in practice. To address both problems, we propose a label-aware hierarchical subgraph learning method (laruGL-PPI) that infers PPIs effectively while remaining interpretable. Specifically, we introduce edge-based subgraph sampling to alleviate both topological shortcuts and high computing costs. The inner-outer connections of PPIs are modeled as a hierarchical graph, and the dependencies between interaction types are captured by a label graph. Extensive experiments on PPI datasets of various scales show that laruGL-PPI outperforms state-of-the-art PPI prediction methods, particularly when tested on unseen proteins. The model can also recognize crucial protein sites, such as surface sites for binding and active sites for catalysis.
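The abstract does not specify how edges are sampled. As a minimal sketch only, assuming uniform random sampling over interaction edges (an assumption, not the laruGL-PPI procedure), the following Python shows that an edge-induced subgraph keeps every node's degree at or below its degree in the full graph. This is one plausible reason edge sampling could weaken degree-based shortcuts while shrinking the graph processed per step. The function names and the toy proteins and interaction types are hypothetical.

```python
import random
from collections import defaultdict


def sample_edge_subgraph(edges, num_edges, seed=0):
    """Uniformly sample up to `num_edges` edges and return the subgraph they induce.

    `edges` is a list of (protein_a, protein_b, interaction_type) tuples.
    Uniform sampling is an illustrative assumption, not the paper's method.
    """
    rng = random.Random(seed)
    picked = rng.sample(edges, min(num_edges, len(edges)))
    # The node set is exactly the endpoints of the sampled edges.
    nodes = sorted({p for u, v, _ in picked for p in (u, v)})
    return nodes, picked


def degrees(edges):
    """Count how many edges touch each node."""
    deg = defaultdict(int)
    for u, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


# Toy PPI graph; protein names and interaction types are made up.
toy_edges = [
    ("P1", "P2", "binding"),
    ("P1", "P3", "catalysis"),
    ("P1", "P4", "binding"),
    ("P1", "P5", "activation"),
    ("P2", "P3", "binding"),
    ("P4", "P5", "inhibition"),
]

sub_nodes, sub_edges = sample_edge_subgraph(toy_edges, num_edges=3, seed=42)
full_deg = degrees(toy_edges)
sub_deg = degrees(sub_edges)

for node in sub_nodes:
    print(node, "full degree:", full_deg[node], "sampled degree:", sub_deg[node])
```

In the full toy graph, P1 is a hub with degree 4. In any 3-edge sample its degree is at most 3, so a model trained on sampled subgraphs sees a weaker degree signal for hub proteins than one trained on the full graph.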
About the journal:
Journal of Molecular Biology (JMB) provides high quality, comprehensive and broad coverage in all areas of molecular biology. The journal publishes original scientific research papers that provide mechanistic and functional insights and report a significant advance to the field. The journal encourages the submission of multidisciplinary studies that use complementary experimental and computational approaches to address challenging biological questions.
Research areas include but are not limited to: Biomolecular interactions, signaling networks, systems biology; Cell cycle, cell growth, cell differentiation; Cell death, autophagy; Cell signaling and regulation; Chemical biology; Computational biology, in combination with experimental studies; DNA replication, repair, and recombination; Development, regenerative biology, mechanistic and functional studies of stem cells; Epigenetics, chromatin structure and function; Gene expression; Membrane processes, cell surface proteins and cell-cell interactions; Methodological advances, both experimental and theoretical, including databases; Microbiology, virology, and interactions with the host or environment; Microbiota mechanistic and functional studies; Nuclear organization; Post-translational modifications, proteomics; Processing and function of biologically important macromolecules and complexes; Molecular basis of disease; RNA processing, structure and functions of non-coding RNAs, transcription; Sorting, spatiotemporal organization, trafficking; Structural biology; Synthetic biology; Translation, protein folding, chaperones, protein degradation and quality control.