Predicting protein complexes via the integration of multiple biological information
Xiwei Tang, Jianxin Wang, Yi Pan
2012 IEEE 6th International Conference on Systems Biology (ISB), published 2012-09-27
DOI: 10.1109/ISB.2012.6314132 (https://doi.org/10.1109/ISB.2012.6314132)
Citations: 8
Abstract
Protein complexes are a cornerstone of many biological processes; together they form the molecular machinery that performs a vast array of biological functions. The growing volume of protein-protein interaction (PPI) data has enabled a number of computational methods for predicting protein complexes. Many of these algorithms consider only the PPI data. However, PPI data from high-throughput techniques are flooded with false interactions, and these shortcomings of the PPI data significantly lower the accuracy of such methods. In the current work, we develop a novel method named CMBI to discover protein complexes by integrating multiple biological resources: gene expression profiles, essential protein information and PPI data. First, CMBI defines the functional similarity of each pair of interacting proteins based on the edge-clustering coefficient (ECC) from the PPI network and the Pearson correlation coefficient (PCC) from the gene expression data. Second, CMBI selects essential proteins as seeds to build the protein complex cores. During the growth process, a seed's essential protein neighbors and the neighbors whose functional similarity (FS) with the seed exceeds the threshold T are added to the complex core. After the complex cores are constructed, CMBI generates protein complexes by attaching to each core its direct neighbors with FS > T. In addition to the essential proteins, CMBI also uses other proteins as seeds to expand protein complexes. To check the performance of CMBI, we compare the complexes it discovers with those found by other techniques by matching the predicted complexes against reference complexes. We subsequently use GO::TermFinder to analyze the complexes predicted by the various methods. Finally, the effect of the parameter T is investigated. The results from GO functional enrichment and matching analyses show that CMBI performs significantly better than the state-of-the-art methods. This indicates that integrating multiple sources of biological information is an effective way to identify protein complexes in the PPI network.
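The core-and-attachment procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract gives no formula for FS, so the sketch takes the mean of ECC and PCC as a placeholder. ECC uses the common triangle-based definition (triangles through an edge divided by the maximum possible), which may differ from the variant used in the paper. "FS > T to the cores" is read here as FS > T with at least one core protein. All function names and the toy data are hypothetical, and the additional seeding from non-essential proteins is not modeled.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def ecc(adj, u, v):
    """Edge-clustering coefficient: triangles through (u, v) over the maximum possible."""
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return len(adj[u] & adj[v]) / possible if possible > 0 else 0.0

def fs(adj, expr, u, v):
    """ASSUMPTION: FS is the mean of ECC and PCC; the abstract does not give the formula."""
    return (ecc(adj, u, v) + pearson(expr[u], expr[v])) / 2

def build_complex(adj, expr, essential, seed, T):
    """Grow a core around one seed, then attach direct neighbors of the core."""
    # Core: the seed, its essential neighbors, and its neighbors with FS > T.
    core = {seed}
    for n in adj[seed]:
        if n in essential or fs(adj, expr, seed, n) > T:
            core.add(n)
    # Attachment (ASSUMPTION): a non-core neighbor joins if FS > T with any core protein.
    members = set(core)
    for c in core:
        for n in adj[c] - core:
            if fs(adj, expr, c, n) > T:
                members.add(n)
    return core, members

# Hypothetical toy network: edges A-B, A-C, B-C, A-D, C-E.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "E"},
    "D": {"A"},
    "E": {"C"},
}
# Hypothetical expression profiles (four time points per protein).
expr = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [1, 3, 2, 5],
    "D": [4, 3, 2, 1],
    "E": [2, 6, 4, 10],
}
essential = {"A", "C"}

core, members = build_complex(adj, expr, essential, seed="A", T=0.4)
print(sorted(core))     # ['A', 'B', 'C']
print(sorted(members))  # ['A', 'B', 'C', 'E']
```

In this toy run, B enters the core through high FS with the seed, C enters because it is essential, D is rejected (anti-correlated expression and no shared neighbors), and E is attached afterwards through its FS with the core protein C. Raising T makes both the core and the attachment step more selective, which is the effect the parameter study in the abstract refers to.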