Author: Jian-Yu Shi
Published in: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2016-12-01
DOI: 10.1109/BIBM.2016.7822733 (https://doi.org/10.1109/BIBM.2016.7822733)
The need of accelerators in analyzing biological networks
With the development of high-throughput techniques in biology and related disciplines such as chemistry and medicine, a huge number of biological entries have become available. The relationships discovered between them (e.g. interactions or associations) reveal important biological facts that are not found in individual-based biological experiments. A biological network is an appropriate tool for systematically analyzing and uncovering such facts. Relationships between biological molecules are usually modeled as a monopartite network, such as protein-protein interactions, while those between biological molecules and other objects are modeled as a bipartite network, such as chemical compound-protein interactions, gene-disease associations and ncRNA-disease associations. A biological network may contain a large number of nodes, each of which has many heterogeneous attributes in binary, real-valued and semantic forms. Because of their high computational complexity, current algorithms for systematic analysis of large-scale biological networks require either a large amount of memory or a long running time. Take the compound-protein interaction network as an example. Over 90 million compounds are available in PubChem, and each compound is characterized as a high-dimensional vector (e.g. an 881-d PubChem fingerprint or a 4860-d Klekota-Roth fingerprint). Meanwhile, a protein can be characterized as a 20^K-dimensional vector if the K-mer descriptor is adopted. However, because they involve intensive matrix manipulation (e.g. matrix factorization, inversion and tensor products), current algorithms cannot be directly applied to predict compound-protein interactions on a large scale. For example, singular value decomposition (SVD), which has complexity O(n^3), was run on a 6,000 × 6,000 matrix in MATLAB 2013b (64-bit) under Windows 7 (64-bit) with an Intel Core i7-4700MQ (2.40 GHz) and a GeForce GTX 765M. It took 81.9, 77.9 and 51.4 seconds when using the CPU only, the CPU with four workers, and the CPU plus GPU, respectively. Consequently, there is an urgent need to turn such algorithms into accelerator-enabled parallel algorithms or to develop novel accelerators to speed up knowledge mining in biological networks.
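
To make the scale argument concrete, the following is a minimal Python sketch, not the author's code. It illustrates two points from the abstract: why a K-mer descriptor over the 20 standard amino acids has 20^K dimensions, and how an O(n^3) runtime grows with matrix size. The function names `kmer_descriptor` and `cubic_time_estimate` are hypothetical. The choice of counting overlapping K-mers is an assumption, since the abstract does not specify the descriptor variant. The extrapolation assumes pure cubic scaling from the reported CPU-only measurement (81.9 s at n = 6,000), which real hardware will only approximate.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_descriptor(sequence, k):
    """Count each of the 20**k possible k-mers in a protein sequence.

    Assumption: overlapping windows are counted; other K-mer variants exist.
    """
    # Map every possible k-mer to a fixed position in the output vector.
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    counts = [0] * len(index)
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:  # windows with non-standard residues are skipped
            counts[position] += 1
    return counts


def cubic_time_estimate(n, ref_n=6000, ref_seconds=81.9):
    """Extrapolate runtime from a reference measurement, assuming pure O(n^3) scaling.

    The defaults are the CPU-only figures quoted in the abstract; the
    extrapolation itself is an illustration, not a measured result.
    """
    return ref_seconds * (n / ref_n) ** 3


if __name__ == "__main__":
    # The vector length is 20**k regardless of the sequence length.
    print(len(kmer_descriptor("MKVLAAGIV", 2)))  # 20**2 = 400
    # Doubling n multiplies the estimated runtime by 2**3 = 8.
    print(cubic_time_estimate(12000) / cubic_time_estimate(6000))
```

Under these assumptions the descriptor length grows exponentially in K (400 for K = 2, 8,000 for K = 3), and doubling the matrix size multiplies the estimated decomposition time by eight. Both effects are consistent with the abstract's argument that dense matrix methods do not scale to PubChem-sized networks without acceleration.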