DUALSIM:单机上海量图的并行子图枚举

Proceedings of the 2016 International Conference on Management of Data Pub Date : 2016-06-14 DOI:10.1145/2882903.2915209

Hyeonji Kim, Juneyoung Lee, S. Bhowmick, Wook-Shin Han, Jeong-Hoon Lee, Seongyun Ko, M. Jarrah

{"title":"DUALSIM:单机上海量图的并行子图枚举","authors":"Hyeonji Kim, Juneyoung Lee, S. Bhowmick, Wook-Shin Han, Jeong-Hoon Lee, Seongyun Ko, M. Jarrah","doi":"10.1145/2882903.2915209","DOIUrl":null,"url":null,"abstract":"Subgraph enumeration is important for many applications such as subgraph frequencies, network motif discovery, graphlet kernel computation, and studying the evolution of social networks. Most earlier work on subgraph enumeration assumes that graphs are resident in memory, which results in serious scalability problems. Recently, efforts to enumerate all subgraphs in a large-scale graph have seemed to enjoy some success by partitioning the data graph and exploiting the distributed frameworks such as MapReduce and distributed graph engines. However, we notice that all existing distributed approaches have serious performance problems for subgraph enumeration due to the explosive number of partial results. In this paper, we design and implement a disk-based, single machine parallel subgraph enumeration solution called DualSim that can handle massive graphs without maintaining exponential numbers of partial results. Specifically, we propose a novel concept of the dual approach for subgraph enumeration. The dual approach swaps the roles of the data graph and the query graph. Specifically, instead of fixing the matching order in the query and then matching data vertices, it fixes the data vertices by fixing a set of disk pages and then finds all subgraph matchings in these pages. This enables us to significantly reduce the number of disk reads. We conduct extensive experiments with various real-world graphs to systematically demonstrate the superiority of DualSim over state-of-the-art distributed subgraph enumeration methods. DualSim outperforms the state-of-the-art methods by up to orders of magnitude, while they fail for many queries due to explosive intermediate results.","PeriodicalId":20483,"journal":{"name":"Proceedings of the 2016 International Conference on Management of Data","volume":"87 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"66","resultStr":"{\"title\":\"DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine\",\"authors\":\"Hyeonji Kim, Juneyoung Lee, S. Bhowmick, Wook-Shin Han, Jeong-Hoon Lee, Seongyun Ko, M. Jarrah\",\"doi\":\"10.1145/2882903.2915209\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Subgraph enumeration is important for many applications such as subgraph frequencies, network motif discovery, graphlet kernel computation, and studying the evolution of social networks. Most earlier work on subgraph enumeration assumes that graphs are resident in memory, which results in serious scalability problems. Recently, efforts to enumerate all subgraphs in a large-scale graph have seemed to enjoy some success by partitioning the data graph and exploiting the distributed frameworks such as MapReduce and distributed graph engines. However, we notice that all existing distributed approaches have serious performance problems for subgraph enumeration due to the explosive number of partial results. In this paper, we design and implement a disk-based, single machine parallel subgraph enumeration solution called DualSim that can handle massive graphs without maintaining exponential numbers of partial results. Specifically, we propose a novel concept of the dual approach for subgraph enumeration. The dual approach swaps the roles of the data graph and the query graph. Specifically, instead of fixing the matching order in the query and then matching data vertices, it fixes the data vertices by fixing a set of disk pages and then finds all subgraph matchings in these pages. This enables us to significantly reduce the number of disk reads. We conduct extensive experiments with various real-world graphs to systematically demonstrate the superiority of DualSim over state-of-the-art distributed subgraph enumeration methods. DualSim outperforms the state-of-the-art methods by up to orders of magnitude, while they fail for many queries due to explosive intermediate results.\",\"PeriodicalId\":20483,\"journal\":{\"name\":\"Proceedings of the 2016 International Conference on Management of Data\",\"volume\":\"87 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"66\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2016 International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2882903.2915209\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2882903.2915209","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 66

摘要

子图枚举对于子图频率、网络motif发现、graphlet核计算以及研究社会网络演化等许多应用都具有重要意义。大多数关于子图枚举的早期工作都假设图驻留在内存中，这会导致严重的可伸缩性问题。最近，通过划分数据图和利用分布式框架(如MapReduce和分布式图引擎)，在大规模图中枚举所有子图的努力似乎取得了一些成功。然而，我们注意到，由于部分结果的爆炸性数量，所有现有的分布式方法在子图枚举方面都存在严重的性能问题。在本文中，我们设计并实现了一个基于磁盘的，称为DualSim的单机并行子图枚举解决方案，该解决方案可以处理大量图，而无需维护指数数量的部分结果。具体来说，我们提出了子图枚举的对偶方法的新概念。这种双重方法交换了数据图和查询图的角色。具体来说，它不是在查询中固定匹配顺序，然后匹配数据顶点，而是通过固定一组磁盘页面来固定数据顶点，然后找到这些页面中的所有子图匹配。这使我们能够显著减少磁盘读取的次数。我们对各种真实世界的图进行了广泛的实验，以系统地证明DualSim优于最先进的分布式子图枚举方法。DualSim的性能比最先进的方法高出数量级，但由于中间结果爆炸性，它们在许多查询中失败。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine

Subgraph enumeration is important for many applications such as subgraph frequencies, network motif discovery, graphlet kernel computation, and studying the evolution of social networks. Most earlier work on subgraph enumeration assumes that graphs are resident in memory, which results in serious scalability problems. Recently, efforts to enumerate all subgraphs in a large-scale graph have seemed to enjoy some success by partitioning the data graph and exploiting the distributed frameworks such as MapReduce and distributed graph engines. However, we notice that all existing distributed approaches have serious performance problems for subgraph enumeration due to the explosive number of partial results. In this paper, we design and implement a disk-based, single machine parallel subgraph enumeration solution called DualSim that can handle massive graphs without maintaining exponential numbers of partial results. Specifically, we propose a novel concept of the dual approach for subgraph enumeration. The dual approach swaps the roles of the data graph and the query graph. Specifically, instead of fixing the matching order in the query and then matching data vertices, it fixes the data vertices by fixing a set of disk pages and then finds all subgraph matchings in these pages. This enables us to significantly reduce the number of disk reads. We conduct extensive experiments with various real-world graphs to systematically demonstrate the superiority of DualSim over state-of-the-art distributed subgraph enumeration methods. DualSim outperforms the state-of-the-art methods by up to orders of magnitude, while they fail for many queries due to explosive intermediate results.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助