Bermuda: An Efficient MapReduce Triangle Listing Algorithm for Web-Scale Graphs

Proceedings of the 28th International Conference on Scientific and Statistical Database Management. Pub Date: 2016-07-18. DOI: 10.1145/2949689.2949715

Dongqing Xiao, M. Eltabakh, Xiangnan Kong



Abstract

Triangle listing plays an important role in graph analysis and has numerous graph mining applications. With the rapid growth of graph data, distributed methods for listing triangles over massive graphs are urgently needed, and the problem has therefore been studied on several distributed infrastructures, including MapReduce. However, existing algorithms generate and shuffle huge amounts of intermediate data, a large percentage of which is, interestingly, redundant. Inspired by this observation, we present "Bermuda", an efficient MapReduce-based triangle listing technique for massive graphs. Unlike existing approaches, Bermuda effectively reduces the size of the intermediate data by eliminating redundancy and sharing messages whenever possible. As a result, Bermuda achieves order-of-magnitude speedups and can process larger graphs that other techniques fail to process under the same resources. Bermuda exploits the locality of processing, i.e., knowledge of which reduce instance will process each graph vertex, to avoid generating redundant messages from mappers to reducers. Bermuda also proposes novel message-sharing techniques within each reduce instance to increase the reuse of received messages. We present and analyze several reduce-side caching strategies that dynamically learn the expected access patterns of the shared messages and adaptively deploy the appropriate technique for better sharing. Extensive experiments on real-world large-scale graphs show that Bermuda speeds up triangle listing by factors of up to 10x. Moreover, with a relatively small cluster, Bermuda scales to large datasets, e.g., the ClueWeb graph dataset (688GB), while other techniques fail to finish.
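The redundancy the abstract describes can be illustrated with a small single-process sketch. This is not the authors' implementation, and it involves no MapReduce framework. It assumes integer vertex ids, an ordering by id so that each triangle is reported once, and a hypothetical partitioner `v % num_reducers`. The function names `build_higher`, `list_triangles`, and `count_messages` are illustrative only. The sketch lists triangles and then compares how many adjacency-list messages would be shipped if one copy were sent per destination vertex versus one copy per distinct reduce instance.

```python
from collections import defaultdict
from itertools import combinations


def build_higher(edges):
    """Map each vertex to the set of its neighbors with a larger id."""
    higher = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            lo, hi = (a, b) if a < b else (b, a)
            higher[lo].add(hi)
    return higher


def list_triangles(edges):
    """Return each triangle exactly once as a tuple (u, v, w) with u < v < w."""
    higher = build_higher(edges)
    triangles = []
    for u in sorted(higher):
        # Every pair of higher neighbors of u forms a wedge v - u - w;
        # the wedge closes into a triangle if the edge (v, w) exists.
        for v, w in combinations(sorted(higher[u]), 2):
            if w in higher.get(v, ()):
                triangles.append((u, v, w))
    return triangles


def count_messages(edges, num_reducers):
    """Count adjacency-list messages under two hypothetical shipping schemes.

    per_vertex:  u sends one copy of its list to every higher neighbor.
    per_reducer: u sends one copy to each distinct reducer that hosts at
                 least one of its higher neighbors (assumed: v % num_reducers).
    """
    higher = build_higher(edges)
    per_vertex = sum(len(nbrs) for nbrs in higher.values())
    per_reducer = sum(
        len({v % num_reducers for v in nbrs}) for nbrs in higher.values()
    )
    return per_vertex, per_reducer


if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (1, 3), (3, 4), (1, 4)]
    print(list_triangles(edges))
    print(count_messages(edges, num_reducers=2))
```

On this toy graph the per-reducer scheme ships 4 messages instead of 5. The gap grows as vertex degrees become large relative to the number of reduce instances, which is the regime the abstract targets.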
