Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing

Proceedings of the 2016 International Conference on Management of Data. Pub Date: 2016-06-14. DOI: 10.1145/2882903.2882938

Zhigang Wang, Yu Gu, Y. Bao, Ge Yu, J. Yu

Citations: 30

Abstract

Graphs with billions of nodes are growing rapidly in many applications, such as online social networks. Most graph algorithms generate a large number of messages during iterative computation. Vertex-centric distributed systems usually store graph data and message data on disk to improve scalability. Currently, such systems with disk-resident data take a push-based approach to message handling. This works well if few messages reside on disk; otherwise, it is I/O-inefficient because of expensive random writes. By contrast, the existing memory-resident pull-based approach pulls messages for each vertex individually on demand. Although it can avoid disk operations for messages, it incurs expensive I/O costs through random and frequent access to vertices. This paper proposes a hybrid solution that switches adaptively between push and pull to obtain optimal performance for distributed systems with disk-resident data in different scenarios. First, we employ a new block-centric technique (b-pull) to improve the I/O performance of pulling messages, while the iterative computation remains vertex-centric. The I/O costs of data access are shifted from the receiver side, where push writes and reads messages, to the sender side, where b-pull reads graph data. Graph data are organized by clustering vertices and edges to achieve high I/O efficiency in b-pull. Second, we design a seamless switching mechanism and a performance prediction method to guarantee efficiency when switching between push and b-pull. Extensive performance studies on a broad spectrum of real-world graphs confirm the effectiveness of our proposals over existing state-of-the-art solutions.
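
The contrast between push and block-centric pull described in the abstract can be illustrated with a toy, single-process sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the graph, the block size, the PageRank-style message value, the names `push_step`, `b_pull_step`, and `choose_mode`, and especially the threshold rule in `choose_mode`, which is only a stand-in for the paper's performance prediction method. The sketch shows that both modes compute the same per-vertex result while placing the work differently: push buffers messages on the receiver side, whereas the pull variant rescans the senders' graph data once per destination block and never stores messages.

```python
from collections import defaultdict

# Toy directed graph: vertex -> out-neighbors (hypothetical data).
GRAPH = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
BLOCK_SIZE = 2  # assumed number of vertices per block


def block_of(vertex):
    """Map a vertex to its block id (simple range partitioning, an assumption)."""
    return vertex // BLOCK_SIZE


def push_step(values):
    """Push: every sender emits messages, which are buffered per receiver.

    The `inbox` stands in for the receiver-side message store that may
    spill to disk when many messages are generated.
    """
    inbox = defaultdict(list)
    for src, outs in GRAPH.items():
        for dst in outs:
            inbox[dst].append(values[src] / len(outs))
    return {v: sum(inbox[v]) for v in GRAPH}


def b_pull_step(values):
    """Block-centric pull: each destination block requests its messages.

    For one block at a time, the senders' graph data are scanned and the
    message values are combined immediately, so no message buffer is kept.
    This toy version rescans all edges per block; it does not model the
    clustered vertex/edge layout mentioned in the abstract.
    """
    new_values = {}
    for block in sorted({block_of(v) for v in GRAPH}):
        acc = defaultdict(float)
        for src, outs in GRAPH.items():  # sender-side read of graph data
            for dst in outs:
                if block_of(dst) == block:
                    acc[dst] += values[src] / len(outs)
        for v in GRAPH:
            if block_of(v) == block:
                new_values[v] = acc[v]
    return new_values


def choose_mode(num_messages, buffer_capacity):
    """Hypothetical switching rule: push while messages fit in memory."""
    return "push" if num_messages <= buffer_capacity else "b-pull"


if __name__ == "__main__":
    start = {v: 1.0 for v in GRAPH}
    print(push_step(start))
    print(b_pull_step(start))
    print(choose_mode(num_messages=6, buffer_capacity=4))
```

In this sketch the per-block rescan is the price paid for avoiding the message buffer, which mirrors the abstract's point that b-pull shifts I/O cost from the receiver side to the sender side.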
