Making pull-based graph processing performant

Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Pub Date: 2018-02-10. DOI: 10.1145/3178487.3178506

Samuel Grossman, Heiner Litz, C. Kozyrakis


Citations: 61

Abstract

Graph processing engines following either the push-based or pull-based pattern conceptually consist of a two-level nested loop structure. Parallelizing and vectorizing these loops is critical for high overall performance and memory bandwidth utilization. Outer loop parallelization is simple for both engine types but suffers from high load imbalance. This work focuses on inner loop parallelization for pull engines, which when performed naively leads to a significant increase in conflicting memory writes that must be synchronized. Our first contribution is a scheduler-aware interface for parallel loops that allows us to optimize for the common case in which each thread executes several consecutive iterations. This eliminates most write traffic and avoids all synchronization, leading to speedups of up to 50X. Our second contribution is the Vector-Sparse format, which addresses the obstacles to vectorization that stem from the commonly-used Compressed-Sparse data structure. Our new format eliminates unaligned memory accesses and bounds checks within vector operations, two common problems when processing low-degree vertices. Vectorization with Vector-Sparse leads to speedups of up to 2.5X. Our contributions are embodied in Grazelle, a hybrid graph processing framework. On a server equipped with four Intel Xeon E7-4850 v3 processors, Grazelle respectively outperforms Ligra, Polymer, GraphMat, and X-Stream by up to 15.2X, 4.6X, 4.7X, and 66.8X.
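The first contribution can be illustrated with a minimal sketch. It is not Grazelle's actual interface: it assumes an edge list sorted by destination vertex, simulates the threads sequentially, and uses hypothetical names (`pull_sum_chunked`, `merge_buffer`, `chunk_size`). Each simulated thread receives a chunk of consecutive inner-loop iterations (edges), accumulates into a local variable, and writes a destination's result only once, when the destination changes. Destinations that may straddle a chunk boundary are deferred to a small buffer and merged afterwards, so no write inside a chunk needs synchronization.

```python
def pull_sum_chunked(dst_of_edge, src_of_edge, values, num_vertices, chunk_size):
    """Pull-style aggregation: out[d] = sum of values[s] over edges (s -> d).

    Assumption: edges are sorted by destination, so every destination's
    edges are consecutive. Each chunk stands in for one thread's share
    of the inner loop; chunks are processed one after another here.
    """
    out = [0.0] * num_vertices
    merge_buffer = []  # (destination, partial sum) pairs from chunk boundaries

    for start in range(0, len(dst_of_edge), chunk_size):
        end = min(start + chunk_size, len(dst_of_edge))
        cur_dst = dst_of_edge[start]
        acc = 0.0            # thread-local accumulator, no shared writes
        at_first_dst = True  # the first destination may continue from the previous chunk

        for e in range(start, end):
            d = dst_of_edge[e]
            if d != cur_dst:
                if at_first_dst:
                    # Possibly shared with the previous chunk: defer the write.
                    merge_buffer.append((cur_dst, acc))
                    at_first_dst = False
                else:
                    # Fully contained in this chunk: one unsynchronized write.
                    out[cur_dst] = acc
                cur_dst, acc = d, 0.0
            acc += values[src_of_edge[e]]

        # The last destination may continue into the next chunk: defer it too.
        merge_buffer.append((cur_dst, acc))

    # Merge step: at most two deferred entries per chunk.
    for d, partial in merge_buffer:
        out[d] += partial
    return out


# Tiny example graph: edges (src -> dst), sorted by dst.
dst = [0, 0, 1, 1, 1, 2, 3, 3]
src = [1, 2, 0, 2, 3, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0]
result = pull_sum_chunked(dst, src, vals, num_vertices=4, chunk_size=3)
```

In this sketch the number of shared writes is one per destination plus at most two deferred entries per chunk, rather than one per edge, which is the sense in which consecutive iterations per thread reduce write traffic.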


