GraphTango: A Hybrid Representation Format for Efficient Streaming Graph Updates and Analysis

International Journal of Parallel Programming (IF 0.9; Region 4, Computer Science; JCR Q3, Computer Science, Theory & Methods). Pub Date: 2024-05-18. DOI: 10.1007/s10766-024-00768-x

Alif Ahmed, Farzana Ahmed Siddique, Kevin Skadron


Citations: 0

Abstract



GraphTango: A Hybrid Representation Format for Efficient Streaming Graph Updates and Analysis

Streaming graph processing performs batched updates and analytics on a time-evolving graph. The underlying representation format of the graph largely determines the throughput of the update and analytics phases. Existing representation formats usually employ variations of hash tables or adjacency lists. However, a recent study showed that adjacency-list-based approaches perform poorly on heavy-tailed graphs, and hash-table-based approaches suffer on short-tailed graphs. We propose GraphTango, a hybrid representation format that provides excellent update and analytics throughput regardless of the graph’s degree distribution. GraphTango dynamically switches among three different formats based on a vertex’s degree: (1) low-degree vertices store the edges directly with the neighborhood metadata, confining accesses to a single cache line, (2) medium-degree vertices use adjacency lists, and (3) high-degree vertices use hash tables as well as adjacency lists. In the high-degree case, the adjacency list provides fast traversal during the analytics phase, while the hash table provides constant-time lookups during the update phase. We further optimized the performance by designing an open-addressing-based hash table that fully utilizes every fetched cache line. In addition, we developed a thread-local lock-free memory pool that allows fast growing/shrinking of the adjacency lists and hash tables in a multi-threaded environment. We evaluated GraphTango with the help of the SAGA-Bench framework and compared it with four other representation formats: Stinger, Degree-aware Robin Hood Hashing, and two adjacency-list-based formats with different workload balancing schemes. On average, GraphTango provides 4.5x higher insertion throughput, 3.2x higher deletion throughput, and 1.1x higher analytics throughput over the next best format. Furthermore, we integrated GraphTango with the state-of-the-art graph processing frameworks DZiG and RisGraph. Compared to the vanilla DZiG and vanilla RisGraph, [GraphTango + DZiG] and [GraphTango + RisGraph] reduce the average batch processing time by 2.3x and 1.5x, respectively.
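The degree-based switching described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the thresholds `INLINE_MAX` and `LIST_MAX`, the class name `Vertex`, and its methods are hypothetical, and a Python `dict` stands in for the cache-line-aware open-addressing hash table. The sketch only shows the idea that a vertex keeps a dense neighbor list for traversal and gains a hash index for lookups once its degree is high.

```python
# Hypothetical thresholds; the abstract does not state the real cut-offs.
INLINE_MAX = 4    # up to this degree, edges sit "inline" with the vertex metadata
LIST_MAX = 16     # above this degree, a hash index is kept alongside the list


class Vertex:
    """Neighbor set of one vertex that changes format as its degree changes."""

    def __init__(self):
        self.neighbors = []  # dense list, always used for traversal
        self.index = None    # dst -> position in neighbors (high degree only)

    @property
    def mode(self):
        if self.index is not None:
            return "hash+list"
        return "inline" if len(self.neighbors) <= INLINE_MAX else "list"

    def _find(self, dst):
        if self.index is not None:
            return self.index.get(dst, -1)      # constant-time lookup
        for i, n in enumerate(self.neighbors):  # short scan at low/medium degree
            if n == dst:
                return i
        return -1

    def insert(self, dst):
        if self._find(dst) != -1:
            return False                        # edge already present
        self.neighbors.append(dst)
        if self.index is not None:
            self.index[dst] = len(self.neighbors) - 1
        elif len(self.neighbors) > LIST_MAX:
            # Promote to high degree: build the hash index over the list.
            self.index = {n: i for i, n in enumerate(self.neighbors)}
        return True

    def delete(self, dst):
        pos = self._find(dst)
        if pos == -1:
            return False
        last = self.neighbors.pop()
        if pos < len(self.neighbors):
            self.neighbors[pos] = last          # swap-with-last keeps the list dense
            if self.index is not None:
                self.index[last] = pos
        if self.index is not None:
            del self.index[dst]
            if len(self.neighbors) <= LIST_MAX:
                self.index = None               # demote: drop the hash index
        return True
```

In this sketch the analytics phase would iterate over `neighbors` directly in every mode, while the update phase calls `insert` and `delete`, which only pay for a hash lookup when the vertex is in the high-degree mode.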


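Source journal: International Journal of Parallel Programming (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 4.40

Self-citation rate: 0.00%

Publication count: (not listed)

Review time: >12 weeks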

Journal description: International Journal of Parallel Programming is a forum for the publication of peer-reviewed, high-quality original papers in the computer and information sciences, focusing specifically on programming aspects of parallel computing systems. Such systems are characterized by the coexistence over time of multiple coordinated activities. The journal publishes both original research and survey papers. Fields of interest include: linguistic foundations, conceptual frameworks, high-level languages, evaluation methods, implementation techniques, programming support systems, pragmatic considerations, architectural characteristics, software engineering aspects, advances in parallel algorithms, performance studies, and application studies.