
Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing: Latest Publications

Provably Fast and Space-Efficient Parallel Biconnectivity (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598018
Xiaojun Dong, Letong Wang, Yan Gu, Yihan Sun
We propose the first parallel biconnectivity algorithm (FAST-BCC) with optimal work, polylogarithmic span, and space efficiency. Our algorithm creates a skeleton graph based on any spanning tree of the input graph. Then we use the connectivity information of the skeleton to compute the biconnectivity of the original input. We carefully analyze the correctness of our algorithm. We implemented FAST-BCC and compared it with existing implementations, including GBBS, Slota and Madduri's algorithm, and the sequential Hopcroft-Tarjan algorithm. We tested them on a 96-core machine on 27 graphs with varying edge distributions. FAST-BCC is faster than all existing baselines on each graph.
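For readers unfamiliar with the problem, the sketch below shows the classic sequential Hopcroft-Tarjan computation that the abstract uses as a baseline: one DFS that maintains discovery and low-link values and pops biconnected components off an edge stack. It only illustrates the problem being solved; FAST-BCC itself is the parallel, skeleton-graph-based algorithm described above, and the function name and adjacency-list representation here are illustrative choices.

```python
def biconnected_components(adj):
    """Sequential Hopcroft-Tarjan: adj is an adjacency list for a simple
    undirected graph on vertices 0..n-1; returns the edge sets of the
    biconnected components. Recursive DFS, so very deep graphs would need
    an explicit stack instead."""
    n = len(adj)
    disc = [0] * n          # discovery times (0 = unvisited)
    low = [0] * n           # low-link values
    edge_stack = []
    components = []
    timer = [1]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in adj[u]:
            if disc[v] == 0:                 # tree edge
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:        # u separates v's subtree
                    comp = []
                    while True:
                        e = edge_stack.pop()
                        comp.append(e)
                        if e == (u, v):
                            break
                    components.append(comp)
            elif v != parent and disc[v] < disc[u]:   # back edge
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if disc[s] == 0:
            dfs(s, -1)
    return components

# A triangle joined to a pendant edge has two biconnected components.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
print(biconnected_components(adj))
```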
Citations: 1
Empirical Challenge for NC Theory (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598020
Ananth Hari, U. Vishkin
Horn-satisfiability or Horn-SAT is the problem of deciding whether a satisfying assignment exists for a Horn formula, a conjunction of clauses each with at most one positive literal (also known as Horn clauses). It is a well-known P-complete problem, which implies that unless P = NC, it is a hard problem to parallelize. In this paper, we empirically show that, under a known simple random model for generating the Horn formula, the ratio of hard-to-parallelize instances (closer to the worst-case behavior) is infinitesimally small. We show that the depth of a parallel algorithm for Horn-SAT is polylogarithmic on average, for almost all instances, while keeping the work linear. This challenges theoreticians and programmers to look beyond worst-case analysis and come up with practical algorithms coupled with respective performance guarantees.
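As a concrete reference point for the problem being parallelized, the sketch below is the textbook sequential decision procedure for Horn-SAT: propagate forced-true variables to a least model, then check the all-negative clauses. The clause encoding and function name are illustrative; the paper's contribution is the average-case depth analysis of a parallel algorithm, not this code.

```python
def horn_sat(clauses):
    """Each clause is (body, head): body is a frozenset of variables whose
    literals are negative, head is the single positive variable or None.
    (body, head) encodes (AND of body) -> head; an empty body is a fact,
    and head=None forbids all of the body variables from being true.
    Simple fixed-point loop (quadratic worst case; the linear-time version
    indexes clauses by their body variables)."""
    true_vars = set()
    changed = True
    while changed:
        changed = False
        for body, head in clauses:
            if head is not None and head not in true_vars and body <= true_vars:
                true_vars.add(head)
                changed = True
    violated = any(head is None and body <= true_vars for body, head in clauses)
    return (not violated), true_vars

# x and y are facts; x AND y -> z; x AND z -> False, so the formula is unsatisfiable.
clauses = [(frozenset(), "x"), (frozenset(), "y"),
           (frozenset({"x", "y"}), "z"), (frozenset({"x", "z"}), None)]
print(horn_sat(clauses))   # (False, {'x', 'y', 'z'})
```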
Citations: 0
Smarter Atomic Smart Pointers: Safe and Efficient Concurrent Memory Management (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598027
Daniel Anderson, G. Blelloch, Yuanhao Wei
We present a technique for concurrent memory management that combines the ease of use of automatic memory reclamation with the efficiency of state-of-the-art deferred reclamation algorithms. First, we combine ideas from reference counting and hazard pointers in a novel way to implement automatic concurrent reference counting with wait-free, constant-time overhead. Second, we generalize our previous algorithm to obtain a method for converting any standard manual SMR technique into an automatic reference counting technique with a similar performance profile. We have implemented the approach as a C++ library and compared it experimentally to existing atomic reference-counting libraries and state-of-the-art manual techniques. Our results indicate that our technique is faster than existing reference-counting implementations and competitive with manual memory reclamation techniques. More importantly, it is significantly safer than manual techniques since objects are reclaimed automatically.
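To make the semantics concrete, here is a toy, lock-based atomic reference-counted slot in Python. It only illustrates the load/store/release behavior such smart pointers provide and why the load-then-increment step is delicate; it is nothing like the paper's wait-free, constant-time C++ scheme, and every name in it is an illustrative invention.

```python
import threading

class Ref:
    """A manually reference-counted resource with an explicit destructor."""
    def __init__(self, value, on_free):
        self.value, self.count, self.on_free = value, 1, on_free
        self._lock = threading.Lock()

    def retain(self):
        with self._lock:
            self.count += 1

    def release(self):
        with self._lock:
            self.count -= 1
            last = (self.count == 0)
        if last:
            self.on_free(self.value)

class AtomicRcSlot:
    """A shared slot holding a Ref. The slot lock makes load/store atomic;
    because the slot itself owns one reference while it points at a Ref,
    retaining inside load() cannot race with the count reaching zero."""
    def __init__(self, ref=None):
        self._lock = threading.Lock()
        self._ref = ref

    def load(self):
        with self._lock:
            if self._ref is not None:
                self._ref.retain()
            return self._ref

    def store(self, new_ref):
        with self._lock:
            old, self._ref = self._ref, new_ref
        if old is not None:
            old.release()    # drop the slot's own reference to the old object

# slot = AtomicRcSlot(Ref("payload", on_free=lambda v: print("freed", v)))
```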
Citations: 0
Efficient Construction of Directed Hopsets and Parallel Single-source Shortest Paths (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598019
Nairen Cao, Jeremy T. Fineman, Katina Russell
The single-source shortest-path problem is as follows: given a graph with nonnegative edge weights and a designated source vertex s, return the distance from s to every other vertex. This paper presents a randomized parallel single-source shortest paths (SSSP) algorithm for directed graphs with non-negative integer edge weights that solves the exact SSSP problem in O(m) work and n^(1/2+o(1)) span, with high probability. All previous exact SSSP algorithms with nearly linear work have linear span, even for undirected unweighted graphs. To solve the exact SSSP problem, we first show a deterministic reduction from exact SSSP to directed hopsets using the iterative gradual rounding technique. A (β, ε)-hopset is a set of weighted edges, also known as shortcuts, that, when added to the graph, admit β-hop paths with weights no more than (1 + ε) times the true shortest-path distances. We show that (β, ε)-hopsets can be used to solve the exact SSSP problem in O(m) work and O(β) span. Furthermore, we present the first nearly linear-work algorithm for constructing hopsets on directed graphs. Our sequential algorithm runs in O(m) time and constructs a hopset with O(n) edges and β = n^(1/2+o(1)). We also provide a parallel version of the algorithm with O(m) work and n^(1/2+o(1)) span. The directed hopsets can be used to solve approximate SSSP problems efficiently, where the objective is to return estimates of the distances from the source vertex to every other vertex such that each estimate falls between the true distance and (1+ε) times the true distance. Specifically, for constant ε and graphs with polynomially-bounded real edge weights, there is an algorithm solving the approximate SSSP problem with O(m) work and n^(1/2+o(1)) span.
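The way a hopset is consumed can be made concrete with a β-round Bellman-Ford sketch: once the shortcut edges are added, β rounds of relaxation already see a path within a (1+ε) factor of every true distance, so only β rounds are needed instead of as many rounds as the shortest paths have hops. The code below is only that generic consumer (hopset construction, the paper's actual contribution, is not shown), and the function name and edge-list format are illustrative.

```python
import math

def beta_hop_distances(edges, n, source, beta):
    """Bellman-Ford restricted to beta rounds: dist[v] ends up as the weight
    of the lightest path from source to v that uses at most beta edges.
    If `edges` includes a (beta, eps)-hopset on top of the original graph,
    each value is at most (1 + eps) times the true distance."""
    dist = [math.inf] * n
    dist[source] = 0.0
    for _ in range(beta):
        new_dist = dist[:]                 # relax against last round's values
        for u, v, w in edges:
            if dist[u] + w < new_dist[v]:
                new_dist[v] = dist[u] + w
        if new_dist == dist:               # already converged
            break
        dist = new_dist
    return dist

# Path 0 -> 1 -> 2 -> 3 plus a "shortcut" 0 -> 3; two rounds already suffice.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 3.0)]
print(beta_hop_distances(edges, 4, 0, beta=2))   # [0.0, 1.0, 2.0, 3.0]
```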
Citations: 0
Taming Misaligned Graph Traversals in Concurrent Graph Processing (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598028
Xizhe Yin, Zhijia Zhao, Rajiv Gupta
This work introduces Glign, a runtime system that automatically aligns the graph traversals for concurrent queries. Glign introduces three levels of graph traversal alignment for iterative evaluation of concurrent queries. First, it synchronizes the accesses of different queries to the active parts of the graph within each iteration of the evaluation (intra-iteration alignment). On top of that, Glign leverages a key insight regarding the "heavy iterations" in query evaluation to achieve inter-iteration alignment and alignment-aware batching. The former aligns the iterations of different queries to increase graph access sharing, while the latter tries to group queries with better graph access sharing into the same evaluation batch. Together, these alignment techniques can substantially boost the data locality of concurrent query evaluation. Based on our experiments, Glign outperforms the state-of-the-art concurrent graph processing systems Krill and GraphM by 3.6× and 4.7× on average, respectively.
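A toy version of the first level, intra-iteration alignment, is sketched below: several BFS-style queries advance level by level in lockstep, so each active vertex's adjacency list is read once per iteration and shared by every query whose frontier contains it. This is only a cartoon of the idea; Glign's actual runtime, heavy-iteration analysis, and batching policy are what the paper describes, and the function name here is invented.

```python
def aligned_bfs(adj, sources):
    """Run one BFS per source, iterating all queries in lockstep so that each
    active vertex's neighbor list is fetched once per iteration and reused by
    every query that needs it. Returns one {vertex: level} map per query."""
    dist = [{s: 0} for s in sources]
    frontiers = [{s} for s in sources]
    level = 0
    while any(frontiers):
        level += 1
        active = set().union(*frontiers)            # union of all frontiers
        next_frontiers = [set() for _ in sources]
        for v in active:
            neighbors = adj[v]                      # one shared access to v's edges
            for q, frontier in enumerate(frontiers):
                if v in frontier:
                    for u in neighbors:
                        if u not in dist[q]:
                            dist[q][u] = level
                            next_frontiers[q].add(u)
        frontiers = next_frontiers
    return dist

# Two queries over a small path graph share every adjacency access.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(aligned_bfs(adj, sources=[0, 3]))
```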
Citations: 0
Parallel Strong Connectivity Based on Faster Reachability (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598017
Letong Wang, Xiaojun Dong, Yan Gu, Yihan Sun
In this paper, we propose a parallel strongly connected components (SCC) implementation that is efficient on a wide range of graphs. Our speedup comes from two novel techniques: vertical granularity control (VGC) and parallel hash bag.
Citations: 1
Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598025
Yiqiu Wang, Shangdi Yu, Yan Gu, Julian Shun
This paper presents new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN*). Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN*. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used in visualizing clusters of different scales that arise for both EMST and HDBSCAN*. We show that our algorithms are theoretically efficient: they have work (number of operations) matching their sequential counterparts, and polylogarithmic depth (parallel time). We implement our algorithms and propose a memory optimization that requires only a subset of well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13x-55.89x, and existing parallel algorithms by at least an order of magnitude.
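As a point of reference for what the parallel algorithms compute, the sketch below is the brute-force sequential baseline: Kruskal's algorithm with a union-find over all O(n^2) pairwise Euclidean distances. The paper's algorithms avoid exactly this quadratic edge set by using a well-separated pair decomposition; the code only pins down the EMST problem, and its names are illustrative.

```python
import math
from itertools import combinations

def emst_brute_force(points):
    """Euclidean MST by Kruskal over every pair of points (quadratic work;
    the paper's approach replaces the all-pairs edge set with a WSPD)."""
    parent = list(range(len(points)))

    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                   # joining two components: keep the edge
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

print(emst_brute_force([(0, 0), (1, 0), (5, 0), (5, 1)]))
```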
Citations: 0
Static Prediction of Parallel Computation Graphs (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598026
Stefan K. Muller
Many results in the theory of parallel scheduling, dating back to Brent's Theorem, are expressed in terms of the parallel dependency structure of a program as represented by a Directed Acyclic Graph (DAG). In the world of parallel and concurrent program analysis, such DAG models are also used to study deadlock, data races, and priority inversions, to name just a few examples. In all of these cases, it tends to be convenient to think of the DAG as a model of the program itself; we might say, for example, that the time to run a parallel program on P processors depends on the work and span of the program's DAG. This assumes that the DAG is a static, predictable property of the program. In reality, however, a DAG typically models the runtime relationships between threads during a particular execution of a program. To obtain the DAG, one might simulate an execution (or all possible executions) using some form of cost semantics, a dynamic semantics that produces the DAG as it executes the program. In fine-grained parallel programs, such as those that result from constructs such as fork/join, spawn/sync, async/finish, and futures, these DAGs tend to be especially dynamic and dependent on the features of a particular execution. For example, a divide-and-conquer algorithm implemented using fork/join parallelism may divide a certain number of times depending on the input size, and a program written with futures can choose to wait on threads or not wait on threads depending on conditions available only at runtime. Such programs are best represented by a (possibly infinite) family of DAGs, representing all possible executions of the program.
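To make the work/span vocabulary concrete, here is a toy dynamic cost semantics for series-parallel (fork/join) task expressions: evaluating the expression yields the work and span of the DAG for that particular execution, which is exactly the per-execution dynamism the abstract argues a static analysis must confront. The combinator names and encoding are illustrative choices, not anything from the paper.

```python
def work_span(task):
    """A task is an int (cost of a sequential block), ('seq', a, b) for
    sequential composition, or ('par', a, b) for a fork/join pair.
    Returns (work, span) of the DAG this one execution unfolds into."""
    if isinstance(task, int):
        return task, task
    tag, left, right = task
    w1, s1 = work_span(left)
    w2, s2 = work_span(right)
    if tag == 'seq':
        return w1 + w2, s1 + s2          # costs add along the single path
    if tag == 'par':
        return w1 + w2, max(s1, s2)      # branches overlap in time
    raise ValueError(f"unknown combinator {tag!r}")

# Fork branches of cost 4 and 6, then 2 units of sequential cleanup.
prog = ('seq', ('par', 4, 6), 2)
w, s = work_span(prog)                   # work = 12, span = 8
# Brent-style bound: time on P processors is at most w / P + s.
print(w, s)
```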
Citations: 0
CommonGraph: Graph Analytics on Evolving Data (Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598022
Mahbod Afarin, Chao Gao, Shafiur Rahman, Nael B. Abu-Ghazaleh, Rajiv Gupta
We consider the problem of graph analytics on evolving graphs. In this scenario, a query typically needs to be applied to different snapshots of the graph over an extended time window. We propose CommonGraph, an approach for efficient processing of queries on evolving graphs. We first observe that edge deletions are significantly more expensive than edge additions. CommonGraph converts all deletions to additions by finding a common graph that exists across all snapshots. After computing the query on this graph, to reach any snapshot, we simply need to add the missing edges and incrementally update the query results. CommonGraph also allows sharing of common additions among snapshots that require them, and it breaks the sequential dependency inherent in the traditional streaming approach, where snapshots are processed in sequence, enabling additional opportunities for parallelism. We incorporate the CommonGraph approach by extending the KickStarter streaming framework. CommonGraph achieves a 1.38x-8.17x improvement in performance over KickStarter across multiple benchmarks.
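The core transformation can be shown in a few lines: intersect the snapshots' edge sets to obtain the common graph, and express every snapshot as additions on top of it, so no query ever has to process a deletion. The sketch below is just that set-level view with illustrative names; the incremental query re-evaluation, work sharing, and KickStarter integration are what the paper actually builds.

```python
def common_graph(snapshots):
    """snapshots: one set of (u, v) edges per snapshot of the evolving graph.
    Returns the edges present in every snapshot, plus the per-snapshot edges
    that must be *added* back to reconstruct each snapshot, so deletions have
    been converted into additions."""
    common = set.intersection(*snapshots)
    additions = [snap - common for snap in snapshots]
    return common, additions

s1 = {(0, 1), (1, 2), (2, 3)}
s2 = {(0, 1), (2, 3)}              # edge (1, 2) was deleted
s3 = {(0, 1), (2, 3), (3, 4)}      # edge (3, 4) was added
core, deltas = common_graph([s1, s2, s3])
print(core)      # {(0, 1), (2, 3)}
print(deltas)    # [{(1, 2)}, set(), {(3, 4)}]
```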
Citations: 1
Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract)
Pub Date : 2023-07-18 DOI: 10.1145/3597635.3598031
Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pellauer, Kartik Hegde, Po-An Tsai, N. Crago, A. Jaleel, J. Owens, Edgar Solomonik, J. Emer, Christopher W. Fletcher
Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques---such as tiling---for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus lower exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT's key idea is dynamic sparsity-aware tiling. DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, to maximize accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, MatRaptor), improving their performance (by 3.3x, 5.1x, and 1.6x, respectively) while adding negligible area overhead. We apply DRT to higher-order tensor kernels to reduce DRAM traffic by 3.9x and 16.9x over a CPU implementation and prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, with an improvement of 7.29x and 2.94x in memory overhead compared to untiled sparse-sparse matrix multiplication (SpMSpM).
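The contrast the abstract draws between uniform coordinate-space tiles and sparsity-aware tiles is easy to see on a CSR matrix: the sketch below splits rows either into fixed-size row blocks or into blocks that each hold roughly the same number of nonzeros. It is only meant to illustrate that contrast; DRT's dynamic, per-region re-tiling and co-iteration machinery go far beyond this, and the function names and budget parameter are invented for the example.

```python
def uniform_tiles(row_ptr, rows_per_tile):
    """Coordinate-space tiling: a fixed number of rows per tile, no matter
    how many nonzeros each tile ends up holding."""
    n = len(row_ptr) - 1
    return [(r, min(r + rows_per_tile, n)) for r in range(0, n, rows_per_tile)]

def nnz_balanced_tiles(row_ptr, nnz_budget):
    """Sparsity-aware tiling: grow each tile until it holds about nnz_budget
    nonzeros, so tile occupancy stays high even when rows are skewed."""
    n = len(row_ptr) - 1
    tiles, start = [], 0
    while start < n:
        end = start
        while end < n and row_ptr[end + 1] - row_ptr[start] <= nnz_budget:
            end += 1
        end = max(end, start + 1)          # always make progress
        tiles.append((start, end))
        start = end
    return tiles

# CSR row pointer for 6 rows with nonzero counts 1, 1, 8, 1, 1, 0.
row_ptr = [0, 1, 2, 10, 11, 12, 12]
print(uniform_tiles(row_ptr, 2))           # [(0, 2), (2, 4), (4, 6)]
print(nnz_balanced_tiles(row_ptr, 4))      # [(0, 2), (2, 3), (3, 6)]
```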
Citations: 1