ACM Transactions on Parallel Computing最新文献_第3页

Fast Parallel Algorithms for Enumeration of Simple, Temporal, and Hop-constrained Cycles 简单、时间和跳跃约束循环枚举的快速并行算法

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2023-01-03 DOI: 10.1145/3611642

J. Blanuša, K. Atasu, P. Ienne

Cycles are one of the fundamental subgraph patterns and being able to enumerate them in graphs enables important applications in a wide variety of fields, including finance, biology, chemistry, and network science. However, to enable cycle enumeration in real-world applications, efficient parallel algorithms are required. In this work, we propose scalable parallelisation of state-of-the-art sequential algorithms for enumerating simple, temporal, and hop-constrained cycles. First, we focus on the simple cycle enumeration problem and parallelise the algorithms by Johnson and by Read and Tarjan in a fine-grained manner. We theoretically show that our resulting fine-grained parallel algorithms are scalable, with the fine-grained parallel Read-Tarjan algorithm being strongly scalable. In contrast, we show that straightforward coarse-grained parallel versions of these simple cycle enumeration algorithms that exploit edge- or vertex-level parallelism are not scalable. Next, we adapt our fine-grained approach to enable the enumeration of cycles under time-window, temporal, and hop constraints. Our evaluation on a cluster with 256 CPU cores that can execute up to 1,024 simultaneous threads demonstrates a near-linear scalability of our fine-grained parallel algorithms when enumerating cycles under the aforementioned constraints. On the same cluster, our fine-grained parallel algorithms achieve, on average, one order of magnitude speedup compared to the respective coarse-grained parallel versions of the state-of-the-art algorithms for cycle enumeration. The performance gap between the fine-grained and the coarse-grained parallel algorithms increases as we use more CPU cores.

循环是基本的子图模式之一，能够在图中枚举它们可以在各种领域中得到重要的应用，包括金融、生物、化学和网络科学。然而，要在实际应用程序中启用循环枚举，需要高效的并行算法。在这项工作中，我们提出了最先进的顺序算法的可扩展并行化，用于枚举简单的、时间的和跳跃约束的循环。首先，我们将重点放在简单循环枚举问题上，并以细粒度的方式并行化Johnson、Read和Tarjan的算法。我们从理论上证明了我们得到的细粒度并行算法是可扩展的，其中细粒度并行Read-Tarjan算法具有强可扩展性。相反，我们表明，这些利用边缘或顶点级并行性的简单循环枚举算法的直接粗粒度并行版本是不可扩展的。接下来，我们调整我们的细粒度方法，以便在时间窗口、时间和跳跃约束下枚举循环。我们对具有256个CPU内核的集群进行了评估，该集群可以同时执行1,024个线程，在上述约束下枚举周期时，我们的细粒度并行算法具有近似线性的可伸缩性。在同一个集群上，我们的细粒度并行算法与循环枚举的最先进算法的粗粒度并行版本相比，平均实现了一个数量级的加速。细粒度和粗粒度并行算法之间的性能差距随着我们使用更多的CPU内核而增加。

{"title":"Fast Parallel Algorithms for Enumeration of Simple, Temporal, and Hop-constrained Cycles","authors":"J. Blanuša, K. Atasu, P. Ienne","doi":"10.1145/3611642","DOIUrl":"https://doi.org/10.1145/3611642","url":null,"abstract":"Cycles are one of the fundamental subgraph patterns and being able to enumerate them in graphs enables important applications in a wide variety of fields, including finance, biology, chemistry, and network science. However, to enable cycle enumeration in real-world applications, efficient parallel algorithms are required. In this work, we propose scalable parallelisation of state-of-the-art sequential algorithms for enumerating simple, temporal, and hop-constrained cycles. First, we focus on the simple cycle enumeration problem and parallelise the algorithms by Johnson and by Read and Tarjan in a fine-grained manner. We theoretically show that our resulting fine-grained parallel algorithms are scalable, with the fine-grained parallel Read-Tarjan algorithm being strongly scalable. In contrast, we show that straightforward coarse-grained parallel versions of these simple cycle enumeration algorithms that exploit edge- or vertex-level parallelism are not scalable. Next, we adapt our fine-grained approach to enable the enumeration of cycles under time-window, temporal, and hop constraints. Our evaluation on a cluster with 256 CPU cores that can execute up to 1,024 simultaneous threads demonstrates a near-linear scalability of our fine-grained parallel algorithms when enumerating cycles under the aforementioned constraints. On the same cluster, our fine-grained parallel algorithms achieve, on average, one order of magnitude speedup compared to the respective coarse-grained parallel versions of the state-of-the-art algorithms for cycle enumeration. The performance gap between the fine-grained and the coarse-grained parallel algorithms increases as we use more CPU cores.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 35"},"PeriodicalIF":1.6,"publicationDate":"2023-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45204474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Parallel Minimum Cuts in O(m log2 n) Work and Low Depth O（m log2n）工作和低深度的平行最小切口

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-12-16 DOI: 10.1145/3565557

Daniel Anderson, G. Blelloch

We present a randomized O(m log2 n) work, O(polylog n) depth parallel algorithm for minimum cut. This algorithm matches the work bounds of a recent sequential algorithm by Gawrychowski, Mozes, and Weimann [ICALP’20], and improves on the previously best parallel algorithm by Geissmann and Gianinazzi [SPAA’18], which performs O(m log4 n) work in O(polylog n) depth. Our algorithm makes use of three components that might be of independent interest. Firstly, we design a parallel data structure that efficiently supports batched mixed queries and updates on trees. It generalizes and improves the work bounds of a previous data structure of Geissmann and Gianinazzi and is work efficient with respect to the best sequential algorithm. Secondly, we design a parallel algorithm for approximate minimum cut that improves on previous results by Karger and Motwani. We use this algorithm to give a work-efficient procedure to produce a tree packing, as in Karger’s sequential algorithm for minimum cuts. Lastly, we design an efficient parallel algorithm for solving the minimum 2-respecting cut problem.

我们提出了一种用于最小割的O（m log2n）工作，O（polylogn）深度并行算法。该算法与Gawrychowski、Mozes和Weimann最近的序列算法[ICALP'20]的工作边界相匹配，并改进了Geissmann和Gianinazzi之前的最佳并行算法[SPA'18]，该算法在O（polylogn）深度中执行O（m log4n）功。我们的算法使用了三个可能独立感兴趣的组件。首先，我们设计了一个并行数据结构，它有效地支持对树的批量混合查询和更新。它推广和改进了Geissmann和Gianinazzi先前数据结构的工作边界，并且相对于最佳序列算法是有效的。其次，我们设计了一个近似最小割的并行算法，该算法改进了Karger和Motwani先前的结果。我们使用这个算法来给出一个高效的生成树包装的过程，就像Karger的最小切割序列算法一样。最后，我们设计了一个有效的并行算法来解决最小2相关割问题。

引用次数: 3

Optimal Algorithms for Right-sizing Data Centers 合适规模数据中心的最优算法

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-10-11 DOI: 10.1145/3565513

S. Albers, Jens Quedenfeld

Electricity cost is a dominant and rapidly growing expense in data centers. Unfortunately, much of the consumed energy is wasted, because servers are idle for extended periods of time. We study a capacity management problem that dynamically right-sizes a data center, matching the number of active servers with the varying demand for computing capacity. We resort to a data-center optimization problem introduced by Lin, Wierman, Andrew, and Thereska [25, 27] that, over a time horizon, minimizes a combined objective function consisting of operating cost, modeled by a sequence of convex functions, and server switching cost. All prior work addresses a continuous setting in which the number of active servers, at any time, may take a fractional value. In this article, we investigate for the first time the discrete data-center optimization problem where the number of active servers, at any time, must be integer valued. Thereby, we seek truly feasible solutions. First, we show that the offline problem can be solved in polynomial time. Our algorithm relies on a new, yet intuitive graph theoretic model of the optimization problem and performs binary search in a layered graph. Second, we study the online problem and extend the algorithm Lazy Capacity Provisioning (LCP) by Lin et al. [25, 27] to the discrete setting. We prove that LCP is 3-competitive. Moreover, we show that no deterministic online algorithm can achieve a competitive ratio smaller than 3. Hence, while LCP does not attain an optimal competitiveness in the continuous setting, it does so in the discrete problem examined here. We prove that the lower bound of 3 also holds in a problem variant with more restricted operating cost functions, introduced by Lin et al. [25]. In addition, we develop a randomized online algorithm that is 2-competitive against an oblivious adversary. It is based on the algorithm of Bansal et al. [7] (a deterministic, 2-competitive algorithm for the continuous setting) and uses randomized rounding to obtain an integral solution. Moreover, we prove that 2 is a lower bound for the competitive ratio of randomized online algorithms, so our algorithm is optimal. We prove that the lower bound still holds for the more restricted model. Finally, we address the continuous setting and give a lower bound of 2 on the best competitiveness of online algorithms. This matches an upper bound by Bansal et al. [7]. A lower bound of 2 was also shown by Antoniadis and Schewior [4]. We develop an independent proof that extends to the scenario with more restricted operating cost.

电力成本是数据中心中占主导地位且快速增长的费用。不幸的是，大部分消耗的能量被浪费了，因为服务器长时间处于空闲状态。我们研究了一个容量管理问题，该问题动态地调整数据中心的大小，使活动服务器的数量与对计算能力的不同需求相匹配。我们采用Lin、Wierman、Andrew和Thereska[25,27]提出的数据中心优化问题，在一个时间范围内，最小化由一系列凸函数建模的运营成本和服务器切换成本组成的组合目标函数。所有先前的工作都针对连续设置，其中活动服务器的数量在任何时候都可以取小数。在本文中，我们首次研究离散数据中心优化问题，其中活动服务器的数量在任何时候都必须为整数值。因此，我们寻求真正可行的解决方案。首先，我们证明了离线问题可以在多项式时间内解决。我们的算法依赖于一种新的、直观的优化问题图论模型，并在分层图中执行二分搜索。其次，我们研究了在线问题，并将Lin等[25,27]的Lazy Capacity Provisioning (LCP)算法扩展到离散设置。我们证明了LCP是3竞争的。此外，我们还证明了任何确定性在线算法都无法实现小于3的竞争比。因此，虽然LCP在连续环境中不能达到最优竞争，但在这里研究的离散问题中却可以。我们证明了3的下界在Lin et al.[25]引入的具有更有限运行成本函数的问题变体中也成立。此外，我们开发了一种随机在线算法，该算法与一个无意识的对手是2竞争的。它基于Bansal等人的算法[7](连续设置的确定性，2竞争算法)，并使用随机舍入来获得积分解。此外，我们还证明了2是随机在线算法竞争比的下界，因此我们的算法是最优的。我们证明了对于更严格的模型下界仍然成立。最后，我们讨论了连续设置，并给出了在线算法最佳竞争的下界为2。这与Bansal等人的上界相匹配。Antoniadis和Schewior[4]也证明了2的下界。我们开发了一个独立的证明，扩展到具有更有限的运营成本的场景。

{"title":"Optimal Algorithms for Right-sizing Data Centers","authors":"S. Albers, Jens Quedenfeld","doi":"10.1145/3565513","DOIUrl":"https://doi.org/10.1145/3565513","url":null,"abstract":"Electricity cost is a dominant and rapidly growing expense in data centers. Unfortunately, much of the consumed energy is wasted, because servers are idle for extended periods of time. We study a capacity management problem that dynamically right-sizes a data center, matching the number of active servers with the varying demand for computing capacity. We resort to a data-center optimization problem introduced by Lin, Wierman, Andrew, and Thereska [25, 27] that, over a time horizon, minimizes a combined objective function consisting of operating cost, modeled by a sequence of convex functions, and server switching cost. All prior work addresses a continuous setting in which the number of active servers, at any time, may take a fractional value. In this article, we investigate for the first time the discrete data-center optimization problem where the number of active servers, at any time, must be integer valued. Thereby, we seek truly feasible solutions. First, we show that the offline problem can be solved in polynomial time. Our algorithm relies on a new, yet intuitive graph theoretic model of the optimization problem and performs binary search in a layered graph. Second, we study the online problem and extend the algorithm Lazy Capacity Provisioning (LCP) by Lin et al. [25, 27] to the discrete setting. We prove that LCP is 3-competitive. Moreover, we show that no deterministic online algorithm can achieve a competitive ratio smaller than 3. Hence, while LCP does not attain an optimal competitiveness in the continuous setting, it does so in the discrete problem examined here. We prove that the lower bound of 3 also holds in a problem variant with more restricted operating cost functions, introduced by Lin et al. [25]. In addition, we develop a randomized online algorithm that is 2-competitive against an oblivious adversary. It is based on the algorithm of Bansal et al. [7] (a deterministic, 2-competitive algorithm for the continuous setting) and uses randomized rounding to obtain an integral solution. Moreover, we prove that 2 is a lower bound for the competitive ratio of randomized online algorithms, so our algorithm is optimal. We prove that the lower bound still holds for the more restricted model. Finally, we address the continuous setting and give a lower bound of 2 on the best competitiveness of online algorithms. This matches an upper bound by Bansal et al. [7]. A lower bound of 2 was also shown by Antoniadis and Schewior [4]. We develop an independent proof that extends to the scenario with more restricted operating cost.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 40"},"PeriodicalIF":1.6,"publicationDate":"2022-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42464168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

A Family of Relaxed Concurrent Queues for Low-Latency Operations and Item Transfers 一类用于低延迟操作和项目传输的松弛并发队列

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-10-04 DOI: 10.1145/3565514

Giorgos Kappes, S. Anastasiadis

The producer-consumer communication over shared memory is a critical function of current scalable systems. Queues that provide low latency and high throughput on highly utilized systems can improve the overall performance perceived by the end users. In order to address this demand, we set as priority to achieve both high operation performance and item transfer speed. The Relaxed Concurrent Queues (RCQs) are a family of queues that we have designed and implemented for that purpose. Our key idea is a relaxed ordering model that splits the enqueue and dequeue operations into a stage of sequential assignment to a queue slot and a stage of concurrent execution across the slots. At each slot, we apply no order restrictions among the operations of the same type. We define several variants of the RCQ algorithms with respect to offered concurrency, required hardware instructions, supported operations, occupied memory space, and precondition handling. For specific RCQ algorithms, we provide pseudo-code definitions and reason about their correctness and progress properties. Additionally, we theoretically estimate and experimentally validate the worst-case distance between an RCQ algorithm and a strict first-in-first-out (FIFO) queue. We developed prototype implementations of the RCQ algorithms and experimentally compare them with several representative strict FIFO and relaxed data structures over a range of workload and system settings. The RCQS algorithm is a provably linearizable lock-free member of the RCQ family. We experimentally show that RCQS achieves factors to orders of magnitude advantage over the state-of-the-art strict or relaxed queue algorithms across several latency and throughput statistics of the queue operations and item transfers.

共享内存上的生产者-消费者通信是当前可扩展系统的一个关键功能。在高利用率系统上提供低延迟和高吞吐量的队列可以提高最终用户感受到的整体性能。为了满足这一需求，我们将实现高操作性能和物品传递速度作为优先事项。放松并发队列(rcq)是我们为此目的设计和实现的一系列队列。我们的关键思想是一个宽松的排序模型，该模型将排队和脱队操作分为对队列槽的顺序分配阶段和跨槽的并发执行阶段。在每个槽位，我们对同一类型的操作不施加顺序限制。我们根据提供的并发性、所需的硬件指令、支持的操作、占用的内存空间和前提条件处理定义了RCQ算法的几种变体。对于特定的RCQ算法，我们提供了伪代码定义，并解释了它们的正确性和进度属性。此外，我们从理论上估计和实验验证了RCQ算法和严格的先进先出(FIFO)队列之间的最坏情况距离。我们开发了RCQ算法的原型实现，并在一系列工作负载和系统设置下，将它们与几种具有代表性的严格FIFO和宽松数据结构进行了实验比较。RCQS算法是RCQ族中可线性化的无锁算法。我们通过实验表明，RCQS在队列操作和项目传输的几个延迟和吞吐量统计数据上，比最先进的严格或宽松队列算法实现了数量级的优势。

{"title":"A Family of Relaxed Concurrent Queues for Low-Latency Operations and Item Transfers","authors":"Giorgos Kappes, S. Anastasiadis","doi":"10.1145/3565514","DOIUrl":"https://doi.org/10.1145/3565514","url":null,"abstract":"The producer-consumer communication over shared memory is a critical function of current scalable systems. Queues that provide low latency and high throughput on highly utilized systems can improve the overall performance perceived by the end users. In order to address this demand, we set as priority to achieve both high operation performance and item transfer speed. The Relaxed Concurrent Queues (RCQs) are a family of queues that we have designed and implemented for that purpose. Our key idea is a relaxed ordering model that splits the enqueue and dequeue operations into a stage of sequential assignment to a queue slot and a stage of concurrent execution across the slots. At each slot, we apply no order restrictions among the operations of the same type. We define several variants of the RCQ algorithms with respect to offered concurrency, required hardware instructions, supported operations, occupied memory space, and precondition handling. For specific RCQ algorithms, we provide pseudo-code definitions and reason about their correctness and progress properties. Additionally, we theoretically estimate and experimentally validate the worst-case distance between an RCQ algorithm and a strict first-in-first-out (FIFO) queue. We developed prototype implementations of the RCQ algorithms and experimentally compare them with several representative strict FIFO and relaxed data structures over a range of workload and system settings. The RCQS algorithm is a provably linearizable lock-free member of the RCQ family. We experimentally show that RCQS achieves factors to orders of magnitude advantage over the state-of-the-art strict or relaxed queue algorithms across several latency and throughput statistics of the queue operations and item transfers.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 37"},"PeriodicalIF":1.6,"publicationDate":"2022-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45988660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Orthogonal Layers of Parallelism in Large-Scale Eigenvalue Computations 大规模特征值计算中的正交并行层

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-09-05 DOI: 10.1145/3614444

A. Alvermann, G. Hager, H. Fehske

We address the communication overhead of distributed sparse matrix-(multiple)-vector multiplication in the context of large-scale eigensolvers, using filter diagonalization as an example. The basis of our study is a performance model, which includes a communication metric that is computed directly from the matrix sparsity pattern without running any code. The performance model quantifies to which extent scalability and parallel efficiency are lost due to communication overhead. To restore scalability, we identify two orthogonal layers of parallelism in the filter diagonalization technique. In the horizontal layer the rows of the sparse matrix are distributed across individual processes. In the vertical layer bundles of multiple vectors are distributed across separate process groups. An analysis in terms of the communication metric predicts that scalability can be restored if, and only if, one implements the two orthogonal layers of parallelism via different distributed vector layouts. Our theoretical analysis is corroborated by benchmarks for application matrices from quantum and solid state physics, road networks, and nonlinear programming. We finally demonstrate the benefits of using orthogonal layers of parallelism with two exemplary application cases—an exciton and a strongly correlated electron system—which incur either small or large communication overhead.

我们以滤波器对角化为例，解决了大规模特征求解中分布式稀疏矩阵-(多)向量乘法的通信开销。我们研究的基础是一个性能模型，其中包括一个通信度量，该度量直接从矩阵稀疏性模式计算而不运行任何代码。性能模型量化了由于通信开销而导致的可伸缩性和并行效率损失的程度。为了恢复可扩展性，我们在滤波器对角化技术中确定了两个正交的并行层。在水平层，稀疏矩阵的行分布在各个过程之间。在垂直层中，多个向量的束分布在不同的过程组中。根据通信度量的分析预测，当且仅当通过不同的分布式矢量布局实现两个正交的并行层时，可以恢复可伸缩性。我们的理论分析得到了来自量子和固态物理、道路网络和非线性规划的应用矩阵基准的证实。最后，我们通过两个示例应用案例(激子和强相关电子系统)演示了使用正交并行层的好处，这两个示例应用会导致或小或大的通信开销。

{"title":"Orthogonal Layers of Parallelism in Large-Scale Eigenvalue Computations","authors":"A. Alvermann, G. Hager, H. Fehske","doi":"10.1145/3614444","DOIUrl":"https://doi.org/10.1145/3614444","url":null,"abstract":"We address the communication overhead of distributed sparse matrix-(multiple)-vector multiplication in the context of large-scale eigensolvers, using filter diagonalization as an example. The basis of our study is a performance model, which includes a communication metric that is computed directly from the matrix sparsity pattern without running any code. The performance model quantifies to which extent scalability and parallel efficiency are lost due to communication overhead. To restore scalability, we identify two orthogonal layers of parallelism in the filter diagonalization technique. In the horizontal layer the rows of the sparse matrix are distributed across individual processes. In the vertical layer bundles of multiple vectors are distributed across separate process groups. An analysis in terms of the communication metric predicts that scalability can be restored if, and only if, one implements the two orthogonal layers of parallelism via different distributed vector layouts. Our theoretical analysis is corroborated by benchmarks for application matrices from quantum and solid state physics, road networks, and nonlinear programming. We finally demonstrate the benefits of using orthogonal layers of parallelism with two exemplary application cases—an exciton and a strongly correlated electron system—which incur either small or large communication overhead.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 31"},"PeriodicalIF":1.6,"publicationDate":"2022-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41741959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Checkpointing Workflows à la Young/Daly Is Not Good Enough 检查点工作流<s:1>年轻/每日不够好

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-09-02 DOI: 10.1145/3548607

A. Benoit, Lucas Perotin, Y. Robert, Hongyang Sun

This article revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This article answers these questions. On the theoretical side, we prove several negative results for keeping the Young/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young/Daly period and to checkpoint more often for a wide range of application/platform settings.

本文将回顾由多个任务组成的工作流在并行平台上执行时的检查点策略。目标是最小化总执行时间的期望。对于单个任务，Young/Daly公式提供最佳检查点周期。然而，当许多任务同时执行时，其中一个任务严重延迟的风险随着任务数量的增加而增加。为了降低这种风险，可以比Young/Daly策略更频繁地检查每个任务。但是是否值得用额外的检查点来减慢每个任务的速度呢?额外的检查点是否对全局有影响?本文将回答这些问题。在理论方面，我们证明了多个任务并发执行时保持Young/Daly周期的几个负面结果，并设计了新的检查点策略，以保证高概率的有效执行。在实践方面，我们报告了全面的实验，证明了需要超越Young/Daly时期，并在广泛的应用程序/平台设置中更频繁地进行检查点。

引用次数: 2

Improving the Speed and Quality of Parallel Graph Coloring 提高并行图着色的速度和质量

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-07-11 DOI: 10.1145/3543545

Ghadeer Alabandi, Martin Burtscher

Graph coloring assigns a color to each vertex of a graph such that no two adjacent vertices get the same color. It is a key building block in many applications. In practice, solutions that require fewer distinct colors and that can be computed faster are typically preferred. Various coloring heuristics exist that provide different quality versus speed tradeoffs. The highest-quality heuristics tend to be slow. To improve performance, several parallel implementations have been proposed. This paper describes two improvements of the widely used LDF heuristic. First, we present a “shortcutting” approach to increase the parallelism by non-speculatively breaking data dependencies. Second, we present “color reduction” techniques to boost the solution of LDF. On 18 graphs from various domains, the shortcutting approach yields 2.5 times more parallelism in the mean, and the color-reduction techniques improve the result quality by up to 20%. Our deterministic CUDA implementation running on a Titan V is 2.9 times faster in the mean and uses as few or fewer colors as the best GPU codes from the literature.

图形着色为图形的每个顶点指定一种颜色，这样就不会有两个相邻的顶点获得相同的颜色。它是许多应用程序中的关键构建块。在实践中，通常优选需要较少不同颜色并且可以更快地计算的解决方案。存在提供不同质量与速度权衡的各种着色启发法。最高质量的启发式往往是缓慢的。为了提高性能，已经提出了几种并行实现。本文描述了广泛使用的LDF启发式算法的两个改进。首先，我们提出了一种“快捷”方法，通过非推测性地打破数据依赖关系来提高并行性。其次，我们提出了“颜色减少”技术来促进LDF的解决方案。在来自不同领域的18张图上，短切方法的平均并行度提高了2.5倍，颜色减少技术将结果质量提高了20%。我们在Titan V上运行的确定性CUDA实现平均速度快2.9倍，使用的颜色与文献中最好的GPU代码一样少。

引用次数: 0

Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator 一个粗粒度动态可重构多媒体加速器的设计与实现

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-07-09 DOI: 10.1145/3543544

Hung K. Nguyen, Xuan-Tu Tran

This article proposes and implements a Coarse-grained dynamically Reconfigurable Architecture, named Reconfigurable Multimedia Accelerator (REMAC). REMAC architecture is driven by the pipelined multi-instruction-multi-data execution model for exploiting multi-level parallelism of the computation-intensive loops in multimedia applications. The novel architecture of REMAC's reconfigurable processing unit (RPU) allows multiple iterations of a kernel loop can execute concurrently in the pipelining fashion by the temporal overlapping of the configuration fetch, execution, and store processes as much as possible. To address the huge bandwidth required by parallel processing units, REMAC architecture is proposed to efficiently exploit the abundant data locality in the kernel loops to decrease data access bandwidth while increase the efficiency of pipelined execution. In addition, a novel architecture of dedicated hierarchy data memory system is proposed to increase data reuse between iterations and make data always available for parallel operation of RPU. The proposed architecture was modeled at RTL using VHDL language. Several benchmark applications were mapped onto REMAC to validate the high-flexibility and high-performance of the architecture and prove that it is appropriate for a wide set of multimedia applications. The experimental results show that REMAC's performance is better than Xilinx Virtex-II, ADRES, REMUS-II, and TI C64+ DSP.

本文提出并实现了一种粗粒度的动态可重构体系结构，称为可重构多媒体加速器(Reconfigurable Multimedia Accelerator, REMAC)。REMAC架构由流水线式多指令多数据执行模型驱动，利用多媒体应用中计算密集型循环的多级并行性。REMAC的可重构处理单元(reconfigurable processing unit, RPU)的新颖架构允许内核循环的多个迭代以流水线方式并发执行，方法是尽可能多地在时间上重叠配置获取、执行和存储进程。针对并行处理单元对带宽的需求，提出了REMAC架构，有效利用内核循环中丰富的数据局部性，在降低数据访问带宽的同时提高流水线执行效率。此外，提出了一种新的专用层次数据存储系统架构，以增加迭代之间的数据重用，并使数据始终可用于RPU的并行操作。采用VHDL语言在RTL上对所提出的体系结构进行了建模。将几个基准测试应用程序映射到REMAC上，以验证该体系结构的高灵活性和高性能，并证明它适用于广泛的多媒体应用程序。实验结果表明，REMAC的性能优于Xilinx Virtex-II、ADRES、REMUS-II和TI C64+ DSP。

{"title":"Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator","authors":"Hung K. Nguyen, Xuan-Tu Tran","doi":"10.1145/3543544","DOIUrl":"https://doi.org/10.1145/3543544","url":null,"abstract":"This article proposes and implements a Coarse-grained dynamically Reconfigurable Architecture, named Reconfigurable Multimedia Accelerator (REMAC). REMAC architecture is driven by the pipelined multi-instruction-multi-data execution model for exploiting multi-level parallelism of the computation-intensive loops in multimedia applications. The novel architecture of REMAC's reconfigurable processing unit (RPU) allows multiple iterations of a kernel loop can execute concurrently in the pipelining fashion by the temporal overlapping of the configuration fetch, execution, and store processes as much as possible. To address the huge bandwidth required by parallel processing units, REMAC architecture is proposed to efficiently exploit the abundant data locality in the kernel loops to decrease data access bandwidth while increase the efficiency of pipelined execution. In addition, a novel architecture of dedicated hierarchy data memory system is proposed to increase data reuse between iterations and make data always available for parallel operation of RPU. The proposed architecture was modeled at RTL using VHDL language. Several benchmark applications were mapped onto REMAC to validate the high-flexibility and high-performance of the architecture and prove that it is appropriate for a wide set of multimedia applications. The experimental results show that REMAC's performance is better than Xilinx Virtex-II, ADRES, REMUS-II, and TI C64+ DSP.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":" ","pages":"1 - 23"},"PeriodicalIF":1.6,"publicationDate":"2022-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46137129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Multi-Interval DomLock: Toward Improving Concurrency in Hierarchies 多区间DomLock：提高层次结构中的并发性

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-07-08 DOI: 10.1145/3543543

M. A. Anju, R. Nasre

Locking has been a predominant technique depended upon for achieving thread synchronization and ensuring correctness in multi-threaded applications. It has been established that the concurrent applications working with hierarchical data witness significant benefits due to multi-granularity locking (MGL) techniques compared to either fine- or coarse-grained locking. The de facto MGL technique used in hierarchical databases is intention locks, which uses a traversal-based protocol for hierarchical locking. A recent MGL implementation, dominator-based locking (DomLock), exploits interval numbering to balance the locking cost and concurrency and outperforms intention locks for non-tree-structured hierarchies. We observe, however, that depending upon the hierarchy structure and the interval numbering, DomLock pessimistically declares subhierarchies to be locked when in reality they are not. This increases the waiting time of locks and, in turn, reduces concurrency. To address this issue, we present Multi-Interval DomLock (MID), a new technique to improve the degree of concurrency of interval-based hierarchical locking. By adding additional intervals for each node, MID helps in reducing the unnecessary lock rejections due to false-positive lock status of sub-hierarchies. Unleashing the hidden opportunities to exploit more concurrency allows the parallel threads to finish their operations quickly, leading to notable performance improvement. We also show that with sufficient number of intervals, MID can avoid all the lock rejections due to false-positive lock status of nodes. MID is general and can be applied to any arbitrary hierarchy of trees, Directed Acyclic Graphs (DAGs), and cycles. It also works with dynamic hierarchies wherein the hierarchical structure undergoes updates. We illustrate the effectiveness of MID using STMBench7 and, with extensive experimental evaluation, show that it leads to significant throughput improvement (up to 141%, average 106%) over DomLock.

在多线程应用程序中，锁定一直是实现线程同步和确保正确性所依赖的主要技术。已经确定的是，与细粒度或粗粒度锁相比，使用分层数据的并发应用程序由于使用多粒度锁(MGL)技术而获得了显著的好处。层次数据库中使用的事实上的MGL技术是意图锁，它使用基于遍历的协议进行层次锁。最近的MGL实现，基于支配者的锁定(DomLock)，利用间隔编号来平衡锁定成本和并发性，并且优于非树结构层次结构的意图锁。然而，我们观察到，根据层次结构和区间编号，DomLock悲观地声明子层次被锁定，而实际上它们没有被锁定。这增加了锁的等待时间，进而降低了并发性。为了解决这个问题，我们提出了多间隔DomLock (MID)，这是一种提高基于间隔的分层锁并发度的新技术。通过为每个节点添加额外的间隔，MID有助于减少由于子层次结构的锁状态误报而导致的不必要的锁拒绝。释放隐藏的机会来利用更多的并发性，允许并行线程快速完成它们的操作，从而显著提高性能。我们还证明了在足够的间隔数下，MID可以避免由于节点的假阳性锁定状态而导致的所有锁拒绝。MID是通用的，可以应用于任何任意层次的树、有向无环图(dag)和循环。它也适用于动态层次结构，其中层次结构经历更新。我们使用STMBench7说明了MID的有效性，并且通过广泛的实验评估，表明它比DomLock带来了显着的吞吐量改进(高达141%，平均106%)。

{"title":"Multi-Interval DomLock: Toward Improving Concurrency in Hierarchies","authors":"M. A. Anju, R. Nasre","doi":"10.1145/3543543","DOIUrl":"https://doi.org/10.1145/3543543","url":null,"abstract":"Locking has been a predominant technique depended upon for achieving thread synchronization and ensuring correctness in multi-threaded applications. It has been established that the concurrent applications working with hierarchical data witness significant benefits due to multi-granularity locking (MGL) techniques compared to either fine- or coarse-grained locking. The de facto MGL technique used in hierarchical databases is intention locks, which uses a traversal-based protocol for hierarchical locking. A recent MGL implementation, dominator-based locking (DomLock), exploits interval numbering to balance the locking cost and concurrency and outperforms intention locks for non-tree-structured hierarchies. We observe, however, that depending upon the hierarchy structure and the interval numbering, DomLock pessimistically declares subhierarchies to be locked when in reality they are not. This increases the waiting time of locks and, in turn, reduces concurrency. To address this issue, we present Multi-Interval DomLock (MID), a new technique to improve the degree of concurrency of interval-based hierarchical locking. By adding additional intervals for each node, MID helps in reducing the unnecessary lock rejections due to false-positive lock status of sub-hierarchies. Unleashing the hidden opportunities to exploit more concurrency allows the parallel threads to finish their operations quickly, leading to notable performance improvement. We also show that with sufficient number of intervals, MID can avoid all the lock rejections due to false-positive lock status of nodes. MID is general and can be applied to any arbitrary hierarchy of trees, Directed Acyclic Graphs (DAGs), and cycles. It also works with dynamic hierarchies wherein the hierarchical structure undergoes updates. We illustrate the effectiveness of MID using STMBench7 and, with extensive experimental evaluation, show that it leads to significant throughput improvement (up to 141%, average 106%) over DomLock.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"9 1","pages":"1 - 27"},"PeriodicalIF":1.6,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43227379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Simple Concurrent Connected Components Algorithms 简单并发连通组件算法

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2022-07-08 DOI: 10.1145/3543546

S. Liu, R. Tarjan

We study a class of simple algorithms for concurrently computing the connected components of an n-vertex, m-edge graph. Our algorithms are easy to implement in either the COMBINING CRCW PRAM or the MPC computing model. For two related algorithms in this class, we obtain Θ (lg n) step and Θ (m lg n) work bounds.1 For two others, we obtain O(lg2 n) step and O(m lg2 n) work bounds, which are tight for one of them. All our algorithms are simpler than related algorithms in the literature. We also point out some gaps and errors in the analysis of previous algorithms. Our results show that even a basic problem like connected components still has secrets to reveal.

我们研究了一类同时计算n顶点、m边图的连通分量的简单算法。我们的算法很容易在组合CRCW PRAM或MPC计算模型中实现。对于这类中的两个相关算法，我们获得了θ（lgn）步长和θ（mlgn）功界。1对于另外两个算法，我们得到了O（lg2n）步长和O（mlg2n）功界，它们中的一个是紧的。我们所有的算法都比文献中的相关算法简单。我们还指出了在分析以前的算法时存在的一些差距和错误。我们的研究结果表明，即使是像连接组件这样的基本问题，仍然有秘密需要揭示。

引用次数: 9