"Fast Parallel Algorithms for Enumeration of Simple, Temporal, and Hop-constrained Cycles" by J. Blanuša, K. Atasu, and P. Ienne. ACM Transactions on Parallel Computing 10(1), pp. 1–35, 2023. https://doi.org/10.1145/3611642

Cycles are one of the fundamental subgraph patterns, and being able to enumerate them in graphs enables important applications in a wide variety of fields, including finance, biology, chemistry, and network science. However, to enable cycle enumeration in real-world applications, efficient parallel algorithms are required. In this work, we propose scalable parallelisation of state-of-the-art sequential algorithms for enumerating simple, temporal, and hop-constrained cycles. First, we focus on the simple cycle enumeration problem and parallelise the algorithms by Johnson and by Read and Tarjan in a fine-grained manner. We show theoretically that the resulting fine-grained parallel algorithms are scalable, with the fine-grained parallel Read-Tarjan algorithm being strongly scalable. In contrast, we show that straightforward coarse-grained parallel versions of these simple cycle enumeration algorithms, which exploit edge- or vertex-level parallelism, are not scalable. Next, we adapt our fine-grained approach to enable the enumeration of cycles under time-window, temporal, and hop constraints. Our evaluation on a cluster with 256 CPU cores that can execute up to 1,024 simultaneous threads demonstrates near-linear scalability of our fine-grained parallel algorithms when enumerating cycles under the aforementioned constraints. On the same cluster, our fine-grained parallel algorithms achieve, on average, one order of magnitude speedup over the respective coarse-grained parallel versions of the state-of-the-art algorithms for cycle enumeration. The performance gap between the fine-grained and the coarse-grained parallel algorithms widens as we use more CPU cores.
"Parallel Minimum Cuts in O(m log² n) Work and Low Depth" by Daniel Anderson and G. Blelloch. ACM Transactions on Parallel Computing, 2022. https://doi.org/10.1145/3565557

We present a randomized O(m log² n)-work, O(polylog n)-depth parallel algorithm for minimum cut. This algorithm matches the work bound of a recent sequential algorithm by Gawrychowski, Mozes, and Weimann [ICALP'20], and improves on the previously best parallel algorithm by Geissmann and Gianinazzi [SPAA'18], which performs O(m log⁴ n) work in O(polylog n) depth. Our algorithm makes use of three components that might be of independent interest. Firstly, we design a parallel data structure that efficiently supports batched mixed queries and updates on trees. It generalizes and improves the work bounds of a previous data structure of Geissmann and Gianinazzi and is work-efficient with respect to the best sequential algorithm. Secondly, we design a parallel algorithm for approximate minimum cut that improves on previous results by Karger and Motwani. We use this algorithm to give a work-efficient procedure for producing a tree packing, as in Karger's sequential algorithm for minimum cuts. Lastly, we design an efficient parallel algorithm for solving the minimum 2-respecting cut problem.
"Optimal Algorithms for Right-sizing Data Centers" by S. Albers and Jens Quedenfeld. ACM Transactions on Parallel Computing 9(1), pp. 1–40, 2022. https://doi.org/10.1145/3565513

Electricity cost is a dominant and rapidly growing expense in data centers. Unfortunately, much of the consumed energy is wasted, because servers are idle for extended periods of time. We study a capacity management problem that dynamically right-sizes a data center, matching the number of active servers with the varying demand for computing capacity. We build on a data-center optimization problem introduced by Lin, Wierman, Andrew, and Thereska [25, 27] that, over a time horizon, minimizes a combined objective function consisting of operating cost, modeled by a sequence of convex functions, and server switching cost. All prior work addresses a continuous setting in which the number of active servers, at any time, may take a fractional value. In this article, we investigate for the first time the discrete data-center optimization problem where the number of active servers, at any time, must be integer valued. We thereby obtain truly feasible solutions. First, we show that the offline problem can be solved in polynomial time. Our algorithm relies on a new, yet intuitive graph theoretic model of the optimization problem and performs binary search in a layered graph. Second, we study the online problem and extend the algorithm Lazy Capacity Provisioning (LCP) by Lin et al. [25, 27] to the discrete setting. We prove that LCP is 3-competitive. Moreover, we show that no deterministic online algorithm can achieve a competitive ratio smaller than 3. Hence, while LCP does not attain an optimal competitiveness in the continuous setting, it does so in the discrete problem examined here. We prove that the lower bound of 3 also holds in a problem variant with more restricted operating cost functions, introduced by Lin et al. [25]. In addition, we develop a randomized online algorithm that is 2-competitive against an oblivious adversary. It is based on the algorithm of Bansal et al. [7] (a deterministic, 2-competitive algorithm for the continuous setting) and uses randomized rounding to obtain an integral solution. Moreover, we prove that 2 is a lower bound for the competitive ratio of randomized online algorithms, so our algorithm is optimal. We prove that the lower bound still holds for the more restricted model. Finally, we address the continuous setting and give a lower bound of 2 on the best competitiveness of online algorithms. This matches an upper bound by Bansal et al. [7]. A lower bound of 2 was also shown by Antoniadis and Schewior [4]. We develop an independent proof that extends to the scenario with more restricted operating cost.
"A Family of Relaxed Concurrent Queues for Low-Latency Operations and Item Transfers" by Giorgos Kappes and S. Anastasiadis. ACM Transactions on Parallel Computing 9(1), pp. 1–37, 2022. https://doi.org/10.1145/3565514

Producer-consumer communication over shared memory is a critical function of current scalable systems. Queues that provide low latency and high throughput on highly utilized systems can improve the overall performance perceived by the end users. To address this demand, we prioritize both high operation performance and high item-transfer speed. The Relaxed Concurrent Queues (RCQs) are a family of queues that we have designed and implemented for this purpose. Our key idea is a relaxed ordering model that splits the enqueue and dequeue operations into a stage of sequential assignment to a queue slot and a stage of concurrent execution across the slots. At each slot, we apply no order restrictions among the operations of the same type. We define several variants of the RCQ algorithms with respect to offered concurrency, required hardware instructions, supported operations, occupied memory space, and precondition handling. For specific RCQ algorithms, we provide pseudo-code definitions and reason about their correctness and progress properties. Additionally, we theoretically estimate and experimentally validate the worst-case distance between an RCQ algorithm and a strict first-in-first-out (FIFO) queue. We developed prototype implementations of the RCQ algorithms and experimentally compared them with several representative strict FIFO and relaxed data structures over a range of workload and system settings. The RCQS algorithm is a provably linearizable lock-free member of the RCQ family. We show experimentally that RCQS outperforms state-of-the-art strict and relaxed queue algorithms by factors up to orders of magnitude across several latency and throughput statistics of the queue operations and item transfers.
"Orthogonal Layers of Parallelism in Large-Scale Eigenvalue Computations" by A. Alvermann, G. Hager, and H. Fehske. ACM Transactions on Parallel Computing 10(1), pp. 1–31, 2022. https://doi.org/10.1145/3614444

We address the communication overhead of distributed sparse matrix-(multiple)-vector multiplication in the context of large-scale eigensolvers, using filter diagonalization as an example. The basis of our study is a performance model that includes a communication metric computed directly from the matrix sparsity pattern, without running any code. The performance model quantifies to what extent scalability and parallel efficiency are lost due to communication overhead. To restore scalability, we identify two orthogonal layers of parallelism in the filter diagonalization technique. In the horizontal layer, the rows of the sparse matrix are distributed across individual processes. In the vertical layer, bundles of multiple vectors are distributed across separate process groups. An analysis in terms of the communication metric predicts that scalability can be restored if, and only if, one implements the two orthogonal layers of parallelism via different distributed vector layouts. Our theoretical analysis is corroborated by benchmarks for application matrices from quantum and solid-state physics, road networks, and nonlinear programming. We finally demonstrate the benefits of using orthogonal layers of parallelism with two exemplary application cases—an exciton and a strongly correlated electron system—which incur either small or large communication overhead.
"Checkpointing Workflows à la Young/Daly Is Not Good Enough" by A. Benoit, Lucas Perotin, Y. Robert, and Hongyang Sun. ACM Transactions on Parallel Computing 9(1), pp. 1–25, 2022. https://doi.org/10.1145/3548607

This article revisits checkpointing strategies when workflows composed of multiple tasks execute on a parallel platform. The objective is to minimize the expectation of the total execution time. For a single task, the Young/Daly formula provides the optimal checkpointing period. However, when many tasks execute simultaneously, the risk that one of them is severely delayed increases with the number of tasks. To mitigate this risk, a possibility is to checkpoint each task more often than with the Young/Daly strategy. But is it worth slowing each task down with extra checkpoints? Does the extra checkpointing make a difference globally? This article answers these questions. On the theoretical side, we prove several negative results for keeping the Young/Daly period when many tasks execute concurrently, and we design novel checkpointing strategies that guarantee an efficient execution with high probability. On the practical side, we report comprehensive experiments that demonstrate the need to go beyond the Young/Daly period and to checkpoint more often for a wide range of application/platform settings.
"Improving the Speed and Quality of Parallel Graph Coloring" by Ghadeer Alabandi and Martin Burtscher. ACM Transactions on Parallel Computing, pp. 1–35, 2022. https://doi.org/10.1145/3543545

Graph coloring assigns a color to each vertex of a graph such that no two adjacent vertices get the same color. It is a key building block in many applications. In practice, solutions that require fewer distinct colors and that can be computed faster are typically preferred. Various coloring heuristics exist that provide different quality versus speed tradeoffs. The highest-quality heuristics tend to be slow. To improve performance, several parallel implementations have been proposed. This paper describes two improvements of the widely used largest-degree-first (LDF) heuristic. First, we present a "shortcutting" approach to increase the parallelism by non-speculatively breaking data dependencies. Second, we present "color reduction" techniques to improve the quality of LDF's solution. On 18 graphs from various domains, the shortcutting approach yields 2.5 times more parallelism in the mean, and the color-reduction techniques improve the result quality by up to 20%. Our deterministic CUDA implementation running on a Titan V is 2.9 times faster in the mean and uses as few or fewer colors as the best GPU codes from the literature.
"Design and Implementation of a Coarse-grained Dynamically Reconfigurable Multimedia Accelerator" by Hung K. Nguyen and Xuan-Tu Tran. ACM Transactions on Parallel Computing, pp. 1–23, 2022. https://doi.org/10.1145/3543544

This article proposes and implements a coarse-grained dynamically reconfigurable architecture named Reconfigurable Multimedia Accelerator (REMAC). The REMAC architecture is driven by a pipelined multi-instruction-multi-data execution model that exploits the multi-level parallelism of the computation-intensive loops in multimedia applications. The novel architecture of REMAC's reconfigurable processing unit (RPU) allows multiple iterations of a kernel loop to execute concurrently in pipelined fashion by temporally overlapping the configuration-fetch, execution, and store processes as much as possible. To address the huge bandwidth required by the parallel processing units, REMAC efficiently exploits the abundant data locality in kernel loops, which decreases the data-access bandwidth while increasing the efficiency of pipelined execution. In addition, a novel dedicated hierarchical data-memory system is proposed to increase data reuse between iterations and to keep data always available for the parallel operation of the RPU. The proposed architecture was modeled at register-transfer level in VHDL. Several benchmark applications were mapped onto REMAC to validate the flexibility and performance of the architecture and to show that it is appropriate for a wide range of multimedia applications. The experimental results show that REMAC outperforms Xilinx Virtex-II, ADRES, REMUS-II, and the TI C64+ DSP.
"Multi-Interval DomLock: Toward Improving Concurrency in Hierarchies" by M. A. Anju and R. Nasre. ACM Transactions on Parallel Computing 9(1), pp. 1–27, 2022. https://doi.org/10.1145/3543543

Locking is the predominant technique for achieving thread synchronization and ensuring correctness in multi-threaded applications. It is well established that concurrent applications working with hierarchical data benefit significantly from multi-granularity locking (MGL) compared to either fine- or coarse-grained locking. The de facto MGL technique used in hierarchical databases is intention locks, which use a traversal-based protocol for hierarchical locking. A recent MGL implementation, dominator-based locking (DomLock), exploits interval numbering to balance locking cost and concurrency, and outperforms intention locks for non-tree-structured hierarchies. We observe, however, that depending upon the hierarchy structure and the interval numbering, DomLock pessimistically declares sub-hierarchies to be locked when in reality they are not. This increases the waiting time of locks and, in turn, reduces concurrency. To address this issue, we present Multi-Interval DomLock (MID), a new technique to improve the degree of concurrency of interval-based hierarchical locking. By assigning additional intervals to each node, MID reduces the unnecessary lock rejections caused by the false-positive lock status of sub-hierarchies. Unleashing these hidden opportunities for concurrency allows parallel threads to finish their operations quickly, leading to notable performance improvement. We also show that, with a sufficient number of intervals, MID can avoid all lock rejections due to the false-positive lock status of nodes. MID is general and can be applied to any arbitrary hierarchy of trees, directed acyclic graphs (DAGs), and cycles. It also works with dynamic hierarchies wherein the hierarchical structure undergoes updates. We illustrate the effectiveness of MID using STMBench7 and, with extensive experimental evaluation, show that it leads to significant throughput improvement (up to 141%, average 106%) over DomLock.
"Simple Concurrent Connected Components Algorithms" by S. Liu and R. Tarjan. ACM Transactions on Parallel Computing 9(1), pp. 1–26, 2022. https://doi.org/10.1145/3543546

We study a class of simple algorithms for concurrently computing the connected components of an n-vertex, m-edge graph. Our algorithms are easy to implement in either the COMBINING CRCW PRAM or the MPC computing model. For two related algorithms in this class, we obtain Θ(lg n) step and Θ(m lg n) work bounds. For two others, we obtain O(lg² n) step and O(m lg² n) work bounds, which are tight for one of them. All our algorithms are simpler than related algorithms in the literature. We also point out some gaps and errors in the analysis of previous algorithms. Our results show that even a basic problem like connected components still has secrets to reveal.