Enumerating simple cycles has important applications in computational biology, network science, and financial crime analysis. In this work, we focus on parallelising the state-of-the-art simple cycle enumeration algorithms by Johnson and Read-Tarjan, along with their applications to temporal graphs. To our knowledge, we are the first to parallelise these two algorithms in a fine-grained manner. We are also the first to demonstrate linear performance scaling experimentally. Such scaling is made possible by our decomposition of long sequential searches into fine-grained tasks, which are then dynamically scheduled across CPU cores, enabling optimal load balancing. Furthermore, we show that coarse-grained parallel versions of the Johnson and the Read-Tarjan algorithms that exploit edge- or vertex-level parallelism are not scalable. On a cluster of four multi-core CPUs with 256 physical cores, our fine-grained parallel algorithms are, on average, an order of magnitude faster than their coarse-grained parallel counterparts. The performance gap between the fine-grained and the coarse-grained parallel algorithms widens as we use more CPU cores. When using all 256 CPU cores, our parallel algorithms enumerate temporal cycles, on average, 260x faster than the serial algorithm of Kumar and Calders.
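To make the vertex-level (coarse-grained) decomposition mentioned in the abstract concrete, here is a minimal sequential sketch. It is a plain backtracking search, not the Johnson or Read-Tarjan algorithm (both add pruning that is omitted here), and the names `cycles_from` and `simple_cycles` are hypothetical. Each start vertex forms one independent task, so the per-task cycle counts give a rough feel for how uneven such tasks can be.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]  # adjacency lists of a directed graph


def cycles_from(graph: Graph, start: int) -> List[List[int]]:
    """Enumerate the simple cycles whose smallest vertex is `start`.

    Naive backtracking (an illustrative assumption, not Johnson's or
    Read-Tarjan's method): only vertices larger than `start` may be
    visited, so every cycle is reported exactly once overall.
    """
    found: List[List[int]] = []
    path = [start]
    on_path = {start}

    def extend(v: int) -> None:
        for w in graph.get(v, []):
            if w == start:
                found.append(list(path))  # closed a cycle back to start
            elif w > start and w not in on_path:
                path.append(w)
                on_path.add(w)
                extend(w)
                on_path.remove(w)
                path.pop()

    extend(start)
    return found


def simple_cycles(graph: Graph) -> List[List[int]]:
    """Run one independent (coarse-grained) task per start vertex."""
    result: List[List[int]] = []
    for s in sorted(graph):
        result.extend(cycles_from(graph, s))
    return result


if __name__ == "__main__":
    g = {0: [1], 1: [2, 0], 2: [0]}
    for s in sorted(g):
        print(s, len(cycles_from(g, s)))  # cycles found by each task
```

In a vertex-parallel scheme each call to `cycles_from` would run on its own core. A task whose start vertex lies on many cycles then remains one long sequential search, which is the kind of search the abstract proposes to split into finer tasks.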
{"title":"Scalable Fine-Grained Parallel Cycle Enumeration Algorithms","authors":"J. Blanuša, P. Ienne, K. Atasu","doi":"10.1145/3490148.3538585","DOIUrl":"https://doi.org/10.1145/3490148.3538585","url":null,"PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115044268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The bamboo trimming problem considers $n$ bamboo with growth rates $h_1, h_2, \ldots, h_n$ satisfying $\sum_i h_i = 1$. During a given unit of time, each bamboo $i$ grows by $h_i$, and then the bamboo-trimming algorithm gets to trim one of the bamboo back down to height zero. The goal is to minimize the height of the tallest bamboo, also known as the backlog. The bamboo trimming problem is closely related to many scheduling problems, and can be viewed as a variation of the widely studied fixed-rate cup game, but with constant-factor resource augmentation. Past work has given sophisticated pinwheel algorithms that achieve the optimal backlog of 2 in the bamboo trimming problem. It remained an open question, however, whether there exists a simple algorithm with the same guarantee; recent work has devoted considerable theoretical and experimental effort to answering this question. Two algorithms, in particular, have appeared as natural candidates: the Reduce-Max algorithm (which always cuts the tallest bamboo) and the Reduce-Fastest(x) algorithm (which cuts the fastest-growing bamboo out of those that have at least some height x). It is conjectured that Reduce-Max and Reduce-Fastest(1) both achieve backlog 2. This paper improves the bounds for both Reduce-Fastest and Reduce-Max. Among other results, we show that the exact optimal backlog for Reduce-Fastest(x) is x + 1 for all x ≥ 2 (this proves a conjecture of D'Emidio, Di Stefano, and Navarra in the case of x = 2), and we show that Reduce-Fastest(1) does not achieve backlog 2 (this disproves a conjecture of D'Emidio, Di Stefano, and Navarra). Finally, we show that there is a different algorithm, which we call the Deadline-Driven Strategy, that is both very simple and achieves the optimal backlog of 2. This resolves the question as to whether there exists a simple worst-case optimal algorithm for the bamboo trimming problem.
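The two candidate strategies named in the abstract are easy to experiment with. The sketch below simulates the grow-then-trim loop with exact rational arithmetic. Several details are assumptions rather than anything the abstract specifies: the backlog is measured right after growth and before the cut, ties go to the lowest index, and Reduce-Fastest(x) cuts nothing when no bamboo has reached height x. The names `simulate`, `reduce_max`, and `reduce_fastest` are hypothetical.

```python
from fractions import Fraction
from typing import Callable, List, Optional

Chooser = Callable[[List[Fraction], List[Fraction]], Optional[int]]


def simulate(rates: List[Fraction], choose: Chooser, steps: int) -> Fraction:
    """Return the largest height observed over `steps` time units.

    Assumption: heights are measured after growth and before trimming.
    """
    heights = [Fraction(0)] * len(rates)
    backlog = Fraction(0)
    for _ in range(steps):
        heights = [h + r for h, r in zip(heights, rates)]  # every bamboo grows
        backlog = max(backlog, max(heights))
        i = choose(heights, rates)
        if i is not None:
            heights[i] = Fraction(0)  # trim the chosen bamboo to zero
    return backlog


def reduce_max(heights: List[Fraction], rates: List[Fraction]) -> Optional[int]:
    """Reduce-Max: cut the tallest bamboo (lowest index wins ties)."""
    return max(range(len(heights)), key=lambda i: heights[i])


def reduce_fastest(x: Fraction) -> Chooser:
    """Reduce-Fastest(x): among bamboo of height >= x, cut the fastest grower."""
    def choose(heights: List[Fraction], rates: List[Fraction]) -> Optional[int]:
        eligible = [i for i, h in enumerate(heights) if h >= x]
        if not eligible:
            return None  # assumption: skip the cut if nothing is eligible
        return max(eligible, key=lambda i: rates[i])
    return choose


if __name__ == "__main__":
    rates = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # sums to 1
    print("Reduce-Max:", simulate(rates, reduce_max, 200))
    print("Reduce-Fastest(2):", simulate(rates, reduce_fastest(Fraction(2)), 200))
```

A finite simulation on one instance only gives a lower bound on the worst-case backlog of a strategy; it cannot confirm or refute the bounds stated in the abstract.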
{"title":"Bamboo Trimming Revisited: Simple Algorithms Can Do Well Too","authors":"John Kuszmaul","doi":"10.1145/3490148.3538580","DOIUrl":"https://doi.org/10.1145/3490148.3538580","url":null,"PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122819966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The concurrency literature presents a number of approaches for building non-blocking, FIFO, multiple-producer and multiple-consumer (MPMC) queues. However, only a fraction of them have high performance. In addition, many queue designs, such as LCRQ, trade memory usage for better performance. The recently proposed SCQ design achieves both memory efficiency and excellent performance. Unfortunately, both LCRQ and SCQ are only lock-free. On the other hand, existing wait-free queues are either not very performant or suffer from potentially unbounded memory usage. Strictly speaking, the latter queues, such as Yang & Mellor-Crummey's (YMC) queue, forfeit wait-freedom because they block when memory is exhausted. We present a wait-free queue, called wCQ. wCQ is based on SCQ and uses its own variation of the fast-path-slow-path methodology to attain wait-freedom and bound memory usage. Our experimental studies on x86 and PowerPC architectures validate wCQ's great performance and memory efficiency. They also show that wCQ's performance is often on par with the best known concurrent queue designs.
{"title":"wCQ: A Fast Wait-Free Queue with Bounded Memory Usage","authors":"R. Nikolaev, B. Ravindran","doi":"10.1145/3490148.3538572","DOIUrl":"https://doi.org/10.1145/3490148.3538572","url":null,"PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131816511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Contention resolution on a multiple-access communication channel is a classical problem in distributed and parallel computing. In this problem, a set of nodes arrive over time, each with a message it intends to send. Time proceeds in synchronous slots, and in each slot each node can broadcast its message or remain idle. If in a slot one node broadcasts alone, it succeeds; otherwise, if multiple nodes broadcast simultaneously, messages collide and none succeeds. Nodes can differentiate collision and silence (that is, no node broadcasts) only if a collision detection mechanism is available. Ideally, a contention resolution algorithm should satisfy at least three criteria: (a) low time complexity (i.e., high throughput), meaning it does not take too long for all nodes to succeed; (b) low energy complexity, meaning each node does not make too many broadcast attempts before it succeeds; and (c) strong robustness, meaning the algorithm can maintain good performance even if interference is present. Such interference is often modeled by jamming: a jammed slot always generates a collision. Previous work has shown that, with collision detection, there are "perfect" contention resolution algorithms satisfying all three criteria. On the other hand, without collision detection, it was not until 2020 that an algorithm was discovered which can achieve optimal time complexity and low energy cost, assuming there is no jamming. More recently, the trade-off between throughput and robustness was studied. However, an intriguing and important question remains open: without collision detection, are there "perfect" contention resolution algorithms? In other words, when collision detection is absent and jamming is present, can we achieve both low total time complexity and low per-node energy cost? In this paper, we answer the above question affirmatively. Specifically, a new randomized algorithm for robust contention resolution is developed, assuming collision detection is not available. Lower bound results demonstrate that it achieves both optimal time complexity and optimal energy complexity. If all nodes start execution simultaneously (often referred to as the "static case" in the literature), another algorithm is developed that runs even faster. This separation in time complexity suggests that, for robust contention resolution without collision detection, "batch" instances (that is, nodes start simultaneously) are inherently easier than "scattered" ones (that is, nodes arrive over time).
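The channel model described in the abstract (synchronous slots, success only when exactly one node broadcasts in an unjammed slot) can be sketched in a few lines. This is not the paper's algorithm. The broadcast rule below uses probability 1/(number of remaining nodes), which assumes global knowledge that real nodes do not have and serves only to exercise the model in the static case. The names `resolve_slot` and `run_static` are hypothetical.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def resolve_slot(broadcasters: List[int], jammed: bool) -> Optional[int]:
    """Return the node that succeeds in this slot, or None.

    A slot succeeds only if exactly one node broadcasts and the slot is
    not jammed; a jammed slot always behaves like a collision.
    """
    if jammed or len(broadcasters) != 1:
        return None
    return broadcasters[0]


def run_static(n: int, jammed_slots: Set[int], seed: int = 0,
               max_slots: int = 100_000) -> Tuple[int, Dict[int, int]]:
    """Static case: all n nodes start in slot 0.

    Returns (slots used, broadcast attempts per node). The slot count
    stands in for time complexity and the largest attempt count for
    per-node energy.
    """
    rng = random.Random(seed)
    active = set(range(n))
    attempts = {v: 0 for v in range(n)}
    slot = 0
    while active and slot < max_slots:
        # Idealized assumption: every node knows how many nodes remain.
        p = 1.0 / len(active)
        broadcasters = [v for v in sorted(active) if rng.random() < p]
        for v in broadcasters:
            attempts[v] += 1
        winner = resolve_slot(broadcasters, slot in jammed_slots)
        if winner is not None:
            active.remove(winner)  # a successful node leaves the system
        slot += 1
    return slot, attempts


if __name__ == "__main__":
    slots, attempts = run_static(8, jammed_slots={0, 1, 2}, seed=1)
    print("slots:", slots, "max attempts:", max(attempts.values()))
```

Adding slots to `jammed_slots` shows how interference delays completion while nodes keep spending broadcast attempts, which is the tension between criteria (a) to (c) that the abstract describes.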
{"title":"Robust and Optimal Contention Resolution without Collision Detection","authors":"Yonggang Jiang, Chaodong Zheng","doi":"10.1145/3490148.3538592","DOIUrl":"https://doi.org/10.1145/3490148.3538592","url":null,"PeriodicalId":112865,"journal":{"name":"Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116973217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}