Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures: Latest Articles

Scalable Fine-Grained Parallel Cycle Enumeration Algorithms

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-02-19 DOI: 10.1145/3490148.3538585

J. Blanuša, P. Ienne, K. Atasu

Enumerating simple cycles has important applications in computational biology, network science, and financial crime analysis. In this work, we focus on parallelising the state-of-the-art simple cycle enumeration algorithms by Johnson and Read-Tarjan along with their applications to temporal graphs. To our knowledge, we are the first ones to parallelise these two algorithms in a fine-grained manner. We are also the first to demonstrate experimentally a linear performance scaling. Such a scaling is made possible by our decomposition of long sequential searches into fine-grained tasks, which are then dynamically scheduled across CPU cores, enabling an optimal load balancing. Furthermore, we show that coarse-grained parallel versions of the Johnson and the Read-Tarjan algorithms that exploit edge- or vertex-level parallelism are not scalable. On a cluster of four multi-core CPUs with 256 physical cores, our fine-grained parallel algorithms are, on average, an order of magnitude faster than their coarse-grained parallel counterparts. The performance gap between the fine-grained and the coarse-grained parallel algorithms widens as we use more CPU cores. When using all 256 CPU cores, our parallel algorithms enumerate temporal cycles, on average, 260x faster than the serial algorithm of Kumar and Calders.
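To make the problem concrete, here is a minimal sketch of simple cycle enumeration on a directed graph. It is a naive backtracking search, not the Johnson or Read-Tarjan algorithm and not the authors' fine-grained method; the function name `simple_cycles` and the adjacency-dict input are assumptions made for illustration, and the graph is assumed to have comparable vertex labels and no parallel edges.

```python
def simple_cycles(adj):
    """Yield every simple cycle of a directed graph exactly once.

    adj maps each vertex to an iterable of its out-neighbours (hypothetical
    input format). Each cycle is reported as a list of vertices that starts
    at the cycle's smallest vertex, which prevents duplicate reports.
    """
    def extend(start, path, on_path):
        for w in adj.get(path[-1], ()):
            if w == start:
                # Closing edge found: the current path is a simple cycle.
                yield list(path)
            elif w > start and w not in on_path:
                # Only visit vertices larger than the start vertex, and
                # never revisit a vertex already on the current path.
                path.append(w)
                on_path.add(w)
                yield from extend(start, path, on_path)
                on_path.remove(path.pop())

    # Each start vertex defines an independent search.
    for s in sorted(adj):
        yield from extend(s, [s], {s})


if __name__ == "__main__":
    graph = {0: [1], 1: [2, 0], 2: [0]}
    for cycle in simple_cycles(graph):
        print(cycle)
```

The outer loop runs one independent search per start vertex, which is the unit of work a vertex-level (coarse-grained) parallel version would distribute. A single such search can be very long, which is the load-imbalance issue the abstract attributes to coarse-grained approaches.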

Citations: 5

Bamboo Trimming Revisited: Simple Algorithms Can Do Well Too

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-01-18 DOI: 10.1145/3490148.3538580

John Kuszmaul

The bamboo trimming problem considers $n$ bamboo with growth rates $h_1, h_2, \ldots, h_n$ satisfying $\sum_i h_i = 1$. During a given unit of time, each bamboo grows by $h_i$, and then the bamboo-trimming algorithm gets to trim one of the bamboo back down to height zero. The goal is to minimize the height of the tallest bamboo, also known as the backlog. The bamboo trimming problem is closely related to many scheduling problems, and can be viewed as a variation of the widely-studied fixed-rate cup game, but with constant-factor resource augmentation. Past work has given sophisticated pinwheel algorithms that achieve the optimal backlog of 2 in the bamboo trimming problem. It remained an open question, however, whether there exists a simple algorithm with the same guarantee; recent work has devoted considerable theoretical and experimental effort to answering this question. Two algorithms, in particular, have appeared as natural candidates: the Reduce-Max algorithm (which always cuts the tallest bamboo) and the Reduce-Fastest(x) algorithm (which cuts the fastest-growing bamboo out of those that have at least some height x). It is conjectured that Reduce-Max and Reduce-Fastest(1) both achieve backlog 2. This paper improves the bounds for both Reduce-Fastest and Reduce-Max. Among other results, we show that the exact optimal backlog for Reduce-Fastest(x) is x + 1 for all x ≥ 2 (this proves a conjecture of D'Emidio, Di Stefano, and Navarra in the case of x = 2), and we show that Reduce-Fastest(1) does not achieve backlog 2 (this disproves a conjecture of D'Emidio, Di Stefano, and Navarra). Finally, we show that there is a different algorithm, which we call the Deadline-Driven Strategy, that is both very simple and achieves the optimal backlog of 2. This resolves the question as to whether there exists a simple worst-case optimal algorithm for the bamboo trimming problem.
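The grow-then-trim process and the two candidate strategies described above can be sketched as a short simulation. This is an illustrative sketch, not code from the paper: the names `simulate`, `reduce_max`, and `reduce_fastest` are hypothetical, the backlog is assumed to be measured after growth and before trimming, and Reduce-Fastest(x) is assumed to cut nothing when no bamboo has reached height x. The Deadline-Driven Strategy is not shown because the abstract does not define it.

```python
from fractions import Fraction


def simulate(rates, choose, steps):
    """Run the process for `steps` time units and return the largest height seen.

    rates  -- growth rates h_i (assumed to sum to 1)
    choose -- strategy: (heights, rates) -> index to trim, or None
    """
    heights = [Fraction(0)] * len(rates)
    backlog = Fraction(0)
    for _ in range(steps):
        # Every bamboo grows by its own rate.
        heights = [h + r for h, r in zip(heights, rates)]
        # Assumption: backlog is measured before the trim of this step.
        backlog = max(backlog, max(heights))
        i = choose(heights, rates)
        if i is not None:
            heights[i] = Fraction(0)
    return backlog


def reduce_max(heights, rates):
    """Reduce-Max: always cut the tallest bamboo (first one on ties)."""
    return max(range(len(heights)), key=lambda i: heights[i])


def reduce_fastest(x):
    """Reduce-Fastest(x): cut the fastest-growing bamboo of height >= x."""
    def choose(heights, rates):
        eligible = [i for i, h in enumerate(heights) if h >= x]
        if not eligible:
            return None  # assumption: idle when nothing is tall enough
        return max(eligible, key=lambda i: rates[i])
    return choose


if __name__ == "__main__":
    rates = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
    print("Reduce-Max backlog:", simulate(rates, reduce_max, 200))
    print("Reduce-Fastest(2) backlog:", simulate(rates, reduce_fastest(2), 200))
```

Exact rational arithmetic via `Fraction` avoids floating-point drift when comparing heights against the threshold x. A single instance like this one only illustrates the definitions; the bounds in the abstract are worst-case statements over all inputs.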

Citations: 2

wCQ: A Fast Wait-Free Queue with Bounded Memory Usage

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2022-01-06 DOI: 10.1145/3490148.3538572

R. Nikolaev, B. Ravindran

The concurrency literature presents a number of approaches for building non-blocking, FIFO, multiple-producer and multiple-consumer (MPMC) queues. However, only a fraction of them have high performance. In addition, many queue designs, such as LCRQ, trade memory usage for better performance. The recently proposed SCQ design achieves both memory efficiency as well as excellent performance. Unfortunately, both LCRQ and SCQ are only lock-free. On the other hand, existing wait-free queues are either not very performant or suffer from potentially unbounded memory usage. Strictly described, the latter queues, such as Yang & Mellor-Crummey's (YMC) queue, forfeit wait-freedom as they are blocking when memory is exhausted. We present a wait-free queue, called wCQ. wCQ is based on SCQ and uses its own variation of fast-path-slow-path methodology to attain wait-freedom and bound memory usage. Our experimental studies on x86 and PowerPC architectures validate wCQ's great performance and memory efficiency. They also show that wCQ's performance is often on par with the best known concurrent queue designs.

Citations: 2

Robust and Optimal Contention Resolution without Collision Detection

Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures

Pub Date : 2021-11-12 DOI: 10.1145/3490148.3538592

Yonggang Jiang, Chaodong Zheng

Contention resolution on a multiple-access communication channel is a classical problem in distributed and parallel computing. In this problem, a set of nodes arrive over time, each with a message it intends to send. Time proceeds in synchronous slots, and in each slot each node can broadcast its message or remain idle. If in a slot one node broadcasts alone, it succeeds; otherwise, if multiple nodes broadcast simultaneously, messages collide and none succeeds. Nodes can differentiate collision and silence (that is, no node broadcasts) only if a collision detection mechanism is available. Ideally, a contention resolution algorithm should satisfy at least three criteria: (a) low time complexity (i.e., high throughput), meaning it does not take too long for all nodes to succeed; (b) low energy complexity, meaning each node does not make too many broadcast attempts before it succeeds; and (c) strong robustness, meaning the algorithm can maintain good performance even if interference is present. Such interference is often modeled by jamming: a jammed slot always generates a collision. Previous work has shown that, with collision detection, there are "perfect" contention resolution algorithms satisfying all three criteria. On the other hand, without collision detection, it was not until 2020 that an algorithm was discovered which can achieve optimal time complexity and low energy cost, assuming there is no jamming. More recently, the trade-off between throughput and robustness was studied. However, an intriguing and important question remains unknown: without collision detection, are there "perfect" contention resolution algorithms? In other words, when collision detection is absent and jamming is present, can we achieve both low total time complexity and low per-node energy cost? In this paper, we answer the above question affirmatively. Specifically, a new randomized algorithm for robust contention resolution is developed, assuming collision detection is not available. Lower bound results demonstrate it achieves both optimal time complexity and optimal energy complexity. If all nodes start execution simultaneously (often referred to as the "static case" in the literature), another algorithm is developed that runs even faster. The separation on time complexity suggests, for robust contention resolution without collision detection, "batch" instances (that is, nodes start simultaneously) are inherently easier than "scattered" ones (that is, nodes arrive over time).
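The slot semantics described in this abstract can be sketched as a small simulation. This illustrates the channel model only, not the authors' algorithm: the function name `run_channel`, the fixed per-slot broadcast probability `p`, and the assumption that a node learns of its own success and then leaves are hypothetical choices. All nodes start together, as in the "static case".

```python
import random


def run_channel(n, p, max_slots, jammed=frozenset(), seed=0):
    """Simulate a slotted multiple-access channel without collision detection.

    n         -- number of nodes, all starting in slot 0 (static case)
    p         -- assumed fixed probability that a pending node broadcasts
    max_slots -- stop after this many slots even if nodes remain
    jammed    -- set of slot indices in which jamming forces a collision

    Returns (slots_used, still_pending, attempts), where attempts[v] counts
    the broadcasts of node v (its energy cost).
    """
    rng = random.Random(seed)
    pending = set(range(n))
    attempts = [0] * n
    slots = 0
    while pending and slots < max_slots:
        # Each pending node decides independently whether to broadcast.
        senders = [v for v in sorted(pending) if rng.random() < p]
        for v in senders:
            attempts[v] += 1
        # A broadcast succeeds only if it is alone and the slot is not jammed.
        if len(senders) == 1 and slots not in jammed:
            pending.remove(senders[0])
        slots += 1
    return slots, pending, attempts


if __name__ == "__main__":
    slots, pending, attempts = run_channel(n=8, p=0.125, max_slots=10_000)
    print("slots used:", slots)
    print("unfinished nodes:", sorted(pending))
    print("max broadcasts by one node:", max(attempts))
```

The two returned measures correspond to criteria (a) and (b): `slots_used` is the time until every node has succeeded, and `max(attempts)` is the worst per-node energy cost. Passing a non-empty `jammed` set shows how interference inflates both under this naive fixed-probability strategy.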

Citations: 2
