
ACM Transactions on Parallel Computing: Latest Publications

Parallel Peeling of Bipartite Networks for Hierarchical Dense Subgraph Discovery
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2021-10-24 DOI: 10.1145/3583084
Kartik Lakhotia, R. Kannan, V. Prasanna
Wing and Tip decomposition are motif-based analytics for bipartite graphs that construct a hierarchy of butterfly (2,2-biclique) dense edge- and vertex-induced subgraphs, respectively. They have applications in several domains, including e-commerce, recommendation systems, and document analysis. Existing decomposition algorithms use a bottom-up approach that constructs the hierarchy in increasing order of subgraph density: they iteratively select the edges or vertices with the minimum butterfly count and peel them, i.e., remove them along with their butterflies. The sheer number of butterflies in real-world bipartite graphs makes bottom-up peeling computationally demanding. Furthermore, the strict order of peeling entities results in a large number of sequentially dependent iterations. Consequently, parallel algorithms based on bottom-up peeling incur heavy synchronization and scale poorly. In this article, we propose a novel Parallel Bipartite Network peelinG (PBNG) framework that adopts a two-phased peeling approach to relax the order of peeling and, in turn, dramatically reduce synchronization. The first phase divides the decomposition hierarchy into a few partitions and requires little synchronization. The second phase concurrently processes all partitions to generate the individual levels of the hierarchy and requires no global synchronization. The two-phased peeling further enables batching optimizations that dramatically improve the computational efficiency of PBNG. We empirically evaluate PBNG on several real-world bipartite graphs and demonstrate radical improvements over existing approaches. On a shared-memory 36-core server, PBNG achieves up to 19.7× self-relative parallel speedup. Compared to the state-of-the-art parallel framework ParButterfly, PBNG reduces synchronization by up to 15,260× and execution time by up to 295×. Furthermore, it achieves up to 38.5× speedup over state-of-the-art algorithms specifically tuned for wing decomposition.
Our source code is made available at https://github.com/kartiklakhotia/RECEIPT.
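As background, the bottom-up peeling that PBNG improves upon can be sketched naively in Python; the toy graph, vertex names, and recompute-every-round strategy below are illustrative assumptions, not the PBNG algorithm itself:

```python
def butterfly_counts(adj, side):
    """Butterflies (2,2-bicliques) containing each vertex in `side`.
    `adj` maps a vertex on `side` to its set of neighbours; a butterfly
    through u pairs u with one other vertex w and two common neighbours."""
    counts = {}
    for u in side:
        total = 0
        for w in side:
            if w != u:
                c = len(adj[u] & adj[w])
                total += c * (c - 1) // 2  # choose 2 common neighbours
        counts[u] = total
    return counts

def naive_tip_decomposition(adj):
    """Bottom-up peeling: repeatedly remove the vertex with the minimum
    butterfly count; its tip number is the running maximum of those counts.
    Recounting from scratch each round is what makes this expensive."""
    remaining = set(adj)
    tips, level = {}, 0
    while remaining:
        counts = butterfly_counts(adj, remaining)
        u = min(remaining, key=counts.get)
        level = max(level, counts[u])
        tips[u] = level
        remaining.remove(u)
    return tips
```

On the bipartite graph `{'a': {'x', 'y'}, 'b': {'x', 'y'}, 'c': {'x'}}` this yields tip numbers `{'a': 1, 'b': 1, 'c': 0}`; the full recount in every iteration, and the strict minimum-first order, are exactly the costs the two-phased approach attacks.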
Citations: 1
Metrics and Design of an Instruction Roofline Model for AMD GPUs
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2021-10-15 DOI: 10.1145/3505285
M. Leinhauser, R. Widera, S. Bastrakov, A. Debus, M. Bussmann, S. Chandrasekaran
Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU-GPU) architectures, which means moving away from traditional CPU and NVIDIA-GPU systems. Given the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD’s ROCProfiler and a benchmarking tool, BabelStream (the HIP implementation), as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware. Specifically, we create instruction roofline models for a case-study scientific application, PIConGPU, an open-source particle-in-cell simulation application used for plasma and laser-plasma physics, on the NVIDIA V100, AMD Radeon Instinct MI60, and AMD Instinct MI100 GPUs. When looking at the performance of multiple kernels of interest in PIConGPU, we find that although the AMD MI100 GPU achieves similar, or better, execution times compared to the NVIDIA V100 GPU, differences between the profiling tools make comparing the performance of the two architectures hard. When looking at execution time, GIPS, and instruction intensity, the AMD MI60 achieves the worst performance of the three GPUs used in this work.
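The roofline construction itself is simple arithmetic over profiler counters; a minimal sketch follows, where the peak-throughput numbers used in the test are placeholders, not measured values for any of the GPUs studied:

```python
def attainable_gips(intensity, peak_gips, peak_gtps):
    """Instruction roofline ceiling: attainable performance (GIPS) at a
    given instruction intensity (instructions per memory transaction) is
    capped by either peak instruction throughput (peak_gips) or memory
    transaction bandwidth (peak_gtps, giga-transactions per second)."""
    return min(peak_gips, intensity * peak_gtps)

def measured_point(instructions, transactions, seconds):
    """Place a kernel on the roofline plot from raw profiler counters:
    returns (instruction intensity, achieved GIPS)."""
    intensity = instructions / transactions
    gips = instructions / seconds / 1e9
    return intensity, gips
```

For example, under a hypothetical peak of 100 GIPS and 50 GTXN/s, a kernel at intensity 0.5 is memory-bound at 25 GIPS, while a kernel at intensity 10 hits the 100 GIPS compute ceiling.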
Citations: 4
Introduction to the Special Issue for SPAA 2019
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2021-09-20 DOI: 10.1145/3477610
P. Berenbrink
1. Soheil Behnezhad, Laxman Dhulipala, Hossein Esfandiari, Jakub Łącki, Vahab Mirrokni, Warren Schudy: Massively Parallel Computation via Remote Memory Access
2. Faith Ellen, Barun Gorain, Avery Miller, Andrzej Pelc: Constant-Length Labeling Schemes for Deterministic Radio Broadcast
3. Michael A. Bender, Alex Conway, Martín Farach-Colton, William Jannen, Yizheng Jiao, Rob Johnson, Eric Knorr, Sara McAllister, Nirjhar Mukherjee, Prashant Pandey, Donald E. Porter, Jun Yuan, and Yang Zhan: External-Memory Dictionaries in the Affine and PDAM Models
Citations: 0
Study of Fine-grained Nested Parallelism in CDCL SAT Solvers
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2021-09-20 DOI: 10.1145/3470639
J. Edwards, U. Vishkin
Boolean satisfiability (SAT) is an important, performance-hungry problem with applications in many domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassingly parallel, approaches. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers, which all happen to be of the so-called Conflict-Driven Clause Learning (CDCL) variety. We show the potential for speedups of up to 382× across a variety of problem instances. We hope that these results will stimulate future research, particularly with respect to an open computer-architecture problem we present.
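The fine-grained parallelism studied here lives inside the hot loops of CDCL solvers; as background, a minimal sequential unit-propagation loop (one such hot loop) can be sketched as follows, using an illustrative DIMACS-style clause encoding:

```python
def unit_propagate(clauses, assignment):
    """Repeatedly assign literals forced by unit clauses.
    Clauses are lists of ints (DIMACS style: positive literal means the
    variable is true, negative means false); `assignment` maps variable
    numbers to booleans. Returns the extended assignment, or None on
    conflict (an unsatisfied clause with no unassigned literal)."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            satisfied = any(assignment.get(abs(l)) == (l > 0)
                            for l in clause)
            if satisfied:
                continue
            unassigned = [l for l in clause if abs(l) not in assignment]
            if not unassigned:
                return None  # conflict: clause is falsified
            if len(unassigned) == 1:
                lit = unassigned[0]          # unit clause: forced literal
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment
```

On the formula {x1} ∧ {¬x1 ∨ x2} ∧ {¬x2 ∨ x3}, propagation forces all three variables true; scanning the clause list is the kind of inner loop whose fine-grained parallelization the paper studies.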
Citations: 3
Massively Parallel Computation via Remote Memory Access
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2021-09-20 DOI: 10.1145/3470631
Soheil Behnezhad, Laxman Dhulipala, Hossein Esfandiari, Jakub Lacki, V. Mirrokni, W. Schudy
We introduce the Adaptive Massively Parallel Computation (AMPC) model, which is an extension of the Massively Parallel Computation (MPC) model. At a high level, the AMPC model strengthens the MPC model by storing all messages sent within a round in a distributed data store. In the following round, all machines are provided with random read access to the data store, subject to the same constraints on the total amount of communication as in the MPC model. Our model is inspired by the previous empirical studies of distributed graph algorithms [8, 30] using MapReduce and a distributed hash table service [17]. This extension allows us to give new graph algorithms with much lower round complexities compared to the best-known solutions in the MPC model. In particular, in the AMPC model we show how to solve maximal independent set in O(1) rounds and connectivity/minimum spanning tree in O(log log_{m/n} n) rounds, both using O(n^δ) space per machine for constant δ < 1. In the same memory regime for MPC, the best-known algorithms for these problems require poly(log n) rounds. Our results imply that the 2-CYCLE conjecture, which is widely believed to hold in the MPC model, does not hold in the AMPC model.
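The defining feature of AMPC, adaptive reads within a single round, can be illustrated with a toy pointer-chasing helper; the parent-array encoding and read budget are illustrative assumptions, not the paper's algorithms:

```python
def ampc_find_root(store, u, read_budget):
    """In AMPC, a machine may issue adaptive reads against the shared
    store within one round: each read can depend on the result of the
    previous one, so an entire parent chain collapses in a single round
    (subject to the communication budget). In plain MPC the same chain
    would cost roughly one round per hop."""
    for _ in range(read_budget):
        parent = store[u]  # one adaptive read of the distributed store
        if parent == u:    # reached a root
            break
        u = parent
    return u
```

For the chain 3 → 2 → 1 → 0 (with 0 a self-loop root), a budget of 10 reads reaches the root in one simulated round, whereas a budget of 1 advances only a single hop, mirroring the MPC restriction.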
Citations: 4
Efficient Parallel 3D Computation of the Compressible Euler Equations with an Invariant-domain Preserving Second-order Finite-element Scheme
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2021-09-20 DOI: 10.1145/3470637
M. Maier, M. Kronbichler
We discuss the efficient implementation of a high-performance second-order collocation-type finite-element scheme for solving the compressible Euler equations of gas dynamics on unstructured meshes. The solver is based on the convex-limiting technique introduced by Guermond et al. (SIAM J. Sci. Comput. 40, A3211–A3239, 2018). As such, it is invariant-domain preserving; i.e., the solver maintains important physical invariants and is guaranteed to be stable without the use of ad hoc tuning parameters. This stability comes at the expense of a significantly more involved algorithmic structure that renders conventional high-performance discretizations challenging. We develop an algorithmic design that allows SIMD vectorization of the compute kernel, identify the main ingredients for a good node-level performance, and report excellent weak and strong scaling of a hybrid thread/MPI parallelization.
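The convex-limiting idea, blending a bound-preserving low-order update with a high-order one, can be sketched in scalar form; this is a generic illustration of the technique rather than the scheme of Guermond et al.:

```python
def convex_limit(u_low, u_high, u_min, u_max):
    """Return the convex combination u_low + l * (u_high - u_low) with
    the largest l in [0, 1] that keeps the result inside [u_min, u_max].
    Assumes the low-order update u_low already satisfies the bounds, so
    l = 0 is always admissible and stability is never sacrificed."""
    d = u_high - u_low
    if d > 0:
        l = min(1.0, (u_max - u_low) / d)
    elif d < 0:
        l = min(1.0, (u_min - u_low) / d)
    else:
        l = 1.0
    return u_low + l * d
```

When the high-order value overshoots (e.g., 3.0 against an upper bound of 2.0 from a low-order value of 1.0), the limiter clips to the bound; when it stays in range, the full high-order update is taken.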
Citations: 15
External-memory Dictionaries in the Affine and PDAM Models
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2021-09-20 DOI: 10.1145/3470635
M. A. Bender, Alex Conway, Martín Farach-Colton, William Jannen, Yizheng Jiao, Rob Johnson, Eric Knorr, Sara McAllister, Nirjhar Mukherjee, P. Pandey, Donald E. Porter, Jun Yuan, Yang Zhan
Storage devices have complex performance profiles, including costs to initiate IOs (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations. The Disk-access Machine (DAM) model simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM model is reasonably accurate. In fact, if B is set to the half-bandwidth point, where the latency and bandwidth of the hardware are equal, then the DAM approximates the IO cost on any hardware to within a factor of 2. Furthermore, the DAM model explains the popularity of B-trees in the 1970s and the current popularity of Bɛ-trees and log-structured merge trees. But it fails to explain why some B-trees use small nodes, whereas all Bɛ-trees use large nodes. In a DAM, all IOs, and hence all nodes, are the same size. In this article, we show that the affine and PDAM models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictability without sacrificing ease of use. We present benchmarks on a large collection of storage devices showing that the affine and PDAM models give good approximations of the performance characteristics of hard drives and SSDs, respectively. We show that the affine model explains node-size choices in B-trees and Bɛ-trees. Furthermore, the models predict that B-trees are highly sensitive to variations in the node size, whereas Bɛ-trees are much less sensitive. These predictions are borne out empirically. Finally, we show that in both the affine and PDAM models, it pays to organize data structures to exploit varying IO size. In the affine model, Bɛ-trees can be optimized so that all operations are simultaneously optimal, even up to lower-order terms. In the PDAM model, Bɛ-trees (or B-trees) can be organized so that both sequential and concurrent workloads are handled efficiently.
We conclude that the DAM model is useful as a first cut when designing or analyzing an algorithm or data structure but the affine and PDAM models enable the algorithm designer to optimize parameter choices and fill in design details.
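The contrast between the DAM and affine cost models can be made concrete with toy cost functions; the setup and per-byte constants below are illustrative, not calibrated to any device:

```python
def dam_cost(num_block_transfers):
    """DAM model: every block transfer costs exactly 1, regardless of
    its size, so IO size carries no penalty or reward."""
    return num_block_transfers

def affine_cost(io_sizes, setup=1.0, per_byte=0.01):
    """Affine model: an IO of size k costs setup + per_byte * k, so a
    few large IOs are cheaper than many small ones of the same total
    volume. This is the model's argument for large Bɛ-tree nodes."""
    return sum(setup + per_byte * k for k in io_sizes)
```

Under these constants, ten IOs of size 100 cost 20.0 while a single IO of size 1,000 costs 11.0, even though both transfer the same data; under the DAM both strategies could be modeled as equally cheap.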
Citations: 2
Constant-Length Labeling Schemes for Deterministic Radio Broadcast
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2021-09-20 DOI: 10.1145/3470633
Faith Ellen, B. Gorain, Avery Miller, A. Pelc
Broadcast is one of the fundamental network communication primitives. One node of a network, called the source, has a message that has to be learned by all other nodes. We consider broadcast in radio networks, modeled as simple undirected connected graphs with a distinguished source. Nodes communicate in synchronous rounds. In each round, a node can either transmit a message to all its neighbours, or stay silent and listen. At the receiving end, a node v hears a message from a neighbour w in a given round if v listens in this round and if w is its only neighbour that transmits in this round. If more than one neighbour of a node v transmits in a given round, we say that a collision occurs at v. We do not assume collision detection: in case of a collision, node v does not hear anything (except the background noise that it also hears when no neighbour transmits). We are interested in the feasibility of deterministic broadcast in radio networks. If nodes of the network do not have any labels, deterministic broadcast is impossible even in the four-cycle. On the other hand, if all nodes have distinct labels, then broadcast can be carried out, e.g., in a round-robin fashion, and hence O(log n)-bit labels are sufficient for this task in n-node networks. In fact, O(log Δ)-bit labels, where Δ is the maximum degree, are enough to broadcast successfully. Hence, it is natural to ask if very short labels are sufficient for broadcast. Our main result is a positive answer to this question. We show that every radio network can be labeled using 2 bits in such a way that broadcast can be accomplished by some universal deterministic algorithm that does not know the network topology nor any bound on its size. Moreover, at the expense of an extra bit in the labels, we can get the following additional strong property of our algorithm: there exists a common round in which all nodes know that broadcast has been completed. 
Finally, we show that 3-bit labels are also sufficient to solve both versions of broadcast in the case where it is not known a priori which node is the source.
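As a baseline for these labeling results, the round-robin schedule that O(log n)-bit labels enable can be simulated directly; the adjacency-list encoding below is an illustrative assumption:

```python
def round_robin_broadcast(adj, source):
    """Simulate radio broadcast where, in round r, only the node labeled
    r mod n transmits (if it already knows the message), so collisions
    are impossible by construction. Returns the number of rounds until
    all n nodes are informed. Assumes a connected graph whose nodes are
    labeled 0..n-1, with `adj` mapping each node to its neighbour list."""
    n = len(adj)
    informed = {source}
    rounds = 0
    while len(informed) < n:
        speaker = rounds % n
        if speaker in informed:
            informed.update(adj[speaker])  # sole transmitter, all hear it
        rounds += 1
    return rounds
```

On the path 0-1-2 with source 0, node 0 speaks in round 0 and node 1 in round 1, informing everyone in 2 rounds; the paper's contribution is showing that constant-length labels suffice where this schedule relies on n distinct ones.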
Citations: 2
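The round-robin argument in the abstract (with distinct labels, only one node transmits per round, so collisions never occur) can be made concrete. A minimal Python sketch of that scheme, assuming the synchronous-round model described above; the function name, adjacency encoding, and example graph are illustrative, not from the paper:

```python
def round_robin_broadcast(adj, source):
    """Simulate round-robin broadcast in a radio network.

    adj: dict mapping node label -> set of neighbours (undirected, connected).
    Labels are distinct integers 0..n-1, so exactly one node is scheduled
    to transmit in each round and no collision can occur at any receiver.
    Returns the number of rounds until every node knows the message.
    """
    n = len(adj)
    informed = {source}
    rounds = 0
    while len(informed) < n:
        sender = rounds % n          # the unique label allowed to transmit
        if sender in informed:
            informed |= adj[sender]  # every neighbour hears it collision-free
        rounds += 1
    return rounds

# The four-cycle: impossible without labels, trivial with distinct ones.
cycle4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(round_robin_broadcast(cycle4, 0))  # → 2
```

Since each of the n labels occupies its own round, the schedule needs O(log n)-bit labels, which is exactly the baseline the paper's 2-bit result improves on.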
Bandwidth-Optimal Random Shuffling for GPUs
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2021-06-11 DOI: 10.1145/3505287
Rory Mitchell, Daniel Stokes, E. Frank, G. Holmes
Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies, and existing parallel shuffling algorithms are unsuitable for GPU architectures because they incur a large number of read/write operations to high-latency global memory. To address this, we provide a method of generating pseudo-random permutations in parallel by fusing suitable pseudo-random bijective functions with stream compaction operations. Our algorithm, termed “bijective shuffle,” trades increased per-thread arithmetic operations for reduced global memory transactions. It is work-efficient, deterministic, and only requires a single global memory read and write per shuffle input, thus maximising use of global memory bandwidth. To empirically demonstrate the correctness of the algorithm, we develop a statistical test for the quality of pseudo-random permutations based on kernel space embeddings.
Experimental results show that the bijective shuffle algorithm outperforms competing algorithms on GPUs, showing improvements of between one and two orders of magnitude and approaching peak device bandwidth.
ACM Transactions on Parallel Computing, Vol. 9, No. 1, pp. 1 - 20.
Citations: 3
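The core construction the abstract describes — a pseudo-random bijection on a power-of-two domain followed by stream compaction down to the actual input size — can be sketched sequentially. This is an illustrative Python version, not the paper's GPU implementation: the Feistel-network bijection, the hash-based round function, the key values, and all function names are assumptions, and the final list comprehension stands in for the parallel compaction kernel.

```python
import hashlib

def feistel(x, k_bits, keys):
    """Pseudo-random bijection on [0, 2**k_bits) via a balanced Feistel network."""
    half = k_bits // 2
    mask = (1 << half) - 1
    left, right = x >> half, x & mask
    for key in keys:
        # Any deterministic round function yields a bijection; a hash keeps it simple.
        f = int.from_bytes(hashlib.blake2b(
            f"{key}:{right}".encode(), digest_size=4).digest(), "big") & mask
        left, right = right, left ^ f
    return (left << half) | right

def bijective_shuffle(n, keys=(1, 2, 3, 4)):
    """Permutation of range(n): apply the bijection over the next power of two,
    then compact away the mapped values that fall outside [0, n)."""
    k_bits = max(2, (n - 1).bit_length())
    if k_bits % 2:          # a balanced Feistel network needs an even bit count
        k_bits += 1
    mapped = [feistel(i, k_bits, keys) for i in range(1 << k_bits)]
    return [v for v in mapped if v < n]   # the stream compaction step

perm = bijective_shuffle(10)
print(perm)  # some permutation of 0..9, deterministic for fixed keys
```

Because the bijection hits every value in the power-of-two domain exactly once, filtering to values below n necessarily yields each of 0..n-1 exactly once, which is why compaction produces a valid permutation.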
Engineering In-place (Shared-memory) Sorting Algorithms
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2020-09-28 DOI: 10.1145/3505286
Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders
We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach, taking dynamic load balancing and memory locality into account. Our new comparison-based algorithm, In-place Parallel Super Scalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings. Another surprising result is that IPS4o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm. Claims to have the – in some sense – “best” sorting algorithm can be found in many papers, but they cannot all be true.
Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms, giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.
ACM Transactions on Parallel Computing, Vol. 9, No. 1, pp. 1 - 62.
Citations: 16
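The samplesort idea underlying IPS4o — pick splitters from a random sample, classify every element into one of k buckets, recurse per bucket — can be sketched in a few lines. This copy-based Python version only shows the distribution logic: the real algorithm moves elements in-place block by block, runs in parallel, classifies with a branchless decision tree rather than binary search, and handles equal-key buckets explicitly. Parameter names and defaults here are illustrative.

```python
import bisect
import random

def samplesort(a, k=4, base=16):
    """Sketch of samplesort: k-1 splitters from a sorted sample define k
    buckets; elements are distributed by binary search over the splitters,
    then each bucket is sorted recursively."""
    if len(a) <= base:
        return sorted(a)
    sample = sorted(random.sample(a, min(len(a), 4 * k)))
    splitters = [sample[i * len(sample) // k] for i in range(1, k)]
    buckets = [[] for _ in range(k)]
    for x in a:
        # bisect_right sends an element equal to a splitter to the right bucket.
        buckets[bisect.bisect_right(splitters, x)].append(x)
    out = []
    for b in buckets:
        # If every key is equal, one bucket absorbs the whole input; fall back
        # to sorted() there so the recursion always shrinks.
        out.extend(samplesort(b, k, base) if len(b) < len(a) else sorted(b))
    return out
```

Binary search over k-1 splitters is exactly what the abstract's "branchless decision tree" replaces: the comparisons are the same, but laid out so the hardware never mispredicts a branch.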