
Latest articles in ACM Transactions on Parallel Computing

Faster Supervised Average Consensus in Adversarial and Stochastic Anonymous Dynamic Networks
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2023-04-24 DOI: 10.1145/3593426
Aleksandar Kamenev, D. Kowalski, Miguel A. Mosteiro
How do we reach consensus on an average value in a dynamic crowd without revealing identity? In this work, we study the problem of average network consensus in Anonymous Dynamic Networks (ADN). Network dynamicity is specified by the sequence of topology-graph isoperimetric numbers occurring over time, which we call the isoperimetric dynamicity of the network. The consensus variable is the average of values initially held by nodes, which is customary in the network-consensus literature. Given that an algorithm to compute the average can be used to compute the network size (i.e., to solve the counting problem) and vice versa, we focus on the latter. We present a deterministic distributed average network consensus algorithm for ADNs that we call isoperimetric Scalable Coordinated Anonymous Local Aggregation, and we analyze its performance for different scenarios, including worst-case (adversarial) and stochastic dynamic topologies. Our solution utilizes supervisor nodes, which have been shown to be necessary for computations in ADNs. The algorithm takes the isoperimetric dynamicity of the network as an input, meaning that only the isoperimetric number parameters (or a lower bound on them) must be given; topologies may occur arbitrarily or stochastically as long as they comply with those parameters. Previous work for adversarial ADNs overestimates the running time to deal with worst-case scenarios. For ADNs with given isoperimetric dynamicity, our analysis shows improved performance for some practical dynamic topologies, with cubic time or better for stochastic ADNs, and our experimental evaluation indicates that our theoretical bounds could not be substantially improved for some models of dynamic networks.
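The paper's isoperimetric-dynamicity-based algorithm is not reproduced in the abstract, but the invariant at the heart of average consensus is easy to illustrate: pairwise averaging preserves the global mean no matter how the topology changes between rounds. A minimal sketch (all names hypothetical; this is a generic gossip simulation, not the authors' algorithm):

```python
import random

def average_consensus(values, rounds, n_edges=8, seed=0):
    """Simulate average consensus over a randomly changing (dynamic) topology.

    Each round a fresh random set of node pairs is drawn (standing in for the
    adversarial/stochastic topology), and every pair performs one averaging
    step. Pairwise averaging preserves the global sum, so all nodes converge
    to the initial average."""
    rng = random.Random(seed)
    vals = list(values)
    n = len(vals)
    for _ in range(rounds):
        for _ in range(n_edges):
            i, j = rng.sample(range(n), 2)
            m = (vals[i] + vals[j]) / 2.0  # mean of the pair; sum is unchanged
            vals[i] = vals[j] = m
    return vals
```

How fast the spread shrinks depends on the connectivity of the drawn topologies, which is exactly the role the isoperimetric numbers play in the paper's analysis.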
Citations: 0
A Distributed-GPU Deep Reinforcement Learning System for Solving Large Graph Optimization Problems
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2023-03-23 DOI: 10.1145/3589188
Weijian Zheng, Dali Wang, Fengguang Song
Graph optimization problems (such as minimum vertex cover, maximum cut, and traveling salesman problems) appear in many fields, including social sciences, power systems, chemistry, and bioinformatics. Recently, deep reinforcement learning (DRL) has shown success in automatically learning good heuristics to solve graph optimization problems. However, the existing RL systems either do not support graph RL environments or do not support multiple or many GPUs in a distributed setting. This has limited the ability of reinforcement learning to solve large-scale graph optimization problems, owing to the lack of parallelization and scalability. To address the challenges of parallelization and scalability, we develop RL4GO, a high-performance distributed-GPU DRL framework for solving graph optimization problems. RL4GO focuses on a class of computationally demanding RL problems, where both the RL environment and the policy model are highly computation intensive. Traditional reinforcement learning systems often assume that either the RL environment has low time complexity or the policy model is small. In this work, we distribute large-scale graphs across distributed GPUs and use spatial parallelism and data parallelism to achieve scalable performance. We compare and analyze the performance of spatial parallelism and data parallelism and show their differences. To support graph neural network (GNN) layers that take as input data samples partitioned across distributed GPUs, we design parallel mathematical kernels to perform operations on distributed 3D sparse and 3D dense tensors. To handle costly RL environments, we design a parallel graph environment to scale up all RL-environment-related operations. By combining the scalable GNN layers with the scalable RL environment, we are able to develop high-performance RL4GO training and inference algorithms in parallel.
Furthermore, we propose two optimization techniques, on-the-fly graph generation for the replay buffer and adaptive multiple-node selection, to minimize the spatial cost and accelerate reinforcement learning. This work also conducts in-depth analyses of parallel efficiency and memory cost and shows that the designed RL4GO algorithms are scalable on numerous distributed GPUs. Evaluations on large-scale graphs show that (1) RL4GO training and inference can achieve good parallel efficiency on 192 GPUs, (2) its training time can be 18 times faster than the state-of-the-art Gorila distributed RL framework [34], and (3) its inference performance achieves a 26 times improvement over Gorila.
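The on-the-fly graph generation idea for the replay buffer can be sketched generically: store only the generator seed instead of the materialized graph, and regenerate the graph deterministically when a sample is drawn, trading compute for memory. This is an illustrative sketch with hypothetical names, not RL4GO's implementation:

```python
import random

def make_graph(seed, n=6, p=0.5):
    """Deterministically (re)generate an Erdos-Renyi edge list from its seed."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

class OnTheFlyReplayBuffer:
    """Replay buffer that stores only (seed, action, reward) tuples and
    regenerates each graph on demand, minimizing the buffer's spatial cost."""
    def __init__(self):
        self.items = []

    def push(self, seed, action, reward):
        self.items.append((seed, action, reward))

    def sample(self, k, rng=random):
        batch = rng.sample(self.items, k)
        # Regenerate the graphs only for the sampled transitions.
        return [(make_graph(seed), a, r) for seed, a, r in batch]
```

The key property is that `make_graph(seed)` always reproduces the same graph, so the stored seed is a lossless, constant-size stand-in for the graph itself.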
Citations: 0
POETS: An Event-driven Approach to Dissipative Particle Dynamics
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2023-02-20 DOI: 10.1145/3580372
Andrew D. Brown, J. Beaumont, David B. Thomas, J. Shillcock, Matthew Naylor, Graeme M. Bragg, Mark L. Vousden, S. Moore, Shane T. Fleming
HPC clusters have become ever more expensive, both in terms of capital cost and energy consumption; some estimates suggest that competitive installations at the end of the next decade will require their own power station. One way around this looming problem is to design bespoke computing engines, but while the performance benefits are good, the design costs are huge and cannot easily be amortized. Partially Ordered Event Triggered System (POETS), the focus of this article, seeks to exploit a middle way: the architecture is tuned to a specific algorithmic pattern but, within that constraint, is fully programmable. POETS software is quasi-imperative: the user defines a set of sequential event handlers, defines the topology of a (typically large) concurrent ensemble of these, and lets them interact. The "solution" may be exfiltrated from the emergent behaviour of the ensemble. In this article, we describe (briefly) the architecture and an example computational chemistry application, dissipative particle dynamics (DPD). The DPD algorithm is traditionally implemented using parallel computational techniques, but we re-cast it as a concurrent compute problem that is then ideally suited to POETS. Our prototype system is realised on a cluster of 48 FPGAs providing 50K concurrent hardware threads; we report performance speedups of more than two orders of magnitude over a single-threaded baseline, with scaling behaviour that is almost constant. The results are validated against a "conventional" implementation.
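A toy illustration of the POETS programming style, sequential handlers on a concurrent ensemble whose emergent fixed point is the "solution", might look like the following max-flooding example. All names are hypothetical and this is not the POETS API; it only mimics the handler-per-message model:

```python
from collections import deque

def run_ensemble(neighbors, init):
    """POETS-style sketch: each device owns local state and fires a sequential
    handler per incoming message; the run ends when no messages remain in
    flight. Here the handler floods the maximum value through the graph."""
    state = dict(init)
    # Every device initially broadcasts its value to its neighbours.
    queue = deque((d, state[s]) for s in state for d in neighbors[s])
    while queue:
        dest, val = queue.popleft()
        if val > state[dest]:          # handler: update state and re-broadcast
            state[dest] = val
            queue.extend((d, val) for d in neighbors[dest])
    return state
```

Termination is emergent rather than orchestrated: updates only ever increase a bounded value, so the message queue eventually drains.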
Citations: 1
MCSH, a Lock with the Standard Interface
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2023-02-20 DOI: 10.1145/3584696
W. Hesselink, P. Buhr
The MCS lock of Mellor-Crummey and Scott (1991) is a very efficient first-come first-served (FCFS) mutual-exclusion algorithm that uses the atomic hardware primitives fetch-and-store and compare-and-swap. However, it has the disadvantage that the calling thread must provide a pointer to an allocated record. This additional parameter violates the standard locking interface, which has only the lock as a parameter. Hence, it is impossible to switch to MCS without editing and recompiling an application that uses locks. This article provides a variation of MCS with the standard interface, which remains FCFS, called MCSH. One key ingredient is to stack-allocate the necessary record in the acquire procedure of the lock, so its lifetime spans only the delay to enter a critical section. A second key ingredient is communicating the allocated record between the acquire and release procedures through the lock, to maintain the standard locking interface. Both of these practices are known to practitioners, but our solution combines them in a unique way. Furthermore, when these practices are used in prior papers, their correctness is often argued informally. The correctness of MCSH is verified rigorously with the proof assistant PVS, and experiments are run to compare its performance with MCS and similar locks.
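The two ingredients described above, a queue node allocated inside acquire and handed to release through the lock itself, can be sketched as follows. The hardware atomics are emulated with a small internal lock and the spin is a Python busy-wait, so this is a pedagogical model of the idea, not the verified MCSH algorithm:

```python
import threading

class MCSHLock:
    """MCS-style queue lock with the standard acquire()/release() interface.
    The queue node lives only across one acquire/release pair and is passed
    from acquire to release via self._owner_node (the MCSH trick). The atomic
    fetch-and-store / compare-and-swap are emulated with self._atomic."""

    class _Node:
        __slots__ = ("locked", "next")
        def __init__(self):
            self.locked = threading.Event()
            self.next = None

    def __init__(self):
        self._tail = None
        self._atomic = threading.Lock()   # stands in for the hardware atomics
        self._owner_node = None

    def acquire(self):
        node = MCSHLock._Node()           # per-call ("stack") allocation
        with self._atomic:                # fetch-and-store(tail, node)
            pred, self._tail = self._tail, node
        if pred is not None:
            pred.next = node              # link behind the predecessor
            node.locked.wait()            # block until the lock is handed over
        self._owner_node = node           # communicate the node to release()

    def release(self):
        node = self._owner_node
        with self._atomic:                # compare-and-swap(tail, node, None)
            if self._tail is node:
                self._tail = None         # no waiter: lock is free again
                return
        while node.next is None:          # successor is still linking itself
            pass
        node.next.locked.set()            # hand the lock to the successor
```

Because only the current owner ever calls `release`, storing the node in the lock between the two calls is safe, which is exactly what lets the record disappear from the user-visible interface.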
Citations: 0
A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2023-02-17 DOI: 10.1145/3584373
Hao Wang, Wangdong Yang, Renqiu Ouyang, Rong Hu, Kenli Li, Keqin Li
Sparse Tensor-Times-Matrix (SpTTM) is the core calculation in tensor analysis. The sparse distributions of different tensors vary greatly, which poses a big challenge to designing an efficient and general SpTTM. In this paper, we describe SpTTM on CPU-GPU heterogeneous hybrid systems and give a parallel execution strategy for SpTTM in different sparse formats. We analyze the theoretical compute power and estimate the number of tasks needed to achieve load balancing between the CPU and the GPU of a heterogeneous system. We discuss a method to describe a tensor's sparse structure by a graph structure and design a new graph neural network, SPT-GCN, to select a suitable tensor sparse format. Furthermore, we perform extensive experiments using real datasets to demonstrate the advantages and efficiency of our proposed input-aware slice-wise SpTTM. The experimental results show that our input-aware slice-wise SpTTM achieves an average speedup of 1.310× over the ParTI! library on a CPU-GPU heterogeneous system.
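For reference, the basic SpTTM operation itself (before any format selection or CPU-GPU scheduling) can be written down in a few lines for a COO-style sparse tensor. This plain kernel is only a baseline sketch, not the paper's input-aware slice-wise implementation:

```python
def spttm_mode3(entries, U):
    """Mode-3 SpTTM: for each nonzero X[i, j, k], accumulate
    X[i, j, k] * U[k][r] into the output fiber Y[(i, j)][r].

    entries: dict mapping (i, j, k) index tuples to nonzero values.
    U:       dense matrix as a list of rows, indexed U[k][r].
    The result is 'semi-sparse': sparse over (i, j), dense along mode 3."""
    R = len(U[0])
    Y = {}
    for (i, j, k), v in entries.items():
        fiber = Y.setdefault((i, j), [0.0] * R)
        for r in range(R):
            fiber[r] += v * U[k][r]
    return Y
```

Note that the output fibers are dense even though the input is sparse, which is one reason the best storage format depends so strongly on the input's nonzero structure.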
Citations: 1
GreenMD: Energy-efficient Matrix Decomposition on Heterogeneous Multi-GPU Systems
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2023-02-17 DOI: 10.1145/3583590
Hadi Zamani, L. Bhuyan, Jieyang Chen, Zizhong Chen
The current trend of performance growth in HPC systems is accompanied by a massive increase in energy consumption. In this article, we introduce GreenMD, an energy-efficient framework for LU factorization on heterogeneous systems utilizing multiple GPUs. LU factorization is a crucial, highly optimized kernel from the MAGMA library. Our aim is to apply DVFS to this application by intelligently leveraging slack on both CPUs and multiple GPUs. To predict the slack times, accurate performance models are developed separately for CPUs and GPUs based on algorithmic knowledge and the manufacturer's specifications. Since DVFS does not reduce static energy consumption, we also develop undervolting techniques for both CPUs and GPUs. Reducing voltage below threshold values may give rise to errors; hence, we extract the minimum safe voltages (VsafeMin) for the CPUs and GPUs using a low-overhead profiling phase and apply them before execution. It is shown that GreenMD improves the CPU, GPU, and total energy by about 59%, 21%, and 31%, respectively, while delivering performance similar to the state-of-the-art linear algebra MAGMA library.
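The slack-driven part of DVFS can be illustrated with a first-order model: if a device has slack before its co-runner finishes, run it at the lowest frequency whose stretched execution time still fits the deadline. This assumes execution time scales inversely with frequency, which is a common simplification and not GreenMD's calibrated performance model:

```python
def dvfs_frequency(task_time, slack, f_max, f_levels):
    """Pick the lowest frequency that still hides the slack.

    task_time: measured time at the maximum frequency f_max.
    slack:     idle time the device would spend waiting anyway.
    f_levels:  available frequency steps (same unit as f_max).
    Running at frequency f stretches the task to task_time * f_max / f,
    which must stay within task_time + slack."""
    deadline = task_time + slack
    feasible = [f for f in f_levels if task_time * f_max / f <= deadline]
    return min(feasible) if feasible else f_max
```

With zero slack the function falls back to `f_max`, i.e., DVFS only pays off where the performance models predict genuine idle time.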
Citations: 0
Investigation and Implementation of Parallelism Resources of Numerical Algorithms
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2023-02-15 DOI: 10.1145/3583755
Valentina N. Aleeva, R. Aleev
This article presents an approach to the problem of parallel computing efficiency. The theoretical basis of this approach is the concept of a Q-determinant. Any numerical algorithm has a Q-determinant. The Q-determinant of an algorithm has a clear structure and is convenient for implementation. The Q-determinant consists of Q-terms; their number is equal to the number of output data items. Each Q-term describes all possible ways to compute one of the output data items from the input data. We also describe a software Q-system for studying the parallelism resources of numerical algorithms. This system makes it possible to compute and compare the parallelism resources of numerical algorithms. The application of the Q-system is illustrated with examples of numerical algorithms having different structures of Q-determinants. Furthermore, we suggest a method for designing parallel programs for numerical algorithms. This method is based on representing a numerical algorithm in the form of a Q-determinant. As a result, we can obtain a program that completely uses the parallelism resource of the algorithm. Such programs are called Q-effective. The results of this research can be applied to increase the implementation efficiency of numerical algorithms, methods, as well as algorithmic problems on parallel computing systems.
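The notion of a parallelism resource can be illustrated on a single expression (playing the role of one Q-term): the expression tree's height gives the minimal number of parallel steps, and comparing it with the operation count shows how much work can proceed concurrently. A small sketch, not the Q-system itself and not the Q-determinant formalism:

```python
def critical_path(expr):
    """Return (height, op_count) of an expression tree.

    expr is either a leaf (an input data item, any non-tuple) or a tuple
    (op, left, right). The height is the minimal number of parallel steps
    needed to evaluate the expression; op_count / height bounds the
    parallelism usable when evaluating it."""
    if not isinstance(expr, tuple):
        return 0, 0                # a leaf costs no operations
    _, left, right = expr
    hl, nl = critical_path(left)
    hr, nr = critical_path(right)
    return 1 + max(hl, hr), 1 + nl + nr
```

For a balanced sum of four inputs, three additions complete in two parallel steps; a left-deep chain of the same three additions would need three steps, which is the kind of difference a parallelism-resource analysis exposes.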
Citations: 0
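The article's Q-system is not reproduced here, but its central quantity — the parallelism resource of an algorithm, i.e., the length of the longest dependency chain among its operations — can be illustrated with a toy reduction. A minimal sketch (illustrative only; `tree_sum` is a made-up helper, not part of the authors' Q-system) compares left-to-right summation, whose dependency chain is n - 1 additions, with pairwise summation, whose chain is only ceil(log2 n) levels:

```python
def tree_sum(xs):
    """Pairwise (tree) reduction of a sequence.

    Returns (total, depth): `depth` counts the parallel levels, i.e. the
    longest dependency chain, which grows as ceil(log2 n) instead of the
    n - 1 chain of a sequential left-to-right summation.
    """
    xs = list(xs)
    depth = 0
    while len(xs) > 1:
        # All additions at one level are independent -> one parallel step.
        level = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:  # odd element carries over unchanged
            level.append(xs[-1])
        xs = level
        depth += 1
    return xs[0], depth

print(tree_sum(range(16)))  # same total as sum(range(16)), but in 4 levels
```

Both orderings compute the same output from the same inputs — which is exactly why a description of *all* ways to compute an output (a Q-term) exposes the algorithm's parallelism resource.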
Performance Implication of Tensor Irregularity and Optimization for Distributed Tensor Decomposition
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-02-07 DOI: 10.1145/3580315
Zheng Miao, Jon C. Calhoun, Rong Ge, Jiajia Li
Tensors are used by a wide variety of applications to represent multi-dimensional data; tensor decompositions are a class of methods for latent data analytics, data compression, and so on. Many of these applications generate large tensors with irregular dimension sizes and nonzero distribution. CANDECOMP/PARAFAC decomposition (Cpd) is a popular low-rank tensor decomposition for discovering latent features. The increasing overhead on memory and execution time of Cpd for large tensors requires distributed memory implementations as the only feasible solution. The sparsity and irregularity of tensors hinder the improvement of performance and scalability of distributed memory implementations. While previous works have been proved successful in Cpd for tensors with relatively regular dimension sizes and nonzero distribution, they either deliver unsatisfactory performance and scalability for irregular tensors or require significant time overhead in preprocessing. In this work, we focus on medium-grained tensor distribution to address their limitation for irregular tensors. We first thoroughly investigate through theoretical and experimental analysis. We disclose that the main cause of poor Cpd performance and scalability is the imbalance of multiple types of computations and communications and their tradeoffs; and sparsity and irregularity make it challenging to achieve their balances and tradeoffs. Irregularity of a sparse tensor is categorized based on two aspects: very different dimension sizes and a non-uniform nonzero distribution. Typically, focusing on optimizing one type of load imbalance causes other ones more severe for irregular tensors. To address such challenges, we propose irregularity-aware distributed Cpd that leverages the sparsity and irregularity information to identify the best tradeoff between different imbalances with low time overhead. We materialize the idea with two optimization methods: the prediction-based grid configuration and matrix-oriented distribution policy, where the former forms the global balance among computations and communications, and the latter further adjusts the balances among computations. The experimental results show that our proposed irregularity-aware distributed Cpd is more scalable and outperforms the medium- and fine-grained distributed implementations by up to 4.4 × and 11.4 × on 1,536 processors, respectively. Our optimizations support different sparse tensor formats, such as compressed sparse fiber (CSF), coordinate (COO), and Hierarchical Coordinate (HiCOO), and gain good scalability for all of them.
{"title":"Performance Implication of Tensor Irregularity and Optimization for Distributed Tensor Decomposition","authors":"Zheng Miao, Jon C. Calhoun, Rong Ge, Jiajia Li","doi":"10.1145/3580315","DOIUrl":"https://doi.org/10.1145/3580315","url":null,"abstract":"Tensors are used by a wide variety of applications to represent multi-dimensional data; tensor decompositions are a class of methods for latent data analytics, data compression, and so on. Many of these applications generate large tensors with irregular dimension sizes and nonzero distribution. CANDECOMP/PARAFAC decomposition (Cpd) is a popular low-rank tensor decomposition for discovering latent features. The increasing overhead on memory and execution time of Cpd for large tensors requires distributed memory implementations as the only feasible solution. The sparsity and irregularity of tensors hinder the improvement of performance and scalability of distributed memory implementations. While previous works have been proved successful in Cpd for tensors with relatively regular dimension sizes and nonzero distribution, they either deliver unsatisfactory performance and scalability for irregular tensors or require significant time overhead in preprocessing. In this work, we focus on medium-grained tensor distribution to address their limitation for irregular tensors. We first thoroughly investigate through theoretical and experimental analysis. We disclose that the main cause of poor Cpd performance and scalability is the imbalance of multiple types of computations and communications and their tradeoffs; and sparsity and irregularity make it challenging to achieve their balances and tradeoffs. Irregularity of a sparse tensor is categorized based on two aspects: very different dimension sizes and a non-uniform nonzero distribution. Typically, focusing on optimizing one type of load imbalance causes other ones more severe for irregular tensors. To address such challenges, we propose irregularity-aware distributed Cpd that leverages the sparsity and irregularity information to identify the best tradeoff between different imbalances with low time overhead. We materialize the idea with two optimization methods: the prediction-based grid configuration and matrix-oriented distribution policy, where the former forms the global balance among computations and communications, and the latter further adjusts the balances among computations. The experimental results show that our proposed irregularity-aware distributed Cpd is more scalable and outperforms the medium- and fine-grained distributed implementations by up to 4.4 × and 11.4 × on 1,536 processors, respectively. Our optimizations support different sparse tensor formats, such as compressed sparse fiber (CSF), coordinate (COO), and Hierarchical Coordinate (HiCOO), and gain good scalability for all of them.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 27"},"PeriodicalIF":1.6,"publicationDate":"2023-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43336829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
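The distributed, sparse machinery the article studies is far beyond a short snippet, but the Cpd kernel itself is compact in the dense, single-node case. Below is a minimal alternating-least-squares (ALS) sketch of a rank-R CANDECOMP/PARAFAC decomposition in plain NumPy — purely illustrative, not the authors' implementation; `unfold`, `khatri_rao`, and `cp_als` are made-up helper names. Each factor update solves a least-squares problem against the Khatri-Rao product of the other two factors:

```python
import numpy as np

def unfold(T, mode):
    # Mode-n matricization: bring `mode` to the front, flatten the rest (C order).
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(U, V):
    # Column-wise Kronecker product: (I x R), (J x R) -> (I*J x R).
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def cp_als(X, rank, iters=100):
    # HOSVD-style init: leading left singular vectors of each unfolding.
    A, B, C = (np.linalg.svd(unfold(X, m))[0][:, :rank] for m in range(3))
    for _ in range(iters):
        # Normal-equation update per mode; (M^T M) * (N^T N) is the Gram
        # matrix of the Khatri-Rao product, so only an R x R pinv is needed.
        A = unfold(X, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(X, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(X, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Recover an exact rank-2 tensor built from known factors.
rng = np.random.default_rng(42)
A0, B0, C0 = (rng.standard_normal((s, 2)) for s in (5, 4, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, rank=2)
err = np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.1e}")
```

In a distributed setting, each of these three updates generates a different computation/communication pattern — which is precisely the multi-way balance the paper's medium-grained distribution has to negotiate.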
Tridigpu: A GPU Library for Block Tridiagonal and Banded Linear Equation Systems
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-01-31 DOI: 10.1145/3580373
Christopher J. Klein, R. Strzodka
In this article, we present a CUDA library with a C API for solving block cyclic tridiagonal and banded systems on one GPU. The library can process block tridiagonal systems with block sizes from 1 × 1 (scalar) to 4 × 4 and banded systems with up to four sub- and superdiagonals. For the compute-intensive block size cases and cases with many right-hand sides, we write out an explicit factorization to memory; however, for the scalar case, the fastest approach is to only output the coarse system and recompute the factorization. Prominent features of the library are (scaled) partial pivoting for improved numeric stability; highest-performance kernels, which completely utilize GPU memory bandwidth; and support for multiple sparse or dense right-hand side and solution vectors. The additional memory consumption is only 5% of the original tridiagonal system, which enables the solution of systems up to GPU memory size. The performance of the state-of-the-art scalar tridiagonal solver of cuSPARSE is outperformed by factor 5 for large problem sizes of 2^25 unknowns, on a GeForce RTX 2080 Ti.
{"title":"Tridigpu: A GPU Library for Block Tridiagonal and Banded Linear Equation Systems","authors":"Christopher J. Klein, R. Strzodka","doi":"10.1145/3580373","DOIUrl":"https://doi.org/10.1145/3580373","url":null,"abstract":"In this article, we present a CUDA library with a C API for solving block cyclic tridiagonal and banded systems on one GPU. The library can process block tridiagonal systems with block sizes from 1 × 1 (scalar) to 4 × 4 and banded systems with up to four sub- and superdiagonals. For the compute-intensive block size cases and cases with many right-hand sides, we write out an explicit factorization to memory; however, for the scalar case, the fastest approach is to only output the coarse system and recompute the factorization. Prominent features of the library are (scaled) partial pivoting for improved numeric stability; highest-performance kernels, which completely utilize GPU memory bandwidth; and support for multiple sparse or dense right-hand side and solution vectors. The additional memory consumption is only 5% of the original tridiagonal system, which enables the solution of systems up to GPU memory size. The performance of the state-of-the-art scalar tridiagonal solver of cuSPARSE is outperformed by factor 5 for large problem sizes of 2^25 unknowns, on a GeForce RTX 2080 Ti.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 33"},"PeriodicalIF":1.6,"publicationDate":"2023-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45870092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
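The library's GPU kernels are not reproduced here, but the scalar case it accelerates is the classic tridiagonal solve. A minimal serial Thomas-algorithm sketch (a hypothetical helper, not the library's C API; no pivoting, whereas the library adds scaled partial pivoting) shows the forward-elimination/back-substitution recurrence that GPU solvers reorganize — e.g., via cyclic reduction — to expose parallelism:

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve T x = d for tridiagonal T with sub-diagonal `a` (a[0] unused),
    diagonal `b`, and super-diagonal `c` (c[-1] has no effect on the result).
    No pivoting, so T should be diagonally dominant or otherwise well behaved."""
    n = len(d)
    cp = np.empty(n)  # modified super-diagonal
    dp = np.empty(n)  # modified right-hand side
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):           # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Demo: 1D Poisson matrix tridiag(-1, 2, -1).
n = 6
a, b, c = np.full(n, -1.0), np.full(n, 2.0), np.full(n, -1.0)
T = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
x_true = np.arange(1.0, n + 1)
x = thomas_solve(a, b, c, T @ x_true)
print(np.max(np.abs(x - x_true)))
```

The two loops are inherently sequential in this form — each step depends on the previous one — which is why GPU libraries restructure the recurrence into a coarse system plus independent sub-solves rather than running it as-is.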
Non-overlapping High-accuracy Parallel Closure for Compact Schemes: Application in Multiphysics and Complex Geometry
IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2023-01-17 DOI: 10.1145/3580005
P. Sundaram, A. Sengupta, V. K. Suman, T. Sengupta
Compact schemes are often preferred in performing scientific computing for their superior spectral resolution. Error-free parallelization of a compact scheme is a challenging task due to the requirement of additional closures at the inter-processor boundaries. Here, sources of the error due to sub-domain boundary closures for the compact schemes are analyzed with global spectral analysis. A high-accuracy parallel computing strategy devised in “A high-accuracy preserving parallel algorithm for compact schemes for DNS. ACM Trans. Parallel Comput. 7, 4, 1-32 (2020)” systematically eliminates error due to parallelization and does not require overlapping points at the sub-domain boundaries. This closure is applicable for any compact scheme and is termed here as non-overlapping high-accuracy parallel (NOHAP) sub-domain boundary closure. In the present work, the advantages of the NOHAP closure are shown with the model convection equation and by solving the compressible Navier–Stokes equation for three-dimensional Rayleigh–Taylor instability simulations involving multiphysics dynamics and high Reynolds number flow past a natural laminar flow airfoil using a body-conforming curvilinear coordinate system. Linear scalability of the NOHAP closure is shown for the large-scale simulations using up to 19,200 processors.
{"title":"Non-overlapping High-accuracy Parallel Closure for Compact Schemes: Application in Multiphysics and Complex Geometry","authors":"P. Sundaram, A. Sengupta, V. K. Suman, T. Sengupta","doi":"10.1145/3580005","DOIUrl":"https://doi.org/10.1145/3580005","url":null,"abstract":"Compact schemes are often preferred in performing scientific computing for their superior spectral resolution. Error-free parallelization of a compact scheme is a challenging task due to the requirement of additional closures at the inter-processor boundaries. Here, sources of the error due to sub-domain boundary closures for the compact schemes are analyzed with global spectral analysis. A high-accuracy parallel computing strategy devised in “A high-accuracy preserving parallel algorithm for compact schemes for DNS. ACM Trans. Parallel Comput. 7, 4, 1-32 (2020)” systematically eliminates error due to parallelization and does not require overlapping points at the sub-domain boundaries. This closure is applicable for any compact scheme and is termed here as non-overlapping high-accuracy parallel (NOHAP) sub-domain boundary closure. In the present work, the advantages of the NOHAP closure are shown with the model convection equation and by solving the compressible Navier–Stokes equation for three-dimensional Rayleigh–Taylor instability simulations involving multiphysics dynamics and high Reynolds number flow past a natural laminar flow airfoil using a body-conforming curvilinear coordinate system. Linear scalability of the NOHAP closure is shown for the large-scale simulations using up to 19,200 processors.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"10 1","pages":"1 - 28"},"PeriodicalIF":1.6,"publicationDate":"2023-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45088678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
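The spectral-resolution advantage of compact schemes comes from their implicit (Padé) stencil, which couples the derivative values at neighboring points through a tridiagonal system — exactly the coupling that makes sub-domain boundary closures delicate. A minimal serial sketch of the classical fourth-order compact first derivative with third-order one-sided boundary closures (Lele-style; a dense solve for clarity, and an illustration of the scheme family only, not the authors' NOHAP parallel closure):

```python
import numpy as np

def compact_first_derivative(f, h):
    """Fourth-order Pade compact first derivative on a uniform grid.
    Interior: (1/4) f'_{i-1} + f'_i + (1/4) f'_{i+1} = 3 (f_{i+1} - f_{i-1}) / (4 h).
    Third-order one-sided closures at both ends. Dense solve for clarity;
    practical codes solve the tridiagonal system directly."""
    n = len(f)
    L = np.zeros((n, n))
    r = np.zeros(n)
    for i in range(1, n - 1):
        L[i, i - 1 : i + 2] = 0.25, 1.0, 0.25
        r[i] = 0.75 * (f[i + 1] - f[i - 1]) / h
    # One-sided closure: f'_0 + 2 f'_1 = (-5/2 f_0 + 2 f_1 + 1/2 f_2) / h,
    # and its mirror image at the right boundary.
    L[0, :2] = 1.0, 2.0
    r[0] = (-2.5 * f[0] + 2.0 * f[1] + 0.5 * f[2]) / h
    L[-1, -2:] = 2.0, 1.0
    r[-1] = (2.5 * f[-1] - 2.0 * f[-2] - 0.5 * f[-3]) / h
    return np.linalg.solve(L, r)

x = np.linspace(0.0, np.pi, 101)
df = compact_first_derivative(np.sin(x), x[1] - x[0])
print(np.max(np.abs(df - np.cos(x))))
```

Because every derivative value depends on the whole tridiagonal solve, naively cutting the system at a processor boundary perturbs the solution globally — the error source the paper's NOHAP closure eliminates without overlapping points.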
Journal
ACM Transactions on Parallel Computing