
Latest publications from ACM Transactions on Parallel Computing

Indigo3: A Parallel Graph Analytics Benchmark Suite for Exploring Implementation Styles and Common Bugs
IF 1.6 Q2 Computer Science Pub Date: 2024-05-15 DOI: 10.1145/3665251
Yiqian Liu, Noushin Azami, Avery VanAusdal, Martin Burtscher
Graph analytics codes are widely used and tend to exhibit input-dependent behavior, making them particularly interesting for software verification and validation. This paper presents Indigo3, a labeled benchmark suite based on 7 graph algorithms that are implemented in different styles, including versions with deliberately planted bugs. We systematically combine 13 sets of implementation styles and 15 common bug types to create the 41,790 CUDA, OpenMP, and parallel C programs in the suite. Each code is labeled with the styles and bugs it incorporates. We used 4 subsets of Indigo3 to test 5 program-verification tools. Our results show that the tools perform quite differently across the bug types and implementation styles, have distinct strengths and weaknesses, and generally struggle with graph codes. We discuss the styles and bugs that tend to be the most challenging as well as the programming patterns that yield false positives.
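As a hedged illustration of the kind of bug Indigo3 plants — the suite itself consists of CUDA, OpenMP, and parallel C codes, and the snippet below is only a Python analogue with invented names — consider a lost-update race on a shared distance value and its synchronized fix:

```python
# Illustrative analogue (not from the Indigo3 suite itself) of one common
# planted bug type: an unsynchronized read-modify-write on shared state.
import threading

dist = {"v": 10}          # hypothetical shared distance of vertex v
lock = threading.Lock()

def relax_buggy(new_dist):
    # BUG (planted style): two threads can both read the old value,
    # and one of the two updates is silently lost.
    if new_dist < dist["v"]:
        dist["v"] = new_dist

def relax_fixed(new_dist):
    # Fix: the compare-and-update must be one atomic step (CUDA atomicMin,
    # OpenMP critical/atomic); here a lock plays that role.
    with lock:
        if new_dist < dist["v"]:
            dist["v"] = new_dist

threads = [threading.Thread(target=relax_fixed, args=(d,)) for d in (7, 3, 5)]
for t in threads: t.start()
for t in threads: t.join()
print(dist["v"])  # 3, regardless of thread interleaving
```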
Citations: 0
TLPGNN: A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on Single and Multiple GPUs
IF 1.6 Q2 Computer Science Pub Date: 2024-02-09 DOI: 10.1145/3644712
Qiang Fu, Yuede Ji, Thomas B. Rolinger, H. H. Huang
Graph Neural Networks (GNNs) are an emerging class of deep learning models specifically designed for graph-structured data. They have been effectively employed in a variety of real-world applications, including recommendation systems, drug development, and analysis of social networks. GNN computation comprises regular neural network operations and general graph convolution operations; the latter account for most of the total computation time. Although several recent works have proposed ways to accelerate GNN computation, they are limited by heavy pre-processing, low-efficiency atomic operations, and unnecessary kernel launches. In this paper, we design TLPGNN, a lightweight two-level parallelism paradigm for GNN computation. First, we conduct a systematic analysis of the hardware resource usage of GNN workloads to understand their characteristics in depth. Guided by these observations, we divide the GNN computation into two levels: vertex parallelism for the first level and feature parallelism for the second. Next, we employ a novel hybrid dynamic workload assignment to address the imbalanced workload distribution. Furthermore, we fuse kernels to reduce the number of kernel launches and cache frequently accessed data in registers to avoid unnecessary memory traffic. To scale TLPGNN to multi-GPU environments, we propose an edge-aware row-wise 1-D partitioning method to ensure a balanced workload distribution across different GPU devices. Experimental results on various benchmark datasets demonstrate the superiority of our approach, achieving substantial performance improvements over state-of-the-art GNN computation systems, including Deep Graph Library (DGL), GNNAdvisor, and FeatGraph, with average speedups of 6.1×, 7.7×, and 3.0×, respectively. Evaluations of multi-GPU TLPGNN also demonstrate that our solution achieves both linear scalability and a well-balanced workload distribution.
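To make the two-level split concrete, here is a minimal, sequential Python sketch — not the authors' CUDA code; the toy graph, names, and CSR layout are illustrative — of the neighbor aggregation that TLPGNN parallelizes, with comments marking which loop each level corresponds to:

```python
# Serial sketch of TLPGNN's decomposition: in the real system the outer
# (vertex) level maps to GPU warps/blocks and the inner (feature) level
# to threads within them.
import numpy as np

def aggregate(indptr, indices, feats):
    n, f = feats.shape
    out = np.zeros((n, f), dtype=feats.dtype)
    for v in range(n):                          # level 1: vertex parallelism
        for u in indices[indptr[v]:indptr[v + 1]]:
            out[v] += feats[u]                  # level 2: feature parallelism
            # (the += over f features is the work a warp's threads share)
    return out

# Toy graph: in-neighbors per vertex in CSR form (vertex 2 has neighbors 0, 1).
indptr  = np.array([0, 0, 1, 3])
indices = np.array([0, 0, 1])
feats   = np.arange(6, dtype=np.float32).reshape(3, 2)
print(aggregate(indptr, indices, feats))        # rows: [0 0], [0 1], [2 4]
```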
Citations: 0
Introduction to the Special Issue for SPAA’21
IF 1.6 Q2 Computer Science Pub Date: 2023-12-14 DOI: 10.1145/3630608
Y. Azar, Julian Shun
{"title":"Introduction to the Special Issue for SPAA’21","authors":"Y. Azar, Julian Shun","doi":"10.1145/3630608","DOIUrl":"https://doi.org/10.1145/3630608","url":null,"abstract":"","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2023-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139001830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Conflict-Resilient Lock-Free Linearizable Calendar Queue
IF 1.6 Q2 Computer Science Pub Date: 2023-12-06 DOI: 10.1145/3635163
Romolo Marotta, Mauro Ianni, Alessandro Pellegrini, F. Quaglia
In the last two decades, great attention has been devoted to the design of non-blocking and linearizable data structures, which enable exploiting the scaled-up degree of parallelism in off-the-shelf shared-memory multi-core machines. In this context, priority queues are highly challenging. Indeed, concurrent attempts to extract the highest-priority item are prone to create detrimental thread conflicts that lead to abort/retry of the operations. In this article, we present the first priority queue that jointly provides: i) lock-freedom and linearizability; ii) conflict resiliency against concurrent extractions; iii) adaptiveness to different contention profiles; and iv) amortized constant-time access for both insertions and extractions. Beyond presenting our solution, we also provide proof of its correctness based on an assertional approach. Also, we present an experimental study on a 64-CPU machine, showing that our proposal provides performance improvements over state-of-the-art non-blocking priority queues.
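For readers unfamiliar with the underlying structure, the sketch below shows a plain sequential calendar queue (a bucketed priority queue) in Python. It is only meant to fix terminology — the paper's actual contribution, lock-free linearizable concurrent access, is not attempted here, and the bucket count and width are arbitrary parameters:

```python
# Sequential calendar-queue layout: events are hashed into "day" buckets
# by priority; concurrency control is deliberately omitted.
class CalendarQueue:
    def __init__(self, nbuckets=8, width=1.0):
        self.nbuckets, self.width = nbuckets, width
        self.buckets = [[] for _ in range(nbuckets)]
        self.size = 0

    def enqueue(self, prio, item):
        day = int(prio / self.width)             # which "day" the event is on
        self.buckets[day % self.nbuckets].append((prio, item))
        self.size += 1

    def dequeue_min(self):
        if self.size == 0:
            raise IndexError("empty")
        # Simple O(nbuckets) scan; real calendar queues resume from the
        # current day and amortize this to near-constant time.
        best = min((b for b in self.buckets if b), key=lambda b: min(b))
        entry = min(best)
        best.remove(entry)
        self.size -= 1
        return entry

q = CalendarQueue()
for p in (3.2, 0.5, 7.9):
    q.enqueue(p, f"ev{p}")
print(q.dequeue_min())  # (0.5, 'ev0.5')
```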
Citations: 0
HPS Cholesky: Hierarchical Parallelized Supernodal Cholesky with Adaptive Parameters
Q2 Computer Science Pub Date: 2023-10-26 DOI: 10.1145/3630051
Shengle Lin, Wangdong Yang, Yikun Hu, Qinyun Cai, Minlu Dai, Haotian Wang, Kenli Li
Sparse supernodal Cholesky on multi-NUMA systems is challenging due to supernode relaxation and load balancing. In this work, we propose a novel approach to improve the performance of sparse Cholesky by combining deep learning with a relaxation parameter and a hierarchical parallelization strategy with NUMA affinity. Specifically, our relaxed supernodal algorithm utilizes a well-trained GCN model to adaptively adjust relaxation parameters based on the sparse matrix’s structure, achieving a proper balance between task-level parallelism and dense computational granularity. Additionally, the hierarchical parallelization maps supernodal tasks to the local NUMA parallel queue and updates contribution blocks in pipeline mode. Furthermore, stream scheduling with NUMA affinity further enhances the efficiency of memory access during the numerical factorization. The experimental results show that HPS Cholesky outperforms state-of-the-art libraries such as Eigen LLT, CHOLMOD, PaStiX, and SuiteSparse on 79.78%, 79.60%, 82.09%, and 74.47% of 1128 datasets, respectively. It achieves an average speedup of 1.41× over the current optimal relaxation algorithm. Moreover, it surpasses MKL sparse Cholesky on 70.83% of matrices on a Xeon Gold 6248.
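A toy sketch of the relaxation knob being tuned (a hypothetical helper, not the paper's code): consecutive columns are merged into one supernode whenever doing so introduces at most `relax` explicit zeros, trading sparsity for denser, more efficient blocks:

```python
# Supernode relaxation on symbolic column patterns; `relax` is the
# parameter the paper's GCN model selects adaptively per matrix.
def relaxed_supernodes(col_patterns, relax):
    """col_patterns: one set of nonzero row indices per column."""
    supernodes, current = [], [0]
    for j in range(1, len(col_patterns)):
        cols = current + [j]
        merged = set().union(*(col_patterns[c] for c in cols))
        fill = sum(len(merged - col_patterns[c]) for c in cols)
        if fill <= relax:
            current.append(j)           # tolerate the zeros: denser block
        else:
            supernodes.append(current)  # close the supernode, start anew
            current = [j]
    supernodes.append(current)
    return supernodes

cols = [{0, 1, 2}, {1, 2}, {2, 5}, {3, 4}, {4, 5}, {5}]
print(relaxed_supernodes(cols, relax=2))  # [[0, 1], [2], [3, 4], [5]]
```

With relax=0 every column stays on its own; larger values coarsen the supernodes, shifting the balance from task-level parallelism toward dense computational granularity.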
Citations: 0
Improved Online Scheduling of Moldable Task Graphs under Common Speedup Models
Q2 Computer Science Pub Date: 2023-10-26 DOI: 10.1145/3630052
Lucas Perotin, Hongyang Sun
We consider the online scheduling problem of moldable task graphs on multiprocessor systems for minimizing the overall completion time (or makespan). Moldable job scheduling has been widely studied in the literature, in particular when tasks have dependencies (i.e., task graphs) or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we design a new online scheduling algorithm for this problem and derive constant competitive ratios under several common yet realistic speedup models (i.e., roofline, communication, Amdahl, and a general combination). These results improve the ones we have shown in the preliminary version of the paper. We also prove, for each speedup model, a lower bound on the competitiveness of any online list scheduling algorithm that allocates processors to a task based only on the task’s parameters and not on its position in the graph. This lower bound matches exactly the competitive ratio of our algorithm for the roofline, communication and Amdahl’s model, and is close to the ratio for the general model. Finally, we provide a lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.
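To make the named speedup models concrete, the snippet below evaluates standard textbook forms of the roofline, communication, and Amdahl models for a moldable task of sequential work w on p processors (exact parameterizations vary across papers; these numbers are illustrative):

```python
def t_roofline(w, p, pbar):        # perfect speedup up to a cap of pbar processors
    return w / min(p, pbar)

def t_communication(w, p, c):      # ideal compute plus per-processor overhead c
    return w / p + c * (p - 1)

def t_amdahl(w, p, gamma):         # a fraction gamma of the work is inherently serial
    return w * ((1 - gamma) / p + gamma)

w = 1000.0
print(t_roofline(w, 16, pbar=8))      # 125.0: no gain beyond 8 processors
print(t_communication(w, 16, c=2.0))  # 92.5: overhead grows with p
print(t_amdahl(w, 16, gamma=0.05))    # 109.375: the serial fraction dominates
```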
Citations: 0
Checkpointing strategies to tolerate non-memoryless failures on HPC platforms
Q2 Computer Science Pub Date: 2023-09-22 DOI: 10.1145/3624560
Anne Benoit, Lucas Perotin, Yves Robert, Frédéric Vivien
This paper studies checkpointing strategies for parallel applications subject to failures. The optimal strategy to minimize total execution time, or makespan, is well known when failure inter-arrival times (IATs) obey an Exponential distribution, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy achieves an asymptotically optimal makespan, thereby establishing the first optimality result for arbitrary failure distributions. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with high infant mortality (such as LogNormal with shape parameter k = 2.51 or Weibull with shape parameter 0.5), the execution time is divided by a factor of 1.9 on average, and by up to a factor of 4.2 for recently deployed platforms.
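For reference, the Young/Daly baseline the paper compares against prescribes a checkpointing period of W = sqrt(2·μ·C) for platform MTBF μ and checkpoint cost C. A quick worked example (the numbers are illustrative):

```python
import math

def young_daly_period(mtbf, ckpt_cost):
    # First-order optimal period under Exponential failure IATs.
    return math.sqrt(2 * mtbf * ckpt_cost)

mu, C = 24 * 3600.0, 60.0          # MTBF: one day; checkpoint cost: 60 s
W = young_daly_period(mu, C)
print(f"checkpoint every {W:.0f} s (~{W/60:.1f} min)")  # ~3220 s, ~53.7 min
```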
Citations: 0
Distributed Graph Coloring Made Easy
IF 1.6 Q2 Computer Science Pub Date: 2023-08-17 DOI: 10.1145/3605896
Yannic Maus
In this paper, we present a deterministic CONGEST algorithm to compute an O(kΔ)-vertex coloring in O(Δ/k) + log* n rounds, where Δ is the maximum degree of the network graph and k ≥ 1 can be freely chosen. The algorithm is extremely simple: each node locally computes a sequence of colors and then tries colors from the sequence in batches of size k. Our algorithm subsumes many important results in the history of distributed graph coloring as special cases, including Linial’s color reduction [Linial, FOCS’87], the celebrated locally iterative algorithm from [Barenboim, Elkin, Goldenberg, PODC’18], and various algorithms to compute defective and arbdefective colorings. Our algorithm can smoothly scale between several of these previous results and also simplifies the state-of-the-art (Δ + 1)-coloring algorithm. At the cost of losing some of the algorithm’s simplicity, we also provide an O(kΔ)-coloring algorithm running in O(√(Δ/k)) + log* n rounds. We also provide improved deterministic algorithms for ruling sets and, additionally, a tight characterization of 1-round color reduction algorithms.
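The batch-trying idea can be mimicked in a few lines. The toy, centralized simulation below is not the CONGEST algorithm itself — in particular it uses random permutations rather than the paper's carefully constructed color sequences, which are what yield the O(Δ/k) round bound — but it shows the control flow of proposing k colors per round:

```python
import random

def batch_color(adj, k, ncolors):
    n = len(adj)
    # Each node's own color sequence (here: a random permutation).
    seq = [random.sample(range(ncolors), ncolors) for _ in range(n)]
    color, pos = [None] * n, [0] * n
    while any(c is None for c in color):
        # Each uncolored node proposes its next batch of k colors.
        props = {v: [seq[v][(pos[v] + i) % ncolors] for i in range(k)]
                 for v in range(n) if color[v] is None}
        for v, cand in props.items():
            taken = {color[u] for u in adj[v]}
            # Defer to higher-ID neighbors proposing the same color,
            # so the highest-ID uncolored node always makes progress.
            rivals = {c for u in adj[v] if u > v and u in props for c in props[u]}
            pick = next((c for c in cand if c not in taken and c not in rivals), None)
            if pick is not None:
                color[v] = pick
            pos[v] += k
    return color

adj = [[1, 2], [0, 2], [0, 1, 3], [2]]   # small graph with Δ = 3
print(batch_color(adj, k=2, ncolors=8))  # a valid coloring, e.g. [4, 1, 6, 0]
```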
Citations: 0
A Fast Algorithm for Aperiodic Linear Stencil Computation using Fast Fourier Transforms
IF 1.6 Q2 Computer Science Pub Date: 2023-07-24 DOI: 10.1145/3606338
Zafar Ahmad, R. Chowdhury, Rathish Das, P. Ganapathi, Aaron Gregory, Yimin Zhu
Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, cache-oblivious divide-and-conquer trapezoidal algorithms, and Krylov subspace methods. In this paper, we present two efficient parallel algorithms for performing linear stencil computations. Current direct solvers in this domain are computationally inefficient, and Krylov methods require manual labor and mathematical training. We solve these problems for linear stencils by using DFT preconditioning on a Krylov method to achieve a direct solver that is both fast and general. Indeed, while all currently available algorithms for solving general linear stencils perform Θ(NT) work, where N is the size of the spatial grid and T is the number of timesteps, our algorithms perform o(NT) work. To the best of our knowledge, we give the first algorithms that use fast Fourier transforms to compute final grid data by evolving the initial data for many timesteps at once. Our algorithms handle both periodic and aperiodic boundary conditions, and achieve polynomially better performance bounds (i.e., computational complexity and parallel runtime) than all other existing solutions. Initial experimental results show that implementations of our algorithms that evolve grids of roughly 10^7 cells for around 10^5 timesteps run orders of magnitude faster than state-of-the-art implementations for periodic stencil problems, and 1.3× to 8.5× faster for aperiodic stencil problems. Code repository: https://github.com/TEAlab/FFTStencils
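The core FFT idea is easy to demonstrate for the periodic case: one linear stencil application is a circular convolution, so T applications collapse to a single pointwise power in Fourier space. A minimal numpy sketch (the grid size and the 3-point averaging stencil are arbitrary choices, not the paper's setup):

```python
import numpy as np

n, T = 64, 1000
u0 = np.random.rand(n)
kernel = np.zeros(n)
kernel[[-1, 0, 1]] = [0.25, 0.5, 0.25]   # u_i <- .25 u_{i-1} + .5 u_i + .25 u_{i+1}

# Direct method: Theta(n*T) work, one stencil sweep per timestep.
u = u0.copy()
for _ in range(T):
    u = 0.25 * np.roll(u, 1) + 0.5 * u + 0.25 * np.roll(u, -1)

# FFT method: one transform, one pointwise power, one inverse transform.
u_fft = np.fft.ifft(np.fft.fft(u0) * np.fft.fft(kernel) ** T).real

print(np.allclose(u, u_fft))  # True: both evolve u0 by T timesteps
```

Handling aperiodic boundary conditions while keeping this o(NT) cost is the harder part the paper addresses.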
Citations: 0