
Latest publications from the 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

An Experimental Study of Two-level Schwarz Domain-Decomposition Preconditioners on GPUs
Pub Date : 2023-04-10 DOI: 10.1109/IPDPS54959.2023.00073
I. Yamazaki, Alexander Heinlein, S. Rajamanickam
The generalized Dryja–Smith–Widlund (GDSW) preconditioner is a two-level overlapping Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlapping Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence rate of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the solution of sparse linear systems arising from the discretization of a wide range of partial differential equations. In this paper, we present FROSch (Fast and Robust Schwarz), a domain decomposition solver package which implements GDSW-type preconditioners for both CPU and GPU clusters. To improve the solver performance on GPUs, we use a novel decomposition to run multiple MPI processes on each GPU, reducing both the solver’s computational and storage costs and potentially improving the convergence rate. This allowed us to obtain competitive or faster performance using GPUs compared to using CPUs alone. We demonstrate the performance of FROSch on the Summit supercomputer with NVIDIA V100 GPUs, where we used the NVIDIA Multi-Process Service (MPS) to implement our decomposition strategy. The solver has a wide variety of algorithmic and implementation choices, which poses both opportunities and challenges for its GPU implementation. We conduct a thorough experimental study with different solver options, including the exact or inexact solution of the local overlapping subdomain problems on a GPU. We also discuss the effect of using the iterative variant of the incomplete LU factorization and sparse-triangular solve as the approximate local solver, and of using lower precision for computing the whole FROSch preconditioner. Overall, the solve time was reduced by a factor of about 2× using GPUs, while the GPU acceleration of the numerical setup time depends on the solver options and the local matrix sizes.
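The two-level structure described above can be made concrete with a small sketch. Below is a minimal NumPy implementation of one application of an additive two-level Schwarz preconditioner: a coarse correction through a basis Phi plus one solve per overlapping subdomain. The dense solves and the index-set representation of the subdomains are illustrative assumptions; FROSch itself offers exact and inexact local solvers, as the abstract notes.

import numpy as np

def apply_two_level_schwarz(A, r, Phi, subdomains):
    """One additive two-level Schwarz application: z = M^{-1} r.

    A          : (n, n) system matrix (dense here, sparse in practice)
    r          : (n,) residual vector (floating point)
    Phi        : (n, m) coarse-space basis (e.g., GDSW energy-minimizing functions)
    subdomains : list of index arrays, one per overlapping subdomain
    """
    z = np.zeros_like(r)
    # Coarse-level correction: Phi (Phi^T A Phi)^{-1} Phi^T r
    A0 = Phi.T @ A @ Phi
    z += Phi @ np.linalg.solve(A0, Phi.T @ r)
    # Local corrections: R_i^T A_i^{-1} R_i r for each overlapping subdomain
    for idx in subdomains:
        Ai = A[np.ix_(idx, idx)]
        z[idx] += np.linalg.solve(Ai, r[idx])
    return z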
Citations: 1
Dynasparse: Accelerating GNN Inference through Dynamic Sparsity Exploitation
Pub Date : 2023-03-22 DOI: 10.1109/IPDPS54959.2023.00032
Bingyi Zhang, V. Prasanna
Graph Neural Network (GNN) inference is used in many real-world applications. Data sparsity in GNN inference, including sparsity in the input graph and the GNN model, offers opportunities to further speed up inference. Also, many pruning techniques for model compression have been proposed that increase the data sparsity of GNNs. We propose Dynasparse, a comprehensive hardware-software codesign on FPGA to accelerate GNN inference through dynamic sparsity exploitation. For this, we decouple the GNN computation kernels from the basic computation primitives, and explore hardware-software codesign as follows: 1) Hardware design: We propose a novel unified accelerator design on FPGA to efficiently execute various computation primitives. We develop a customized soft processor that is tightly coupled with the accelerator to execute a runtime system. Moreover, we develop efficient hardware mechanisms to profile the data sparsity and perform on-the-fly data format transformation to prepare the input data for the various computation primitives. 2) Software design: We develop a runtime system that works synergistically with the accelerator to perform dynamic kernel-to-primitive mapping based on data sparsity. We implement Dynasparse on a state-of-the-art FPGA platform, Xilinx Alveo U250, and evaluate the design using widely used GNN models (GCN, GraphSAGE, GIN, and SGC). For the above GNN models and various input graphs, the proposed accelerator and dynamic kernel-to-primitive mapping reduce the inference latency by 3.73× on average compared with the static mapping strategies employed in state-of-the-art GNN accelerators. Compared with state-of-the-art CPU (GPU) implementations, Dynasparse achieves up to 56.9× (2.37×) speedup in end-to-end latency. Compared with state-of-the-art FPGA implementations, Dynasparse achieves 2.7× speedup in accelerator execution latency.
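The kernel-to-primitive idea can be pictured in a few lines: profile each operand's density at runtime and dispatch the kernel to the primitive that density suggests. The cutoffs and primitive names below are illustrative assumptions; Dynasparse derives its mapping from runtime profiling on the FPGA, not from fixed constants.

import numpy as np

def choose_primitive(matrix, dense_cutoff=0.1, sparse_cutoff=0.01):
    # Measure density and map it to a computation primitive.
    density = np.count_nonzero(matrix) / matrix.size
    if density >= dense_cutoff:
        return "gemm"      # dense x dense multiply
    if density >= sparse_cutoff:
        return "spmm"      # sparse x dense multiply
    return "spgemm"        # sparse x sparse multiply

def map_layer_to_primitives(adjacency, weights):
    # Dispatch aggregation (A @ H) and update (H @ W) independently,
    # mirroring the decoupling of GNN kernels from computation primitives.
    return {
        "aggregate": choose_primitive(adjacency),
        "update": choose_primitive(weights),
    }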
Citations: 1
RT-DBSCAN: Accelerating DBSCAN using Ray Tracing Hardware
Pub Date : 2023-03-16 DOI: 10.1109/IPDPS54959.2023.00100
Vani Nagarajan, Milind Kulkarni
General Purpose computing on Graphical Processing Units (GPGPU) has resulted in unprecedented levels of speedup over its CPU counterparts, allowing programmers to harness the computational power of GPU shader cores to accelerate other computing applications. But this style of acceleration is best suited for regular computations (e.g., linear algebra). Recent GPUs feature new Ray Tracing (RT) cores that instead speed up the irregular process of ray tracing using Bounding Volume Hierarchies. While these cores seem limited in functionality, they can be used to accelerate n-body problems by leveraging RT cores to accelerate the required distance computations. In this work, we propose RT-DBSCAN, the first RT-accelerated DBSCAN implementation. We use RT cores to accelerate Density-Based Spatial Clustering of Applications with Noise (DBSCAN) by translating fixed-radius nearest neighbor queries to ray tracing queries. We show that leveraging the RT hardware results in speedups of 1.3x to 4x over current state-of-the-art, GPU-based DBSCAN implementations.
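The reduction at the heart of RT-DBSCAN is that DBSCAN only ever asks one geometric question: which points lie within eps of a query point? The sketch below runs DBSCAN on top of exactly that primitive, with scipy's cKDTree standing in for the RT-core BVH traversal that answers the same fixed-radius query on the GPU.

from collections import deque
import numpy as np
from scipy.spatial import cKDTree

def dbscan_fixed_radius(points, eps, min_pts):
    # KD-tree stand-in for the RT-core bounding-volume-hierarchy query.
    tree = cKDTree(points)
    labels = np.full(len(points), -1)       # -1 = unvisited/noise
    cluster = 0
    for p in range(len(points)):
        if labels[p] != -1:
            continue
        neighbors = tree.query_ball_point(points[p], eps)
        if len(neighbors) < min_pts:
            continue                         # not a core point; may be claimed later
        labels[p] = cluster                  # start a new cluster and expand it
        queue = deque(neighbors)
        while queue:
            q = queue.popleft()
            if labels[q] != -1:
                continue
            labels[q] = cluster
            q_neighbors = tree.query_ball_point(points[q], eps)
            if len(q_neighbors) >= min_pts:  # only core points extend the frontier
                queue.extend(q_neighbors)
        cluster += 1
    return labels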
Citations: 1
MCR-DL: Mix-and-Match Communication Runtime for Deep Learning
Pub Date : 2023-03-15 DOI: 10.1109/IPDPS54959.2023.00103
Quentin G. Anthony, A. Awan, Jeff Rasley, Yuxiong He, A. Shafi, M. Abduljabbar, H. Subramoni, D. Panda
In recent years, the training requirements of many state-of-the-art Deep Learning (DL) models have scaled beyond the compute and memory capabilities of a single processor, and necessitated distribution among processors. Training such massive models necessitates advanced parallelism strategies [1], [2] to maintain efficiency. However, such distributed DL parallelism strategies require a varied mixture of collective and point-to-point communication operations across a broad range of message sizes and scales. Examples of models using advanced parallelism strategies include Deep Learning Recommendation Models (DLRM) [3] and Mixture-of-Experts (MoE) [4], [5]. Communication libraries’ performance varies wildly across different communication operations, scales, and message sizes. We propose MCR-DL: an extensible DL communication framework that supports all point-to-point and collective operations while enabling users to dynamically mix-and-match communication backends for a given operation without deadlocks. MCR-DL also comes packaged with a tuning suite for dynamically selecting the best communication backend for a given input tensor. We select DeepSpeed-MoE and DLRM as candidate DL models and demonstrate a 31% improvement in DS-MoE throughput on 256 V100 GPUs on the Lassen HPC system. Further, we achieve a 20% throughput improvement in a dense Megatron-DeepSpeed model and a 25% throughput improvement in DLRM on 32 A100 GPUs with the Theta-GPU HPC system.
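The mix-and-match dispatch can be pictured as a tuned lookup from (operation, message size) to a backend. The table below is a hand-written stand-in: MCR-DL populates the equivalent mapping with its tuning suite from measured performance per operation, message size, and scale, and the backend names and cutoffs here are illustrative assumptions only.

class CommDispatcher:
    def __init__(self):
        # op -> ordered list of (max_message_bytes, backend); first match wins.
        self.table = {
            "allreduce": [(65536, "mpi"), (float("inf"), "nccl")],
            "send":      [(float("inf"), "mpi")],
        }

    def backend_for(self, op, nbytes):
        # Route a call to the backend the tuning pass selected for this size.
        for max_bytes, backend in self.table[op]:
            if nbytes <= max_bytes:
                return backend

dispatcher = CommDispatcher()
print(dispatcher.backend_for("allreduce", 1 << 10))   # small message -> 'mpi'
print(dispatcher.backend_for("allreduce", 64 << 20))  # large message -> 'nccl'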
Citations: 1
GPU-enabled Function-as-a-Service for Machine Learning Inference
Pub Date : 2023-03-09 DOI: 10.1109/IPDPS54959.2023.00096
Ming Zhao, Kritshekhar Jha, Sungho Hong
Function-as-a-Service (FaaS) is emerging as an important cloud computing service model as it can improve the scalability and usability of a wide range of applications, especially Machine-Learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks heavily rely on GPUs to achieve high performance; however, support for GPUs is currently lacking in existing FaaS solutions. The unique event-triggered and short-lived nature of functions poses new challenges to enabling GPUs on FaaS, which must consider the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that enables ML inference functions to efficiently utilize GPUs to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it provides caching of ML models in GPU memory to improve the performance of model inference functions, and global management of GPU memories to improve cache utilization. Third, it offers co-designed GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and the proposed locality-aware scheduler achieves a speedup of 48x compared to default schedulers that perform load balancing only.
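The locality-aware policy can be sketched in a few lines: prefer a worker whose GPU memory already caches the requested model, and fall back to the least-loaded worker otherwise. The worker representation below is an illustrative assumption, not the paper's implementation.

def schedule(model_id, workers):
    # Prefer GPU workers that already hold the model in their memory cache.
    cached = [w for w in workers if model_id in w["cached_models"]]
    pool = cached if cached else workers
    best = min(pool, key=lambda w: w["load"])   # break ties by current load
    if not cached:
        best["cached_models"].add(model_id)     # model gets loaded into GPU memory
    best["load"] += 1
    return best

workers = [
    {"name": "gpu0", "cached_models": {"resnet50"}, "load": 3},
    {"name": "gpu1", "cached_models": set(), "load": 1},
]
print(schedule("resnet50", workers)["name"])    # -> 'gpu0', despite its higher load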
Citations: 1
HyScale-GNN: A Scalable Hybrid GNN Training System on Single-Node Heterogeneous Architecture
Pub Date : 2023-03-01 DOI: 10.1109/IPDPS54959.2023.00062
Yi-Chien Lin, V. Prasanna
Graph Neural Networks (GNNs) have shown success in many real-world applications that involve graph-structured data. Most of the existing single-node GNN training systems are capable of training medium-scale graphs with tens of millions of edges; however, scaling them to large-scale graphs with billions of edges remains challenging. In addition, it is challenging to map GNN training algorithms onto a computation node, as state-of-the-art machines feature heterogeneous architectures consisting of multiple processors and a variety of accelerators. We propose HyScale-GNN, a novel system to train GNN models on a single-node heterogeneous architecture. HyScale-GNN performs hybrid training, which utilizes both the processors and the accelerators to train a model collaboratively. Our system design overcomes the memory size limitation of existing works and is optimized for training GNNs on large-scale graphs. We propose a two-stage data pre-fetching scheme to reduce the communication overhead during GNN training. To improve task mapping efficiency, we propose a dynamic resource management mechanism, which adjusts the workload assignment and resource allocation during runtime. We evaluate HyScale-GNN on a CPU-GPU and a CPU-FPGA heterogeneous architecture. Using several large-scale datasets and two widely-used GNN models, we compare the performance of our design with a multi-GPU baseline implemented in PyTorch-Geometric. The CPU-GPU design and the CPU-FPGA design achieve up to 2.08× speedup and 12.6× speedup, respectively. Compared with state-of-the-art large-scale multi-node GNN training systems such as P3 and DistDGL, our CPU-FPGA design achieves up to 5.27× speedup using a single node.
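The overlap that pre-fetching buys can be seen in a single-stage sketch: while the accelerator trains on batch i, a background thread fetches the data for batch i+1. This is a simplification under stated assumptions; HyScale-GNN's actual scheme stages the transfers in two steps across the memory hierarchy.

from concurrent.futures import ThreadPoolExecutor

def train_with_prefetch(batches, fetch_features, train_step):
    # Pipeline data movement behind compute: fetch batch i+1 while batch i trains.
    if not batches:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_data = pool.submit(fetch_features, batches[0])
        for i in range(len(batches)):
            data = next_data.result()               # wait for the prefetched batch
            if i + 1 < len(batches):                # immediately start the next fetch
                next_data = pool.submit(fetch_features, batches[i + 1])
            train_step(data)                        # compute overlaps the fetch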
Citations: 1
Efficient Hardware Primitives for Immediate Memory Reclamation in Optimistic Data Structures
Pub Date : 2023-02-25 DOI: 10.1109/IPDPS54959.2023.00021
Ajay Singh, Trevor Brown, Michael F. Spear
Safe memory reclamation (SMR) algorithms are crucial for preventing use-after-free errors in optimistic data structures. SMR algorithms typically delay reclamation for safety and reclaim objects in batches for efficiency. It is difficult to strike a balance between performance and space efficiency. Small batch sizes and frequent reclamation attempts lead to high overhead, while freeing large batches can lead to long program interruptions and high memory footprints. An ideal SMR algorithm would forgo batching and reclaim memory immediately, without suffering high reclamation overheads. To this end, we propose Conditional Access: a set of hardware instructions that offer immediate reclamation and low overhead in optimistic data structures. Conditional Access harnesses cache coherence to enable threads to efficiently detect potential use-after-free errors without explicit shared memory communication, and without introducing additional coherence traffic. We implement and evaluate Conditional Access in Graphite, a multicore simulator. Our experiments show that Conditional Access can rival the performance of highly optimized and carefully tuned SMR algorithms while simultaneously allowing immediate reclamation. This results in concurrent data structures with similar memory footprints to their sequential counterparts.
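The usage pattern the primitives enable — access a node optimistically, then keep the result only if the node was not reclaimed in the meantime — can be pictured with a version-based software analogy, essentially a sequence-lock check. Everything below is hypothetical illustration: the real proposal detects the conflict in hardware through cache coherence, with no explicit shared-memory communication.

class VersionedNode:
    # Hypothetical stand-in: writers bump `version` when a node is freed and
    # recycled, so a reader can detect that its access raced with reclamation.
    def __init__(self, key, nxt=None):
        self.key = key
        self.next = nxt
        self.version = 0          # incremented on free/reuse

def conditional_read_next(node, expected_version):
    # Optimistically read a field, then confirm the node still exists.
    value = node.next                         # optimistic access
    if node.version != expected_version:      # node was reclaimed and reused
        return None                           # caller restarts its traversal
    return value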
Citations: 0
k-Center Clustering with Outliers in the MPC and Streaming Model
Pub Date : 2023-02-24 DOI: 10.1109/IPDPS54959.2023.00090
M. D. Berg, Leyla Biabani, M. Monemizadeh
Given a point set P ⊆ X of size n in a metric space (X, dist) of doubling dimension d and two parameters k ∈ ℕ and z ∈ ℕ, the k-center problem with z outliers asks to return a set $\mathcal{C}^* = \{c_1^*, \dots, c_k^*\} \subseteq X$ of k centers such that the maximum distance of all but z points of P to their nearest center in C* is minimized. An (ε, k, z)-coreset for this problem is a weighted point set P* such that an optimal solution for the k-center problem with z outliers on P* gives a (1 ± ε)-approximation for the k-center problem with z outliers on P. We study the construction of such coresets in the Massively Parallel Computing (MPC) model, and in the insertion-only as well as the fully dynamic streaming model. We obtain the following results, for any given 0 < ε ⩽ 1; in all cases, the size of the computed coreset is O(k/ε^d + z).

• In the MPC model the data are distributed over m machines. One is the coordinator machine, which will contain the final answer; the others are worker machines. We present a deterministic 2-round algorithm using $O(\sqrt{n})$ machines, where the worker machines have $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z+1))$ local memory, and the coordinator has $O(\sqrt{nk/\varepsilon^d} + \sqrt{n} \cdot \log(z+1) + z)$ local memory. The algorithm can handle point sets P that are distributed arbitrarily (possibly adversarially) over the machines. We also present a randomized algorithm that uses only a single round, under the assumption that the input set P is initially distributed randomly over the machines. Then we present a deterministic algorithm that obtains a trade-off between the number of rounds, R, and the storage per machine.

• In the streaming model we have a single machine with limited storage, and P is revealed in a streaming fashion.

○ We present the first lower bound for the insertion-only streaming model, where the points arrive one by one and no points are deleted. We show that any deterministic algorithm that maintains an (ε, k, z)-coreset must use Ω(k/ε^d + z) space. We complement this by a deterministic streaming algorithm using O(k/ε^d + z) space, which is thus optimal.

○ For fully dynamic data streams, where points can be inserted as well as deleted, we give a randomized algorithm for point sets from a d-dimensional discrete Euclidean space [Δ]^d, where Δ ∈ ℕ indicates the size of the universe from which the coordinates are taken. Our algorithm uses only O((k/ε^d + z) log^4(kΔ/(εδ))) space, and it is the first algorithm for this setting. We also present an Ω((k/ε^d) log Δ + z) lower bound for deterministic fully dynamic streaming algorithms.

○ For the sliding-window model, we show that any deterministic streaming algorithm that guarantees a (1 + ε)-approximation for the k-center problem with outliers in ℝ^d must use Ω((kz/ε^d) log σ) space, where σ is the ratio of the largest and smallest distance between any two points in the stream. This (negatively) answers a question posed by De Berg, Monemizadeh, and Zhong [1].
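As a concrete reference point for the objective being approximated, here is a small NumPy sketch of (i) the cost of a candidate center set with z outliers — the maximum distance to the nearest center after discarding the z farthest points — and (ii) the classic farthest-point greedy of Gonzalez, a 2-approximation in the outlier-free case. Both are textbook routines included only to ground the problem definition; they are not the paper's MPC or streaming algorithms.

import numpy as np

def k_center_cost_with_outliers(points, centers, z):
    # Distance from every point to its nearest center, then drop the z farthest.
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    d.sort()
    return d[-(z + 1)] if z < len(d) else 0.0

def gonzalez_greedy(points, k):
    # Farthest-point traversal: repeatedly add the point farthest from the centers.
    centers = [points[0]]
    for _ in range(k - 1):
        c = np.array(centers)
        d = np.linalg.norm(points[:, None, :] - c[None, :, :], axis=2).min(axis=1)
        centers.append(points[int(np.argmax(d))])
    return np.array(centers)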
Citations: 3
Engineering Massively Parallel MST Algorithms
Pub Date : 2023-02-23 DOI: 10.1109/IPDPS54959.2023.00075
P. Sanders, M. Schimek
We develop and extensively evaluate highly scalable distributed-memory algorithms for computing minimum spanning trees (MSTs). At the heart of our solutions is a scalable variant of Borůvka’s algorithm. For partitioned graphs with many local edges we improve this with an effective form of contracting local parts of the graph during a preprocessing step. We also adapt the filtering concept of the best practical sequential algorithm to develop a massively parallel Filter-Borůvka algorithm that is very useful for graphs with poor locality and high average degree. Our experiments indicate that our algorithms scale well up to at least 65 536 cores and are up to 800 times faster than previous distributed MST algorithms.
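For reference, here is the sequential skeleton that the distributed algorithms build on: in each Borůvka round, every component selects its lightest incident edge, and the components joined by those edges are contracted. The union-find bookkeeping below is an illustrative sequential stand-in for the graph contraction the paper performs in parallel.

def boruvka_mst(n, edges):
    # n: number of vertices (0..n-1); edges: list of (weight, u, v) tuples.
    parent = list(range(n))

    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst, components = [], n
    while components > 1:
        cheapest = {}                  # component root -> lightest incident edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for root in (ru, rv):
                if root not in cheapest or (w, u, v) < cheapest[root]:
                    cheapest[root] = (w, u, v)
        if not cheapest:
            break                      # remaining components are disconnected
        for w, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:               # guard against merging twice via one edge
                parent[ru] = rv        # contract the two components
                mst.append((w, u, v))
                components -= 1
    return mst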
Citations: 2
Engineering a Distributed-Memory Triangle Counting Algorithm
Pub Date : 2023-02-22 DOI: 10.1109/IPDPS54959.2023.00076
P. Sanders, Tim Niklas Uhl
Counting the triangles in a graph, in total and incident to each vertex, is a fundamental and frequently considered task of graph analysis. We consider how to efficiently do this for huge graphs using massively parallel distributed-memory machines. Unsurprisingly, the main issue is to reduce communication between processors. We achieve this by counting locally whenever possible and by reducing the amount of information that needs to be sent in order to handle (possible) nonlocal triangles. We also achieve linear memory requirements despite superlinear communication volume by introducing a new asynchronous sparse-all-to-all operation. Furthermore, we dramatically reduce startup overheads by allowing this communication to use indirect routing. Our algorithms scale (at least) up to 32 768 cores and are up to 18 times faster than the previous state of the art.
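The local-counting core is the standard ordered neighbor-intersection kernel sketched below: orient each edge from lower to higher rank so every triangle is discovered exactly once. The distributed algorithm applies this kernel locally and falls back to communication only for the (possible) nonlocal triangles; the sketch itself is the textbook sequential routine, not the paper's parallel code.

def count_triangles(adj):
    # adj: dict mapping vertex -> set of neighbors (undirected graph).
    rank = {v: (len(adj[v]), v) for v in adj}    # degree order, ties broken by id
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}
    # Each triangle {a, b, c} with rank a < b < c is counted once, at edge (a, b).
    return sum(len(out[u] & out[v]) for u in adj for v in out[u])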
Citations: 2