TLPGNN: A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on Single and Multiple GPUs

Pub Date: 2024-02-09 DOI: 10.1145/3644712

Qiang Fu, Yuede Ji, Thomas B. Rolinger, H. H. Huang


Citations: 0


TLPGNN: A Lightweight Two-Level Parallelism Paradigm for Graph Neural Network Computation on Single and Multiple GPUs

Graph Neural Networks (GNNs) are an emerging class of deep learning models designed specifically for graph-structured data. They have been employed effectively in a variety of real-world applications, including recommendation systems, drug development, and social network analysis. GNN computation includes regular neural network operations and general graph convolution operations, which take most of the total computation time. Although several recent works have been proposed to accelerate GNN computation, they are limited by heavy pre-processing, low-efficiency atomic operations, and unnecessary kernel launches. In this paper, we design TLPGNN, a lightweight two-level parallelism paradigm for GNN computation. First, we conduct a systematic analysis of the hardware resource usage of GNN workloads to understand their characteristics in depth. Based on these observations, we divide the GNN computation into two levels: vertex parallelism for the first level and feature parallelism for the second. Next, we employ a novel hybrid dynamic workload assignment to address imbalanced workload distribution. Furthermore, we fuse kernels to reduce the number of kernel launches, and we cache frequently accessed data in registers to avoid unnecessary memory traffic. To scale TLPGNN to multi-GPU environments, we propose an edge-aware row-wise 1-D partition method that ensures a balanced workload distribution across GPU devices. Experimental results on various benchmark datasets show substantial performance improvements over state-of-the-art GNN computation systems, including Deep Graph Library (DGL), GNNAdvisor, and FeatGraph, with average speedups of 6.1×, 7.7×, and 3.0×, respectively. Evaluations of multi-GPU TLPGNN also demonstrate that our solution achieves both linear scalability and a well-balanced workload distribution.
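The abstract names the two parallelism levels and the edge-aware row-wise 1-D partition but gives no implementation details, so the following is only a rough sketch under stated assumptions, not the authors' code. It assumes the graph is stored in CSR form (`indptr`, `indices`) and uses sum aggregation. The function names `aggregate_sum` and `edge_balanced_row_partition` are hypothetical. The code runs serially in NumPy: the outer loop marks where per-vertex work could be distributed (vertex parallelism), and the vectorized add over the feature dimension stands in for per-feature lanes (feature parallelism).

```python
import numpy as np


def aggregate_sum(indptr, indices, features):
    """Sum neighbor features for every vertex of a CSR graph (serial sketch).

    Assumption: each outer-loop iteration is one independent unit of
    vertex-level work; the vectorized add over the feature axis models
    feature-level work. No atomics are needed because each vertex owns
    its output row.
    """
    num_vertices = len(indptr) - 1
    feat_dim = features.shape[1]
    out = np.zeros((num_vertices, feat_dim), dtype=features.dtype)
    for v in range(num_vertices):
        # Local accumulator: loosely mirrors keeping the running sum in
        # fast storage and writing the result back once.
        acc = np.zeros(feat_dim, dtype=features.dtype)
        for e in range(indptr[v], indptr[v + 1]):
            acc += features[indices[e]]
        out[v] = acc
    return out


def edge_balanced_row_partition(indptr, num_parts):
    """Split rows into contiguous ranges with roughly equal edge counts.

    Assumption: 'edge-aware' is modeled as cutting the CSR row-pointer
    array where the cumulative edge count crosses k/num_parts of the total.
    Returns a list of half-open (row_start, row_end) ranges.
    """
    indptr = np.asarray(indptr)
    total_edges = indptr[-1]
    targets = total_edges * np.arange(1, num_parts) / num_parts
    cuts = np.searchsorted(indptr, targets, side="left")
    bounds = np.concatenate(([0], cuts, [len(indptr) - 1]))
    return [(int(bounds[i]), int(bounds[i + 1])) for i in range(num_parts)]


if __name__ == "__main__":
    # Toy graph: vertex 0 <- {1, 2}, vertex 1 <- {0}, vertex 2 has no neighbors.
    indptr = [0, 2, 3, 3]
    indices = [1, 2, 0]
    features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    print(aggregate_sum(indptr, indices, features))

    # Skewed degrees (6, 1, 1, 1, 1, 2): an equal-vertex split would give
    # 8 vs 4 edges, while the edge-aware split gives 6 vs 6.
    print(edge_balanced_row_partition([0, 6, 7, 8, 9, 10, 12], 2))
```

In the skewed example, the two ranges are rows 0 to 1 and rows 1 to 6, each holding 6 edges. A single very high-degree vertex can still dominate one partition, since this sketch never splits a row.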
