Accelerating Broadcast Communication with GPU Compression for Deep Learning Workloads

2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC). Pub Date: 2022-12-01. DOI: 10.1109/HiPC56025.2022.00016

Qinghua Zhou, Quentin G. Anthony, A. Shafi, H. Subramoni, D. Panda


Citations: 0

Abstract

As model sizes grow rapidly, state-of-the-art Deep Learning (DL) models rely on multiple GPU nodes to run distributed training. Large-message communication of GPU data between GPUs is becoming a bottleneck in overall training performance. GPU-Aware MPI libraries are widely adopted by state-of-the-art DL frameworks to improve communication performance. In existing optimization solutions for Distributed Data-Parallel (DDP) training, the broadcast operation is often used to synchronize the updated model parameters among all GPUs. However, with state-of-the-art GPU-Aware MPI libraries, broadcasting large GPU data tends to overburden training performance because of the limited bandwidth of the interconnect between GPU nodes. On the other hand, recent research on using GPU-based compression libraries to relieve pressure on the nearly saturated interconnect, and on co-designing online compression with the communication pattern, offers a new perspective for optimizing broadcast performance on modern GPU clusters.

In this paper, we redesign the GPU-Aware MPI library to enable efficient collective-level online compression with an optimized chunked-chain scheme for large-message broadcast communication. The proposed design is evaluated and shows benefits at both the microbenchmark and application levels. At the microbenchmark level, the proposed design reduces broadcast communication latency by up to 80.9% compared to the baseline using a state-of-the-art MPI library and 55.1% compared to the existing point-to-point-based compression on modern GPU clusters. For DDP training with PyTorch, the proposed design reduces training time by up to 15.0% and 6.4% compared to the existing chunked-chain scheme and point-to-point-based compression, respectively, while keeping similar training accuracy. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate broadcast communication for DL workloads.
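To make the chunked-chain idea concrete, the sketch below uses a textbook pipeline cost model to estimate broadcast latency with and without compression. It is an illustration under stated assumptions, not the paper's design or its measured results: the function name `chunked_chain_latency`, its parameters, the per-chunk codec overhead term, and all numbers in the usage lines are hypothetical.

```python
def chunked_chain_latency(msg_bytes, num_ranks, num_chunks, link_bytes_per_s,
                          startup_s=0.0, compression_ratio=1.0,
                          codec_s_per_chunk=0.0):
    """Estimate the latency of a pipelined chain broadcast (textbook model).

    Assumptions (not taken from the paper):
    - ranks form a chain root -> r1 -> ... -> r(P-1), one link per hop;
    - the message is split into equal chunks that are forwarded in a pipeline;
    - compression shrinks each chunk by `compression_ratio` on the wire and
      adds a fixed `codec_s_per_chunk` to every pipeline step.
    """
    if num_ranks < 2:
        return 0.0  # a single rank has nothing to broadcast
    if num_chunks < 1 or compression_ratio <= 0 or link_bytes_per_s <= 0:
        raise ValueError("num_chunks, compression_ratio and bandwidth must be positive")

    # Bytes that actually cross a link for one chunk.
    wire_bytes = msg_bytes / num_chunks / compression_ratio
    # Time for one pipeline step: startup + transfer + codec overhead.
    step_s = startup_s + wire_bytes / link_bytes_per_s + codec_s_per_chunk
    # The first chunk needs (P - 1) hops to reach the last rank;
    # each remaining chunk arrives one step later.
    steps = (num_ranks - 1) + (num_chunks - 1)
    return steps * step_s


# Hypothetical inputs: 256 MiB message, 8 ranks, 64 chunks, 12.5 GB/s links.
baseline = chunked_chain_latency(256 * 2**20, 8, 64, 12.5e9, startup_s=5e-6)
compressed = chunked_chain_latency(256 * 2**20, 8, 64, 12.5e9, startup_s=5e-6,
                                   compression_ratio=4.0, codec_s_per_chunk=50e-6)
print(f"baseline:   {baseline * 1e3:.3f} ms")
print(f"compressed: {compressed * 1e3:.3f} ms")
```

In this model, more chunks shorten the time spent filling the pipeline but pay the startup and codec cost more often, so compression only helps when the saved transfer time per chunk exceeds the added codec overhead.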



