Gradient Compression Supercharged High-Performance Data Parallel DNN Training

Operating Systems Review (ACM) (Q3, Computer Science) · Pub Date: 2021-10-26 · DOI: 10.1145/3477132.3483553
Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, Yinlong Xu
{"title":"Gradient Compression Supercharged High-Performance Data Parallel DNN Training","authors":"Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, Yinlong Xu","doi":"10.1145/3477132.3483553","DOIUrl":null,"url":null,"abstract":"Gradient compression is a promising approach to alleviating the communication bottleneck in data parallel deep neural network (DNN) training by significantly reducing the data volume of gradients for synchronization. While gradient compression is being actively adopted by the industry (e.g., Facebook and AWS), our study reveals that there are two critical but often overlooked challenges: 1) inefficient coordination between compression and communication during gradient synchronization incurs substantial overheads, and 2) developing, optimizing, and integrating gradient compression algorithms into DNN systems imposes heavy burdens on DNN practitioners, and ad-hoc compression implementations often yield surprisingly poor system performance. In this paper, we first propose a compression-aware gradient synchronization architecture, CaSync, which relies on a flexible composition of basic computing and communication primitives. It is general and compatible with any gradient compression algorithms and gradient synchronization strategies, and enables high-performance computation-communication pipelining. We further introduce a gradient compression toolkit, CompLL, to enable efficient development and automated integration of on-GPU compression algorithms into DNN systems with little programming burden. Lastly, we build a compression-aware DNN training framework HiPress with CaSync and CompLL. HiPress is open-sourced and runs on mainstream DNN systems such as MXNet, TensorFlow, and PyTorch. Evaluation via a 16-node cluster with 128 NVIDIA V100 GPUs and 100Gbps network shows that HiPress improves the training speed over current compression-enabled systems (e.g., BytePS-onebit and Ring-DGC) by 17.2%-69.5% across six popular DNN models.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Operating Systems Review (ACM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3477132.3483553","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Computer Science","Score":null,"Total":0}
Citations: 29

Abstract

Gradient compression is a promising approach to alleviating the communication bottleneck in data parallel deep neural network (DNN) training by significantly reducing the data volume of gradients for synchronization. While gradient compression is being actively adopted by the industry (e.g., Facebook and AWS), our study reveals that there are two critical but often overlooked challenges: 1) inefficient coordination between compression and communication during gradient synchronization incurs substantial overheads, and 2) developing, optimizing, and integrating gradient compression algorithms into DNN systems imposes heavy burdens on DNN practitioners, and ad-hoc compression implementations often yield surprisingly poor system performance. In this paper, we first propose a compression-aware gradient synchronization architecture, CaSync, which relies on a flexible composition of basic computing and communication primitives. It is general and compatible with any gradient compression algorithms and gradient synchronization strategies, and enables high-performance computation-communication pipelining. We further introduce a gradient compression toolkit, CompLL, to enable efficient development and automated integration of on-GPU compression algorithms into DNN systems with little programming burden. Lastly, we build a compression-aware DNN training framework HiPress with CaSync and CompLL. HiPress is open-sourced and runs on mainstream DNN systems such as MXNet, TensorFlow, and PyTorch. Evaluation via a 16-node cluster with 128 NVIDIA V100 GPUs and 100Gbps network shows that HiPress improves the training speed over current compression-enabled systems (e.g., BytePS-onebit and Ring-DGC) by 17.2%-69.5% across six popular DNN models.
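To make the core idea concrete, the sketch below shows one representative lossy gradient compression scheme of the kind the abstract refers to: one-bit (sign) quantization with error feedback, conceptually similar to the onebit algorithm used in the BytePS-onebit baseline. This is an illustrative Python/PyTorch sketch, not the paper's CompLL-generated GPU kernels or the HiPress API; the function names and the packing note are assumptions for illustration only.

```python
# Illustrative sketch (not the HiPress/CompLL implementation): one-bit gradient
# compression with error feedback. Helper names are hypothetical.
import torch

def onebit_compress(grad: torch.Tensor, residual: torch.Tensor):
    """Quantize a gradient tensor to its sign plus one fp32 scale factor.

    The residual (error-feedback) buffer accumulates what quantization lost,
    so that information is re-injected into the next step's gradient.
    """
    corrected = grad + residual                 # apply error feedback
    scale = corrected.abs().mean()              # one scalar per tensor
    signs = torch.sign(corrected)               # +1 / -1 (map 0 to +1 below)
    signs[signs == 0] = 1.0
    new_residual = corrected - scale * signs    # what the 1-bit code cannot express
    # In a real system the signs would be bit-packed (32 signs per uint32 word)
    # before synchronization, giving roughly a 32x reduction in data volume.
    return signs, scale, new_residual

def onebit_decompress(signs: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Reconstruct an approximate gradient from the compact representation.
    return scale * signs

if __name__ == "__main__":
    grad = torch.randn(1_000_000)               # simulated fp32 gradient (~4 MB)
    residual = torch.zeros_like(grad)
    signs, scale, residual = onebit_compress(grad, residual)
    approx = onebit_decompress(signs, scale)
    # After bit-packing, the payload would be ~125 KB + one scalar vs. 4 MB raw.
    print("relative error:", ((approx - grad).norm() / grad.norm()).item())
```

In a compression-aware synchronization architecture such as CaSync, compress/decompress steps like these are composed with communication primitives so that compressing one gradient partition can overlap with transmitting another, rather than running compression and communication back to back.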
Source Journal

Operating Systems Review (ACM) — Computer Science: Computer Networks and Communications
CiteScore: 2.80 · Self-citation rate: 0.00% · Publications: 10
Journal description: Operating Systems Review (OSR) is a publication of the ACM Special Interest Group on Operating Systems (SIGOPS), whose scope of interest includes: computer operating systems and architecture for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and simulation; reliability, integrity, and security of data; communications among computing processors; and computer system modeling and analysis.