{"title":"BIRD+: Design of a Lightweight Communication Compressor for Resource-Constrained Distribution Learning Platforms","authors":"Donglei Wu;Weihao Yang;Xiangyu Zou;Hao Feng;Dingwen Tao;Shiyi Li;Wen Xia;Binxing Fang","doi":"10.1109/TPDS.2024.3447221","DOIUrl":null,"url":null,"abstract":"The Top-K sparsification-based compression framework is extensively explored for reducing communication costs in distributed learning. However, we identified several issues with existing Top-K sparsification-based compression methods: (\n<i>i</i>\n) The limited compressibility of the Top-K parameter's indexes critically restricts the overall communication compression ratio; (\n<i>ii</i>\n) Several time-consuming compression operations significantly offset the benefits of communication compression; (\n<i>iii</i>\n) The use of error feedback techniques to maintain model quality results in a high memory footprint consumption. To solve these issues, we propose BIRD, a lightweight tensor-wise \n<i>Bi-Random sampling</i>\n strategy with an expectation invariance property. Specifically, BIRD applies a tensor-wise \n<i>index sharing</i>\n mechanism that reduces the index proportion by allowing multiple tensor elements to share a single index, thus improving the overall compression ratio. Additionally, BIRD replaces the time-consuming Top-K sorting with a faster \n<i>Bi-Random sampling</i>\n strategy based on the aforementioned \n<i>index sharing</i>\n mechanism, significantly reducing compression overheads; Moreover, BIRD establishes an \n<i>expectation invariance</i>\n property into the \n<i>Bi-Random sampling</i>\n to ensure an approximate unbiased representation for the \n<inline-formula><tex-math>$L_1$</tex-math></inline-formula>\n-norm of the sampled tensors, effectively maintaining the model quality without incurring extra memory costs. We further optimize BIRD to BIRD+ by introducing the uniform distribution-based sampling and Gamma correction on the tensor-wise sampling process, achieving a more flexibly adjustment of the sparsity with better convergence performance. Experimental evaluations across multiple conventional distributed learning tasks demonstrate that compared to state-of-the-art approaches, BIRD+ achieves higher communication compression ratios up to 36.2\n<inline-formula><tex-math>$\\times$</tex-math></inline-formula>\n and higher computation throughput up to 149.6\n<inline-formula><tex-math>$\\times$</tex-math></inline-formula>\n while maintaining the model quality without incurring extra memory costs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2193-2207"},"PeriodicalIF":5.6000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10643365/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0
Abstract
The Top-K sparsification-based compression framework has been extensively explored for reducing communication costs in distributed learning. However, we identify several issues with existing Top-K sparsification-based compression methods: (i) the limited compressibility of the Top-K parameters' indexes critically restricts the overall communication compression ratio; (ii) several time-consuming compression operations significantly offset the benefits of communication compression; and (iii) the error feedback techniques used to maintain model quality incur a high memory footprint. To address these issues, we propose BIRD, a lightweight tensor-wise Bi-Random sampling strategy with an expectation invariance property. Specifically, BIRD applies a tensor-wise index sharing mechanism that reduces the index proportion by allowing multiple tensor elements to share a single index, thus improving the overall compression ratio. BIRD also replaces the time-consuming Top-K sorting with a faster Bi-Random sampling strategy built on this index sharing mechanism, significantly reducing compression overheads. Moreover, BIRD incorporates an expectation invariance property into the Bi-Random sampling to ensure an approximately unbiased representation of the $L_1$-norm of the sampled tensors, effectively maintaining model quality without incurring extra memory costs. We further extend BIRD to BIRD+ by introducing uniform-distribution-based sampling and Gamma correction into the tensor-wise sampling process, enabling more flexible adjustment of the sparsity and better convergence performance. Experimental evaluations across multiple conventional distributed learning tasks demonstrate that, compared to state-of-the-art approaches, BIRD+ achieves up to 36.2$\times$ higher communication compression ratios and up to 149.6$\times$ higher computation throughput while maintaining model quality without incurring extra memory costs.
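To make the mechanisms described above concrete, the following is a minimal NumPy sketch (not the authors' implementation) that contrasts a Top-K baseline with a block-wise random-sampling compressor: each kept block reuses a single shared index (index sharing), and sampled values are rescaled by the inverse sampling rate so that the expected $L_1$-norm of the reconstruction matches that of the original tensor, which is the intuition behind the expectation invariance property. The block size, sampling rate, and function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch (assumed names and parameters): Top-K sparsification vs.
# a block-wise random-sampling compressor that preserves the expected L1 norm.
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    """Baseline: keep the k largest-magnitude entries and their per-element indexes."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # selection pass over the whole tensor
    return idx, flat[idx]

def random_block_compress(grad: np.ndarray, block: int = 64, rate: float = 0.1, seed: int = 0):
    """Sampling-based alternative: one shared index per block of `block` elements
    (index sharing); kept values are scaled by 1/rate so that the expected L1 norm
    of the reconstruction equals the L1 norm of the original tensor."""
    rng = np.random.default_rng(seed)
    flat = grad.ravel()
    n_blocks = (flat.size + block - 1) // block
    keep = rng.random(n_blocks) < rate             # sample whole blocks uniformly at random
    block_ids = np.flatnonzero(keep)               # one index shared by `block` values
    values = [flat[b * block:(b + 1) * block] / rate for b in block_ids]
    return block_ids, values

def decompress_blocks(block_ids, values, size: int, block: int = 64):
    """Rebuild a dense tensor from the shared block indexes and scaled values."""
    out = np.zeros(size)
    for b, v in zip(block_ids, values):
        out[b * block:b * block + v.size] = v
    return out

if __name__ == "__main__":
    g = np.random.randn(1 << 16)
    ids, vals = random_block_compress(g, rate=0.1, seed=1)
    rec = decompress_blocks(ids, vals, g.size)
    # The reconstruction is sparse, but its L1 norm matches the original in expectation.
    print(np.abs(g).sum(), np.abs(rec).sum())
```

Because a block with `block` elements is addressed by one index, the index overhead per transmitted value drops by roughly a factor of `block` compared to per-element Top-K indexing, and the random block selection avoids the sorting pass entirely; the paper's Bi-Random sampling and Gamma correction refine this basic idea.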
About the Journal
IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:
a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing.
b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.
c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.
d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.