The scalability of distributed Deep Neural Network (DNN) training is critically limited by its substantial communication latency, which arises from the massive volume of gradient data and the frequent synchronization rounds. While existing techniques such as gradient sparsification and quantization aim to alleviate this bottleneck, they often compromise model accuracy or introduce significant computational costs. We observe that the gradients of different DNN layers exhibit varying sensitivities to model performance and drastically different distributions at different training stages. Moreover, the computational overhead of gradient clustering itself is predictable. Motivated by these insights, we propose a Layer- and Stage-wise Adaptive Gradient Sparsification (AGSLS) method based on a gradient clustering idea. AGSLS employs a novel efficiency-aware scheme that first identifies which gradient layers should be sparsified and which are better transmitted directly, ensuring that every sparsification operation contributes positively to training acceleration. Furthermore, AGSLS introduces a layer-wise adaptive gradient sensing scheme that tailors the number of clusters for each layer at each training stage, thereby minimizing communication traffic without sacrificing model accuracy. We evaluate AGSLS extensively on multiple widely used datasets and models, spanning image classification (ResNet-18, VGGNet-16, and AlexNet on CIFAR-10/100) and natural language processing (BERT on SST-2). The results demonstrate that AGSLS significantly outperforms existing approaches, including Bulk Synchronous Parallel (BSP), STL-SGD, DGC, and RedSync, reducing training time by up to 86.67%, 56.6%, 52.1%, and 57.1%, respectively.
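To make the gradient-clustering idea concrete, the following is a minimal Python sketch, not the paper's implementation: the routine names (`cluster_gradients`, `should_compress`), the byte-size efficiency check standing in for the efficiency-aware scheme, and the per-stage cluster-count schedule are all illustrative assumptions, and the actual layer- and stage-wise rules in AGSLS may differ.

```python
# Illustrative sketch (assumed, not the AGSLS implementation): cluster a layer's
# gradient values so that only k centroids plus compact labels need to be sent,
# and only compress a layer when the encoded size beats sending raw gradients.
import numpy as np


def cluster_gradients(grad: np.ndarray, k: int, iters: int = 10):
    """1-D k-means over gradient values: returns centroids, labels, and shape."""
    flat = grad.ravel()
    # Initialize centroids from evenly spaced quantiles of the gradient values.
    centroids = np.quantile(flat, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(k):
            members = flat[labels == c]
            if members.size:
                centroids[c] = members.mean()
    return centroids, labels.astype(np.uint8), grad.shape


def should_compress(numel: int, k: int, bytes_per_label: int = 1) -> bool:
    """Toy efficiency check: compress only if centroids + labels are smaller
    than the raw float32 gradient tensor (a stand-in for the paper's
    efficiency-aware scheme)."""
    raw_bytes = numel * 4
    encoded_bytes = k * 4 + numel * bytes_per_label
    return encoded_bytes < raw_bytes


# Example: one layer's gradient, with an assumed schedule that shrinks the
# cluster count k in later training stages.
rng = np.random.default_rng(0)
grad = rng.normal(scale=1e-3, size=(256, 64)).astype(np.float32)
for stage, k in [("early", 32), ("late", 8)]:
    if should_compress(grad.size, k):
        centroids, labels, shape = cluster_gradients(grad, k)
        recon = centroids[labels].reshape(shape)  # receiver-side reconstruction
        err = np.abs(recon - grad).mean()
        print(f"{stage}: k={k}, mean abs reconstruction error={err:.2e}")
    else:
        print(f"{stage}: send raw gradients")
```

In a distributed setting, only the k centroids and the compact label tensor would be communicated for a compressed layer, and each receiver reconstructs the gradient as in the example; layers failing the efficiency check are transmitted directly.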
