Dissecting the Communication Latency in Distributed Deep Sparse Learning

Proceedings of the ACM Internet Measurement Conference. Pub Date: 2020-10-27. DOI: 10.1145/3419394.3423637

H. Pan, Zhenyu Li, Jianbo Dong, Zheng Cao, Tao Lan, Di Zhang, Gareth Tyson, Gaogang Xie

Citations: 11

Abstract

Distributed deep learning (DDL) uses a cluster of servers to train models in parallel. It has been applied to many problems, e.g., online advertising and friend recommendation. However, distributing the training makes the communication network a key component of system performance. In this paper, we measure Alibaba's DDL system, with a focus on understanding the bottlenecks introduced by the network. Our key finding is that communication overhead has a surprisingly large impact on performance. To explore this, we analyse latency logs of 1.38M Remote Procedure Calls between servers during model training for two real applications that use high-dimensional sparse data. We reveal the major contributors to the latency, including concurrent write/read operations of different connections and network connection management. We further observe a skewed distribution of update frequency for individual parameters, motivating us to propose using in-network computation capacity to offload server tasks.
