Congestion control in machine learning clusters

Proceedings of the 21st ACM Workshop on Hot Topics in Networks. Pub Date: 2022-11-14. DOI: 10.1145/3563766.3564115

S. Rajasekaran, M. Ghobadi, Gautam Kumar, Aditya Akella

Citations: 7

Abstract

This paper argues that fair-sharing, the holy grail of congestion control algorithms for decades, is not necessarily a desirable property in Machine Learning (ML) training clusters. We demonstrate that for a specific combination of jobs, introducing unfairness improves the training time for all competing jobs. We call this specific combination of jobs compatible and define the compatibility criterion using a novel geometric abstraction. Our abstraction rolls time around a circle and rotates the communication phases of jobs to identify fully compatible jobs. Using this abstraction, we demonstrate up to 1.3× improvement in the average training iteration time of popular ML models. We advocate that resource management algorithms should take job compatibility on network links into account. We then propose three directions to ameliorate the impact of network congestion in ML training clusters: (i) an adaptively unfair congestion control scheme, (ii) priority queues on switches, and (iii) precise flow scheduling.
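The geometric abstraction can be illustrated with a small sketch. The abstract does not give the exact compatibility criterion, so everything below is an assumption made for illustration: each job is modeled as a hypothetical tuple `(period, comm_start, comm_len)` in integer time units, the circle's perimeter is taken to be the least common multiple of the two iteration periods, and two jobs are treated as fully compatible if some rotation of one job's communication phases leaves them disjoint from the other's. The function names `comm_slots` and `find_compatible_rotation` are invented here and do not come from the paper.

```python
from math import gcd


def comm_slots(period, comm_start, comm_len, perimeter, shift=0):
    """Return the set of integer time slots on the circle during which
    a job communicates, after rotating the job by `shift` slots."""
    slots = set()
    # The job repeats perimeter // period times around the circle.
    for k in range(perimeter // period):
        base = k * period + comm_start + shift
        for t in range(comm_len):
            slots.add((base + t) % perimeter)
    return slots


def find_compatible_rotation(job_a, job_b):
    """Search for a rotation of job_b whose communication phases never
    overlap with job_a's. Each job is (period, comm_start, comm_len).
    Returns the first such shift, or None if no rotation works.
    This brute-force search is an illustrative assumption, not the
    paper's stated method."""
    period_a, period_b = job_a[0], job_b[0]
    # Assumed perimeter: least common multiple of the two periods.
    perimeter = period_a * period_b // gcd(period_a, period_b)
    a_slots = comm_slots(*job_a, perimeter)
    # Rotations repeat every period_b slots, so this range is exhaustive.
    for shift in range(period_b):
        b_slots = comm_slots(*job_b, perimeter, shift)
        if a_slots.isdisjoint(b_slots):
            return shift
    return None
```

For example, under these assumptions two jobs with a 10-unit iteration and a 4-unit communication phase can be interleaved by rotating one of them, whereas two jobs that each communicate for 6 of every 10 units cannot, since their communication phases together exceed the perimeter.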
