Diego Cardoso Nunes, Bruno Loureiro Coelho, Ricardo Parizotto, Alberto Egon Schaeffer‐Filho
{"title":"没有工人落在后面(太远):网络内 ML 聚合的动态混合同步","authors":"Diego Cardoso Nunes, Bruno Loureiro Coelho, Ricardo Parizotto, Alberto Egon Schaeffer‐Filho","doi":"10.1002/nem.2290","DOIUrl":null,"url":null,"abstract":"Achieving high‐performance aggregation is essential to scaling data‐parallel distributed machine learning (ML) training. Recent research in in‐network computing has shown that offloading the aggregation to the network data plane can accelerate the aggregation process compared to traditional server‐only approaches, reducing the propagation delay and consequently speeding up distributed training. However, the existing literature on in‐network aggregation does not provide ways to deal with slower workers (called stragglers). The presence of stragglers can negatively impact distributed training, increasing the time it takes to complete. In this paper, we present Serene, an in‐network aggregation system capable of circumventing the effects of stragglers. Serene coordinates the ML workers to cooperate with a programmable switch using a hybrid synchronization approach where approaches can be changed dynamically. The synchronization can change dynamically through a control plane API that translates high‐level code into switch rules. Serene switch employs an efficient data structure for managing synchronization and a hot‐swapping mechanism to consistently change from one synchronization strategy to another. We implemented and evaluated a prototype using BMv2 and a Proof‐of‐Concept in a Tofino ASIC. We ran experiments with realistic ML workloads, including a neural network trained for image classification. Our results show that Serene can speed up training by up to 40% in emulation scenarios by reducing drastically the cumulative waiting time compared to a synchronous baseline.","PeriodicalId":14154,"journal":{"name":"International Journal of Network Management","volume":"22 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"No Worker Left (Too Far) Behind: Dynamic Hybrid Synchronization for In‐Network ML Aggregation\",\"authors\":\"Diego Cardoso Nunes, Bruno Loureiro Coelho, Ricardo Parizotto, Alberto Egon Schaeffer‐Filho\",\"doi\":\"10.1002/nem.2290\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Achieving high‐performance aggregation is essential to scaling data‐parallel distributed machine learning (ML) training. Recent research in in‐network computing has shown that offloading the aggregation to the network data plane can accelerate the aggregation process compared to traditional server‐only approaches, reducing the propagation delay and consequently speeding up distributed training. However, the existing literature on in‐network aggregation does not provide ways to deal with slower workers (called stragglers). The presence of stragglers can negatively impact distributed training, increasing the time it takes to complete. In this paper, we present Serene, an in‐network aggregation system capable of circumventing the effects of stragglers. Serene coordinates the ML workers to cooperate with a programmable switch using a hybrid synchronization approach where approaches can be changed dynamically. The synchronization can change dynamically through a control plane API that translates high‐level code into switch rules. 
Serene switch employs an efficient data structure for managing synchronization and a hot‐swapping mechanism to consistently change from one synchronization strategy to another. We implemented and evaluated a prototype using BMv2 and a Proof‐of‐Concept in a Tofino ASIC. We ran experiments with realistic ML workloads, including a neural network trained for image classification. Our results show that Serene can speed up training by up to 40% in emulation scenarios by reducing drastically the cumulative waiting time compared to a synchronous baseline.\",\"PeriodicalId\":14154,\"journal\":{\"name\":\"International Journal of Network Management\",\"volume\":\"22 1\",\"pages\":\"\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2024-07-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Network Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1002/nem.2290\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Network Management","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1002/nem.2290","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
No Worker Left (Too Far) Behind: Dynamic Hybrid Synchronization for In‐Network ML Aggregation

Abstract
Achieving high-performance aggregation is essential to scaling data-parallel distributed machine learning (ML) training. Recent research in in-network computing has shown that offloading aggregation to the network data plane can accelerate the aggregation process compared to traditional server-only approaches, reducing propagation delay and consequently speeding up distributed training. However, the existing literature on in-network aggregation does not provide ways to deal with slower workers (called stragglers). The presence of stragglers can negatively impact distributed training, increasing the time it takes to complete. In this paper, we present Serene, an in-network aggregation system capable of circumventing the effects of stragglers. Serene coordinates ML workers and a programmable switch using a hybrid synchronization approach in which the synchronization strategy can be changed dynamically through a control-plane API that translates high-level code into switch rules. The Serene switch employs an efficient data structure for managing synchronization and a hot-swapping mechanism to change consistently from one synchronization strategy to another. We implemented and evaluated a prototype using BMv2 and a proof-of-concept on a Tofino ASIC. We ran experiments with realistic ML workloads, including a neural network trained for image classification. Our results show that Serene can speed up training by up to 40% in emulation scenarios by drastically reducing the cumulative waiting time compared to a synchronous baseline.
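To make the hybrid-synchronization idea concrete, the sketch below is a minimal Python illustration, not Serene's actual control-plane API: the class, thresholds, policy names, and the install_rule hook are all assumptions introduced here. It shows one plausible way a controller could monitor per-worker progress and hot-swap between a fully synchronous policy and a bounded-staleness policy when stragglers appear, mirroring the dynamic policy changes the abstract describes at a high level.

```python
# Hypothetical sketch (not Serene's actual API): a control-plane policy that
# monitors per-worker round progress and hot-swaps the aggregation policy
# between fully synchronous (BSP) and bounded-staleness modes. Policy names,
# thresholds, and the install_rule() callback are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SyncPolicy:
    name: str             # e.g. "bsp" or "stale_k"
    staleness_bound: int  # 0 = wait for every worker; k > 0 = tolerate k rounds of lag


@dataclass
class HybridSyncController:
    num_workers: int
    install_rule: Callable[[str, int], None]  # pushes a policy to the switch (assumed hook)
    straggler_gap: int = 3                    # rounds of lag that mark a worker as a straggler
    progress: Dict[int, int] = field(default_factory=dict)  # worker id -> last completed round
    active: SyncPolicy = field(default_factory=lambda: SyncPolicy("bsp", 0))

    def report(self, worker_id: int, completed_round: int) -> None:
        """Workers call this after finishing a round; may trigger a policy swap."""
        self.progress[worker_id] = completed_round
        if len(self.progress) == self.num_workers:
            self._maybe_swap()

    def _maybe_swap(self) -> None:
        lag = max(self.progress.values()) - min(self.progress.values())
        if lag >= self.straggler_gap and self.active.name == "bsp":
            self._activate(SyncPolicy("stale_k", self.straggler_gap))
        elif lag == 0 and self.active.name != "bsp":
            self._activate(SyncPolicy("bsp", 0))

    def _activate(self, policy: SyncPolicy) -> None:
        # "Hot swap": install the new rule before retiring the old one so the
        # data plane always has a consistent policy to apply.
        self.install_rule(policy.name, policy.staleness_bound)
        self.active = policy


if __name__ == "__main__":
    # Toy usage: worker 2 falls behind, so the controller relaxes synchronization.
    ctrl = HybridSyncController(num_workers=3,
                                install_rule=lambda n, k: print(f"install {n} (k={k})"))
    for rnd in range(1, 6):
        ctrl.report(0, rnd)
        ctrl.report(1, rnd)
        ctrl.report(2, max(1, rnd - 3))  # straggler
```

In the paper, the swap itself is realized as switch rules installed through the control-plane API; the print-based install_rule callback above merely stands in for that step.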
Journal Description:
Modern computer networks and communication systems are increasing in size, scope, and heterogeneity. The promise of a single end-to-end technology has not been realized and likely never will be. The decreasing cost of bandwidth is extending the possible applications of computer networks and communication systems to entirely new domains. Problems in integrating heterogeneous wired and wireless technologies, ensuring security and quality of service, and reliably operating large-scale systems, including cloud computing, have all emerged as important topics. The one constant is the need for network management. Challenges in network management have never been greater than they are today. The International Journal of Network Management is the forum for researchers, developers, and practitioners in network management to present their work to an international audience. The journal is dedicated to the dissemination of information that will enable improved management, operation, and maintenance of computer networks and communication systems. The journal is peer reviewed and publishes original papers (both theoretical and experimental) by leading researchers, practitioners, and consultants from universities, research laboratories, and companies around the world. Issues with thematic or guest-edited special topics typically occur several times per year. Topic areas for the journal are largely defined by the taxonomy for network and service management developed by IFIP WG6.6, together with IEEE-CNOM, the IRTF-NMRG, and the Emanics Network of Excellence.