Breaking the Memory Wall for Heterogeneous Federated Learning via Model Splitting

Impact Factor: 6.0 | Zone 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | IEEE Transactions on Parallel and Distributed Systems | Pub Date: 2024-10-14 | DOI: 10.1109/TPDS.2024.3480115

Chunlin Tian;Li Li;Kahou Tam;Yebo Wu;Cheng-Zhong Xu

IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 12, pp. 2513-2526. Journal article. https://ieeexplore.ieee.org/document/10716559/

Citations: 0

Abstract

Federated Learning (FL) enables multiple devices to collaboratively train a shared model while preserving data privacy. Ever-increasing model complexity, coupled with limited memory resources on the participating devices, severely bottlenecks the deployment of FL in real-world scenarios. Thus, a framework that can effectively break the memory wall while jointly taking into account the hardware and statistical heterogeneity in FL is urgently required. In this article, we propose SmartSplit, a framework that effectively reduces the memory footprint on the device side while guaranteeing the training progress and model accuracy for heterogeneous FL through model splitting. To this end, SmartSplit employs a hierarchical structure to adaptively guide the overall training process. In each training round, the central manager, hosted on the server, dynamically selects the participating devices and sets the cutting layer by jointly considering the memory budget, training capacity, and data distribution of each device. The MEC manager, deployed within the edge server, then splits the local model and trains the server-side portion. Meanwhile, it fine-tunes the splitting points based on the time-evolving statistical importance. The on-device manager, embedded inside each mobile device, continuously monitors the local training status while employing cost-aware checkpointing to match the runtime dynamic memory budget. Extensive experiments on representative datasets are conducted on commercial off-the-shelf mobile device testbeds. The experimental results show that SmartSplit excels in FL training on highly memory-constrained mobile SoCs, offering up to a 94% peak latency reduction and 100-fold memory savings. It improves accuracy by 1.49%-57.18% and adaptively adjusts to dynamic memory budgets through cost-aware recomputation.
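
To make the idea of setting a cutting layer from a device's memory budget concrete, here is a minimal sketch. It is not SmartSplit's actual policy, which the abstract says also weighs training capacity and data distribution. The function name `choose_cut_layer`, the per-layer memory figures, and the device names are all hypothetical. The sketch keeps as many leading layers on the device as fit within its budget and leaves the rest to the edge server.

```python
from typing import Optional, Sequence


def choose_cut_layer(layer_mem_mb: Sequence[float], budget_mb: float) -> Optional[int]:
    """Return how many leading layers fit on the device within budget_mb.

    Illustrative greedy rule (an assumption, not the paper's algorithm):
    layers [0, cut) stay on the device, layers [cut, n) run on the server.
    Returns None when even the first layer exceeds the budget.
    """
    used = 0.0
    cut = 0
    for mem in layer_mem_mb:
        if used + mem > budget_mb:
            break  # the next layer would exceed the device budget
        used += mem
        cut += 1
    return cut if cut > 0 else None


# Hypothetical per-layer training memory (MB) for a 5-layer model.
layers = [40.0, 60.0, 120.0, 200.0, 80.0]

# Hypothetical devices with different memory budgets (MB).
budgets = {"phone_a": 256.0, "phone_b": 100.0, "phone_c": 30.0}

for device, budget in budgets.items():
    print(device, choose_cut_layer(layers, budget))
```

A device for which the function returns `None` cannot host even the first layer under this rule, so a selection step would have to skip it or lower its memory demand by other means, such as the recomputation the abstract mentions.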




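Source journal: IEEE Transactions on Parallel and Distributed Systems (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 11.00

Self-citation rate: 9.40%

Articles published: 281

Review time: 5.6 months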

About the journal: IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to:

a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms; scalability of algorithms and data structures for parallel and distributed systems; communication and synchronization protocols; network algorithms; scheduling; and load balancing.

b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowdsourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems.

c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale system designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation.

d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.