Communication efficiency is a widely recognised research problem in Federated Learning (FL), with recent work focused on developing techniques for efficient compression, distribution and aggregation of model parameters between clients and the server. Particularly within distributed systems, it is important to balance computational cost against communication efficiency. However, existing methods are often constrained to specific applications and are less generalisable. In this paper, we introduce FedFT (federated frequency-space transformation), a simple yet effective methodology for communicating model parameters in an FL setting. FedFT uses the Discrete Cosine Transform (DCT) to represent model parameters in frequency space, enabling efficient compression and reducing communication overhead. FedFT is compatible with various existing FL methodologies and neural architectures, and its linear property eliminates the need for multiple transformations during federated aggregation. This methodology is vital for distributed solutions, tackling essential challenges like data privacy, interoperability, and energy efficiency inherent to these environments. We demonstrate the generalisability of the FedFT methodology on four datasets using comparative studies with three state-of-the-art FL baselines (FedAvg, FedProx, FedSim). Our results demonstrate that using FedFT to represent the differences in model parameters between communication rounds in frequency space results in a more compact representation compared to representing the entire model in frequency space. This leads to a reduction in communication overhead, while keeping accuracy levels comparable and in some cases even improving them. Our results suggest that this reduction can range from 5% to 30% per client, depending on the dataset.
{"title":"FedFT: Improving Communication Performance for Federated Learning with Frequency Space Transformation","authors":"Chamath Palihawadana, Nirmalie Wiratunga, Anjana Wijekoon, Harsha Kalutarage","doi":"arxiv-2409.05242","DOIUrl":"https://doi.org/arxiv-2409.05242","url":null,"abstract":"Communication efficiency is a widely recognised research problem in Federated\u0000Learning (FL), with recent work focused on developing techniques for efficient\u0000compression, distribution and aggregation of model parameters between clients\u0000and the server. Particularly within distributed systems, it is important to\u0000balance the need for computational cost and communication efficiency. However,\u0000existing methods are often constrained to specific applications and are less\u0000generalisable. In this paper, we introduce FedFT (federated frequency-space\u0000transformation), a simple yet effective methodology for communicating model\u0000parameters in a FL setting. FedFT uses Discrete Cosine Transform (DCT) to\u0000represent model parameters in frequency space, enabling efficient compression\u0000and reducing communication overhead. FedFT is compatible with various existing\u0000FL methodologies and neural architectures, and its linear property eliminates\u0000the need for multiple transformations during federated aggregation. This\u0000methodology is vital for distributed solutions, tackling essential challenges\u0000like data privacy, interoperability, and energy efficiency inherent to these\u0000environments. We demonstrate the generalisability of the FedFT methodology on\u0000four datasets using comparative studies with three state-of-the-art FL\u0000baselines (FedAvg, FedProx, FedSim). Our results demonstrate that using FedFT\u0000to represent the differences in model parameters between communication rounds\u0000in frequency space results in a more compact representation compared to\u0000representing the entire model in frequency space. This leads to a reduction in\u0000communication overhead, while keeping accuracy levels comparable and in some\u0000cases even improving it. Our results suggest that this reduction can range from\u00005% to 30% per client, depending on dataset.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"106 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In modern materials science, effective and high-volume data management across leading-edge experimental facilities and world-class supercomputers is indispensable for cutting-edge research. Such facilities and supercomputers are typically utilized by a wide range of researchers across different fields and organizations in academia and industry. However, existing integrated systems that handle data from these resources have primarily focused on smaller-scale cross-institutional or single-domain operations. As a result, they often lack the scalability, efficiency, agility, and interdisciplinarity needed for handling substantial volumes of data from various researchers. In this paper, we introduce the ARIM-mdx data system, a nationwide data platform for materials science in Japan. The platform involves 8 universities and institutes all over Japan through the governmental materials science project. Currently in its trial phase, the ARIM-mdx data system is utilized by over 800 researchers from around 140 organizations in academia and industry, and is intended to gradually expand its reach. The system employs a hybrid architecture, combining a peta-scale dedicated storage system for security and stability with a high-performance academic cloud for efficiency and scalability. Through direct network connections between them, the system achieves a 4.7x latency reduction compared to a conventional approach, resulting in near real-time interactive data analysis. It also utilizes specialized IoT devices for secure data transfer from equipment computers and connects to multiple supercomputers via an academic ultra-fast network, achieving 4x faster data transfer compared to the public Internet. The ARIM-mdx data system, as a pioneering nationwide data platform, has the potential to contribute to the creation of new research communities and accelerate innovation.
{"title":"ARIM-mdx Data System: Towards a Nationwide Data Platform for Materials Science","authors":"Masatoshi Hanai, Ryo Ishikawa, Mitsuaki Kawamura, Masato Ohnishi, Norio Takenaka, Kou Nakamura, Daiju Matsumura, Seiji Fujikawa, Hiroki Sakamoto, Yukinori Ochiai, Tetsuo Okane, Shin-Ichiro Kuroki, Atsuo Yamada, Toyotaro Suzumura, Junichiro Shiomi, Kenjiro Taura, Yoshio Mita, Naoya Shibata, Yuichi Ikuhara","doi":"arxiv-2409.06734","DOIUrl":"https://doi.org/arxiv-2409.06734","url":null,"abstract":"In modern materials science, effective and high-volume data management across\u0000leading-edge experimental facilities and world-class supercomputers is\u0000indispensable for cutting-edge research. Such facilities and supercomputers are\u0000typically utilized by a wide range of researchers across different fields and\u0000organizations in academia and industry. However, existing integrated systems\u0000that handle data from these resources have primarily focused just on\u0000smaller-scale cross-institutional or single-domain operations. As a result,\u0000they often lack the scalability, efficiency, agility, and interdisciplinarity,\u0000needed for handling substantial volumes of data from various researchers. In this paper, we introduce ARIM-mdx data system, a nationwide data platform\u0000for materials science in Japan. The platform involves 8 universities and\u0000institutes all over Japan through the governmental materials science project.\u0000Currently in its trial phase, the ARIM-mdx data system is utilized by over 800\u0000researchers from around 140 organizations in academia and industry, being\u0000intended to gradually expand its reach. The system employs a hybrid\u0000architecture, combining a peta-scale dedicated storage system for security and\u0000stability with a high-performance academic cloud for efficiency and\u0000scalability. Through direct network connections between them, the system\u0000achieves 4.7x latency reduction compared to a conventional approach, resulting\u0000in near real-time interactive data analysis. It also utilizes specialized IoT\u0000devices for secure data transfer from equipment computers and connects to\u0000multiple supercomputers via an academic ultra-fast network, achieving 4x faster\u0000data transfer compared to the public Internet. The ARIM-mdx data system, as a\u0000pioneering nationwide data platform, has the potential to contribute to the\u0000creation of new research communities and accelerates innovations.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Preference aggregation is a fundamental problem in voting theory, in which public input rankings of a set of alternatives (called preferences) must be aggregated into a single preference that satisfies certain soundness properties. The celebrated Arrow Impossibility Theorem is equivalent to a distributed task in a synchronous fault-free system that satisfies properties such as respecting unanimous preferences, maintaining independence of irrelevant alternatives (IIA), and non-dictatorship, along with consensus since only one preference can be decided. In this work, we study a weaker distributed task in which crash faults are introduced, IIA is not required, and the consensus property is relaxed to either $k$-set agreement or $\epsilon$-approximate agreement using any metric on the set of preferences. In particular, we prove several novel impossibility results for both of these tasks in both synchronous and asynchronous distributed systems. We additionally show that the impossibility for our $\epsilon$-approximate agreement task using the Kendall tau or Spearman footrule metrics holds under extremely weak assumptions.
{"title":"Distributed Agreement in the Arrovian Framework","authors":"Kenan Wood, Hammurabi Mendes, Jonad Pulaj","doi":"arxiv-2409.04685","DOIUrl":"https://doi.org/arxiv-2409.04685","url":null,"abstract":"Preference aggregation is a fundamental problem in voting theory, in which\u0000public input rankings of a set of alternatives (called preferences) must be\u0000aggregated into a single preference that satisfies certain soundness\u0000properties. The celebrated Arrow Impossibility Theorem is equivalent to a\u0000distributed task in a synchronous fault-free system that satisfies properties\u0000such as respecting unanimous preferences, maintaining independence of\u0000irrelevant alternatives (IIA), and non-dictatorship, along with consensus since\u0000only one preference can be decided. In this work, we study a weaker distributed task in which crash faults are\u0000introduced, IIA is not required, and the consensus property is relaxed to\u0000either $k$-set agreement or $epsilon$-approximate agreement using any metric\u0000on the set of preferences. In particular, we prove several novel impossibility\u0000results for both of these tasks in both synchronous and asynchronous\u0000distributed systems. We additionally show that the impossibility for our\u0000$epsilon$-approximate agreement task using the Kendall tau or Spearman\u0000footrule metrics holds under extremely weak assumptions.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"268 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang
AllReduce is an important and popular collective communication primitive, which has been widely used in areas such as distributed machine learning and high performance computing. To design, analyze, and choose from various algorithms and implementations of AllReduce, the time cost model plays a crucial role, and the predominant one is the $(\alpha,\beta,\gamma)$ model. In this paper, we revisit this model and reveal that it cannot accurately characterize the time cost of AllReduce on modern clusters, and thus must be updated. We perform extensive measurements to identify two additional terms contributing to the time cost: the incast term and the memory access term. We augment the $(\alpha,\beta,\gamma)$ model with these two terms, and present GenModel as a result. Using GenModel, we discover two new optimalities for AllReduce algorithms, and prove that they cannot be achieved simultaneously. Finally, striking a balance between the two new optimalities, we design GenTree, an AllReduce plan generation algorithm specialized for tree-like topologies. Experiments on a real testbed with 64 GPUs show that GenTree can achieve 1.22$\times$ to 1.65$\times$ speed-up against NCCL. Large-scale simulations also confirm that GenTree can improve the state-of-the-art AllReduce algorithm by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.
{"title":"Revisiting the Time Cost Model of AllReduce","authors":"Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang","doi":"arxiv-2409.04202","DOIUrl":"https://doi.org/arxiv-2409.04202","url":null,"abstract":"AllReduce is an important and popular collective communication primitive,\u0000which has been widely used in areas such as distributed machine learning and\u0000high performance computing. To design, analyze, and choose from various\u0000algorithms and implementations of AllReduce, the time cost model plays a\u0000crucial role, and the predominant one is the $(alpha,beta,gamma)$ model. In\u0000this paper, we revisit this model, and reveal that it cannot well characterize\u0000the time cost of AllReduce on modern clusters; thus must be updated. We perform\u0000extensive measurements to identify two additional terms contributing to the\u0000time cost: the incast term and the memory access term. We augment the\u0000$(alpha,beta,gamma)$ model with these two terms, and present GenModel as a\u0000result. Using GenModel, we discover two new optimalities for AllReduce\u0000algorithms, and prove that they cannot be achieved simultaneously. Finally,\u0000striking the balance between the two new optimalities, we design GenTree, an\u0000AllReduce plan generation algorithm specialized for tree-like topologies.\u0000Experiments on a real testbed with 64 GPUs show that GenTree can achieve\u00001.22$times$ to 1.65$times$ speed-up against NCCL. Large-scale simulations\u0000also confirm that GenTree can improve the state-of-the-art AllReduce algorithm\u0000by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhenxiao Zhang, Zhidong Gao, Yuanxiong Guo, Yanmin Gong
Motivated by the drawbacks of cloud-based federated learning (FL), cooperative federated edge learning (CFEL) has been proposed to improve efficiency for FL over mobile edge networks, where multiple edge servers collaboratively coordinate the distributed model training across a large number of edge devices. However, CFEL faces critical challenges arising from dynamic and heterogeneous device properties, which slow down the convergence and increase resource consumption. This paper proposes a heterogeneity-aware CFEL scheme called Heterogeneity-Aware Cooperative Edge-based Federated Averaging (HCEF) that aims to maximize model accuracy while minimizing the training time and energy consumption via adaptive computation and communication compression in CFEL. By theoretically analyzing how local update frequency and gradient compression affect the convergence error bound in CFEL, we develop an efficient online control algorithm for HCEF to dynamically determine local update frequencies and compression ratios for heterogeneous devices. Experimental results show that compared with prior schemes, the proposed HCEF scheme can maintain higher model accuracy while reducing training latency and improving energy efficiency simultaneously.
{"title":"Heterogeneity-Aware Cooperative Federated Edge Learning with Adaptive Computation and Communication Compression","authors":"Zhenxiao Zhang, Zhidong Gao, Yuanxiong Guo, Yanmin Gong","doi":"arxiv-2409.04022","DOIUrl":"https://doi.org/arxiv-2409.04022","url":null,"abstract":"Motivated by the drawbacks of cloud-based federated learning (FL),\u0000cooperative federated edge learning (CFEL) has been proposed to improve\u0000efficiency for FL over mobile edge networks, where multiple edge servers\u0000collaboratively coordinate the distributed model training across a large number\u0000of edge devices. However, CFEL faces critical challenges arising from dynamic\u0000and heterogeneous device properties, which slow down the convergence and\u0000increase resource consumption. This paper proposes a heterogeneity-aware CFEL\u0000scheme called textit{Heterogeneity-Aware Cooperative Edge-based Federated\u0000Averaging} (HCEF) that aims to maximize the model accuracy while minimizing the\u0000training time and energy consumption via adaptive computation and communication\u0000compression in CFEL. By theoretically analyzing how local update frequency and\u0000gradient compression affect the convergence error bound in CFEL, we develop an\u0000efficient online control algorithm for HCEF to dynamically determine local\u0000update frequencies and compression ratios for heterogeneous devices.\u0000Experimental results show that compared with prior schemes, the proposed HCEF\u0000scheme can maintain higher model accuracy while reducing training latency and\u0000improving energy efficiency simultaneously.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jincheng Zhou, Jin Zhang, Xiang Zhang, Tiaojie Xiao, Di Ma, Chunye Gong
Sorting algorithms are among the most extensively researched topics in computer science and serve numerous practical applications. Although various sorting algorithms have been proposed for efficiency, different architectures offer distinct flavors to the implementation of parallel sorting. In this paper, we propose a hybrid vectorized merge sort on ARM NEON, named NEON Merge Sort (NEON-MS) for short. In detail, based on the available vector registers, we first identify the optimal number of registers to avoid register-to-memory accesses caused by the write-back of intermediate results. More importantly, following the generic merge sort framework that primarily uses a sorting network for column sort and merging networks for three types of vectorized merge, we further improve their structures for high efficiency in a unified, asymmetric way: 1) optimal sorting networks with few comparators become feasible; 2) a hybrid implementation of serial and vectorized merges keeps the pipeline full, with merge instructions highly interleaved. Experiments on a single FT2000+ core show that NEON-MS is, on average, 3.8 and 2.1 times faster than std::sort and boost::block_sort, respectively. Additionally, compared to the parallel version of the latter, NEON-MS achieves an average speedup of 1.25.
{"title":"A Hybrid Vectorized Merge Sort on ARM NEON","authors":"Jincheng Zhou, Jin Zhang, Xiang Zhang, Tiaojie Xiao, Di Ma, Chunye Gong","doi":"arxiv-2409.03970","DOIUrl":"https://doi.org/arxiv-2409.03970","url":null,"abstract":"Sorting algorithms are the most extensively researched topics in computer\u0000science and serve for numerous practical applications. Although various sorts\u0000have been proposed for efficiency, different architectures offer distinct\u0000flavors to the implementation of parallel sorting. In this paper, we propose a\u0000hybrid vectorized merge sort on ARM NEON, named NEON Merge Sort for short\u0000(NEON-MS). In detail, according to the granted register functions, we first\u0000identify the optimal register number to avoid the register-to-memory access,\u0000due to the write-back of intermediate outcomes. More importantly, following the\u0000generic merge sort framework that primarily uses sorting network for column\u0000sort and merging networks for three types of vectorized merge, we further\u0000improve their structures for high efficiency in an unified asymmetry way: 1) it\u0000makes the optimal sorting networks with few comparators become possible; 2)\u0000hybrid implementation of both serial and vectorized merges incurs the pipeline\u0000with merge instructions highly interleaved. Experiments on a single FT2000+\u0000core show that NEON-MS is 3.8 and 2.1 times faster than std::sort and\u0000boost::block_sort, respectively, on average. Additionally, as compared to the\u0000parallel version of the latter, NEON-MS gains an average speedup of 1.25.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Wen, Quanyu Zhu, Weiwei Chu, Wen-Yen Chen, Jiyan Yang
Scaling up deep learning models has proven effective for improving the intelligence of machine learning (ML) models, especially for industry recommendation models and large language models. The co-design of distributed ML systems and algorithms (to maximize training performance) plays a pivotal role in this success. As models scale, the number of co-design hyper-parameters grows rapidly, making it challenging to find the optimal setup for system performance maximization. In this paper, we propose CubicML, which uses ML to automatically optimize the training performance of distributed ML systems. In CubicML, we use an ML model as a proxy to predict training performance, for search efficiency and performance-modeling flexibility. We demonstrate that CubicML can effectively optimize the training speed of in-house ads recommendation models and large language models at Meta.
{"title":"CubicML: Automated ML for Distributed ML Systems Co-design with ML Prediction of Performance","authors":"Wei Wen, Quanyu Zhu, Weiwei Chu, Wen-Yen Chen, Jiyan Yang","doi":"arxiv-2409.04585","DOIUrl":"https://doi.org/arxiv-2409.04585","url":null,"abstract":"Scaling up deep learning models has been proven effective to improve\u0000intelligence of machine learning (ML) models, especially for industry\u0000recommendation models and large language models. The co-design of distributed\u0000ML systems and algorithms (to maximize training performance) plays a pivotal\u0000role for its success. As it scales, the number of co-design hyper-parameters\u0000grows rapidly which brings challenges to feasibly find the optimal setup for\u0000system performance maximization. In this paper, we propose CubicML which uses\u0000ML to automatically optimize training performance of distributed ML systems. In\u0000CubicML, we use a ML model as a proxy to predict the training performance for\u0000search efficiency and performance modeling flexibility. We proved that CubicML\u0000can effectively optimize training speed of in-house ads recommendation models\u0000and large language models at Meta.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui
Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with a unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with a speedup ratio of up to 71% compared to state-of-the-art training systems.
{"title":"Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management","authors":"Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui","doi":"arxiv-2409.03365","DOIUrl":"https://doi.org/arxiv-2409.03365","url":null,"abstract":"Recent foundation models are capable of handling multiple machine learning\u0000(ML) tasks and multiple data modalities with the unified base model structure\u0000and several specialized model components. However, the development of such\u0000multi-task (MT) multi-modal (MM) models poses significant model management\u0000challenges to existing training systems. Due to the sophisticated model\u0000architecture and the heterogeneous workloads of different ML tasks and data\u0000modalities, training these models usually requires massive GPU resources and\u0000suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of\u0000large-scale MT MM models through data heterogeneity-aware model management\u0000optimization. The key idea is to decompose the model execution into stages and\u0000address the joint optimization problem sequentially, including both\u0000heterogeneity-aware workload parallelization and dependency-driven execution\u0000scheduling. Based on this, we build a prototype system and evaluate it on\u0000various large MT MM models. Experiments demonstrate the superior performance\u0000and efficiency of our system, with speedup ratio up to 71% compared to\u0000state-of-the-art training systems.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The well-studied red-blue pebble game models the execution of an arbitrary computational DAG by a single processor over a two-level memory hierarchy. We present a natural generalization to a multiprocessor setting where each processor has its own limited fast memory, and all processors share unlimited slow memory. To our knowledge, this is the first thorough study that combines pebbling and DAG scheduling problems, capturing the computation of general workloads on multiple processors with memory constraints and communication costs. Our pebbling model enables us to analyze trade-offs between workload balancing, communication and memory limitations, and it captures real-world factors such as superlinear speedups due to parallelization. Our results include upper and lower bounds on the pebbling cost, an analysis of a greedy pebbling strategy, and an extension of NP-hardness results for specific DAG classes from simpler models. For our main technical contribution, we show two inapproximability results that already hold for the long-standing problem of standard red-blue pebbling: (i) the optimal I/O cost cannot be approximated to any finite factor, and (ii) the optimal total cost (I/O+computation) can only be approximated to a limited constant factor, i.e., it does not allow for a polynomial-time approximation scheme. These results also carry over naturally to our multiprocessor pebbling model.
{"title":"Red-Blue Pebbling with Multiple Processors: Time, Communication and Memory Trade-offs","authors":"Toni Böhnlein, Pál András Papp, A. N. Yzelman","doi":"arxiv-2409.03898","DOIUrl":"https://doi.org/arxiv-2409.03898","url":null,"abstract":"The well-studied red-blue pebble game models the execution of an arbitrary\u0000computational DAG by a single processor over a two-level memory hierarchy. We\u0000present a natural generalization to a multiprocessor setting where each\u0000processor has its own limited fast memory, and all processors share unlimited\u0000slow memory. To our knowledge, this is the first thorough study that combines\u0000pebbling and DAG scheduling problems, capturing the computation of general\u0000workloads on multiple processors with memory constraints and communication\u0000costs. Our pebbling model enables us to analyze trade-offs between workload\u0000balancing, communication and memory limitations, and it captures real-world\u0000factors such as superlinear speedups due to parallelization. Our results include upper and lower bounds on the pebbling cost, an analysis\u0000of a greedy pebbling strategy, and an extension of NP-hardness results for\u0000specific DAG classes from simpler models. For our main technical contribution,\u0000we show two inapproximability results that already hold for the long-standing\u0000problem of standard red-blue pebbling: (i) the optimal I/O cost cannot be\u0000approximated to any finite factor, and (ii) the optimal total cost\u0000(I/O+computation) can only be approximated to a limited constant factor, i.e.,\u0000it does not allow for a polynomial-time approximation scheme. These results\u0000also carry over naturally to our multiprocessor pebbling model.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jayden Serenari, Sreekanth Sreekumar, Kaiwen Zhao, Saurabh Sarkar, Stephen Lee
Serverless computing is an emerging cloud computing abstraction wherein the cloud platform transparently manages all resources, including explicit resource provisioning and geographical load balancing when demand for a service spikes. Users provide code as functions, and the cloud platform runs these functions, handling all aspects of function execution. While prior work has primarily focused on optimizing performance, this paper focuses on reducing the carbon footprint of these systems by making variations in grid carbon intensity and intermittency from renewables transparent to the user. We introduce GreenWhisk, a carbon-aware serverless computing platform built upon Apache OpenWhisk, operating in two modes - grid-connected and grid-isolated - to address intermittency challenges arising from renewables and the grid's carbon footprint. Moreover, we develop carbon-aware load balancing algorithms that leverage energy and carbon information to reduce the carbon footprint. Our evaluation results show that GreenWhisk can easily incorporate carbon-aware algorithms, thereby reducing the carbon footprint of functions without significantly impacting the performance of function execution. In doing so, our system design enables the integration of new carbon-aware strategies into a serverless computing platform.
{"title":"GreenWhisk: Emission-Aware Computing for Serverless Platform","authors":"Jayden Serenari, Sreekanth Sreekumar, Kaiwen Zhao, Saurabh Sarkar, Stephen Lee","doi":"arxiv-2409.03029","DOIUrl":"https://doi.org/arxiv-2409.03029","url":null,"abstract":"Serverless computing is an emerging cloud computing abstraction wherein the\u0000cloud platform transparently manages all resources, including explicitly\u0000provisioning resources and geographical load balancing when the demand for\u0000service spikes. Users provide code as functions, and the cloud platform runs\u0000these functions handling all aspects of function execution. While prior work\u0000has primarily focused on optimizing performance, this paper focuses on reducing\u0000the carbon footprint of these systems making variations in grid carbon\u0000intensity and intermittency from renewables transparent to the user. We\u0000introduce GreenWhisk, a carbon-aware serverless computing platform built upon\u0000Apache OpenWhisk, operating in two modes - grid-connected and grid-isolated -\u0000addressing intermittency challenges arising from renewables and the grid's\u0000carbon footprint. Moreover, we develop carbon-aware load balancing algorithms\u0000that leverage energy and carbon information to reduce the carbon footprint. Our\u0000evaluation results show that GreenWhisk can easily incorporate carbon-aware\u0000algorithms, thereby reducing the carbon footprint of functions without\u0000significantly impacting the performance of function execution. In doing so, our\u0000system design enables the integration of new carbon-aware strategies into a\u0000serverless computing platform.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}