Communication efficiency is a widely recognised research problem in Federated Learning (FL), with recent work focused on developing techniques for efficient compression, distribution and aggregation of model parameters between clients and the server. Particularly within distributed systems, it is important to balance computational cost against communication efficiency. However, existing methods are often constrained to specific applications and are less generalisable. In this paper, we introduce FedFT (federated frequency-space transformation), a simple yet effective methodology for communicating model parameters in an FL setting. FedFT uses the Discrete Cosine Transform (DCT) to represent model parameters in frequency space, enabling efficient compression and reducing communication overhead. FedFT is compatible with various existing FL methodologies and neural architectures, and its linear property eliminates the need for multiple transformations during federated aggregation. This methodology is vital for distributed solutions, tackling essential challenges like data privacy, interoperability, and energy efficiency inherent to these environments. We demonstrate the generalisability of the FedFT methodology on four datasets using comparative studies with three state-of-the-art FL baselines (FedAvg, FedProx, FedSim). Our results demonstrate that using FedFT to represent the differences in model parameters between communication rounds in frequency space results in a more compact representation compared to representing the entire model in frequency space. This leads to a reduction in communication overhead, while keeping accuracy levels comparable and in some cases even improving them. Our results suggest that this reduction can range from 5% to 30% per client, depending on the dataset.
{"title":"FedFT: Improving Communication Performance for Federated Learning with Frequency Space Transformation","authors":"Chamath Palihawadana, Nirmalie Wiratunga, Anjana Wijekoon, Harsha Kalutarage","doi":"arxiv-2409.05242","DOIUrl":"https://doi.org/arxiv-2409.05242","url":null,"abstract":"Communication efficiency is a widely recognised research problem in Federated\u0000Learning (FL), with recent work focused on developing techniques for efficient\u0000compression, distribution and aggregation of model parameters between clients\u0000and the server. Particularly within distributed systems, it is important to\u0000balance the need for computational cost and communication efficiency. However,\u0000existing methods are often constrained to specific applications and are less\u0000generalisable. In this paper, we introduce FedFT (federated frequency-space\u0000transformation), a simple yet effective methodology for communicating model\u0000parameters in a FL setting. FedFT uses Discrete Cosine Transform (DCT) to\u0000represent model parameters in frequency space, enabling efficient compression\u0000and reducing communication overhead. FedFT is compatible with various existing\u0000FL methodologies and neural architectures, and its linear property eliminates\u0000the need for multiple transformations during federated aggregation. This\u0000methodology is vital for distributed solutions, tackling essential challenges\u0000like data privacy, interoperability, and energy efficiency inherent to these\u0000environments. We demonstrate the generalisability of the FedFT methodology on\u0000four datasets using comparative studies with three state-of-the-art FL\u0000baselines (FedAvg, FedProx, FedSim). Our results demonstrate that using FedFT\u0000to represent the differences in model parameters between communication rounds\u0000in frequency space results in a more compact representation compared to\u0000representing the entire model in frequency space. This leads to a reduction in\u0000communication overhead, while keeping accuracy levels comparable and in some\u0000cases even improving it. Our results suggest that this reduction can range from\u00005% to 30% per client, depending on dataset.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"106 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In modern materials science, effective and high-volume data management across leading-edge experimental facilities and world-class supercomputers is indispensable for cutting-edge research. Such facilities and supercomputers are typically utilized by a wide range of researchers across different fields and organizations in academia and industry. However, existing integrated systems that handle data from these resources have primarily focused on smaller-scale cross-institutional or single-domain operations. As a result, they often lack the scalability, efficiency, agility, and interdisciplinarity needed for handling substantial volumes of data from various researchers. In this paper, we introduce the ARIM-mdx data system, a nationwide data platform for materials science in Japan. The platform involves 8 universities and institutes all over Japan through the governmental materials science project. Currently in its trial phase, the ARIM-mdx data system is utilized by over 800 researchers from around 140 organizations in academia and industry, and is intended to gradually expand its reach. The system employs a hybrid architecture, combining a peta-scale dedicated storage system for security and stability with a high-performance academic cloud for efficiency and scalability. Through direct network connections between them, the system achieves a 4.7x latency reduction compared to a conventional approach, resulting in near real-time interactive data analysis. It also utilizes specialized IoT devices for secure data transfer from equipment computers and connects to multiple supercomputers via an academic ultra-fast network, achieving 4x faster data transfer compared to the public Internet. The ARIM-mdx data system, as a pioneering nationwide data platform, has the potential to contribute to the creation of new research communities and accelerate innovation.
{"title":"ARIM-mdx Data System: Towards a Nationwide Data Platform for Materials Science","authors":"Masatoshi Hanai, Ryo Ishikawa, Mitsuaki Kawamura, Masato Ohnishi, Norio Takenaka, Kou Nakamura, Daiju Matsumura, Seiji Fujikawa, Hiroki Sakamoto, Yukinori Ochiai, Tetsuo Okane, Shin-Ichiro Kuroki, Atsuo Yamada, Toyotaro Suzumura, Junichiro Shiomi, Kenjiro Taura, Yoshio Mita, Naoya Shibata, Yuichi Ikuhara","doi":"arxiv-2409.06734","DOIUrl":"https://doi.org/arxiv-2409.06734","url":null,"abstract":"In modern materials science, effective and high-volume data management across\u0000leading-edge experimental facilities and world-class supercomputers is\u0000indispensable for cutting-edge research. Such facilities and supercomputers are\u0000typically utilized by a wide range of researchers across different fields and\u0000organizations in academia and industry. However, existing integrated systems\u0000that handle data from these resources have primarily focused just on\u0000smaller-scale cross-institutional or single-domain operations. As a result,\u0000they often lack the scalability, efficiency, agility, and interdisciplinarity,\u0000needed for handling substantial volumes of data from various researchers. In this paper, we introduce ARIM-mdx data system, a nationwide data platform\u0000for materials science in Japan. The platform involves 8 universities and\u0000institutes all over Japan through the governmental materials science project.\u0000Currently in its trial phase, the ARIM-mdx data system is utilized by over 800\u0000researchers from around 140 organizations in academia and industry, being\u0000intended to gradually expand its reach. The system employs a hybrid\u0000architecture, combining a peta-scale dedicated storage system for security and\u0000stability with a high-performance academic cloud for efficiency and\u0000scalability. Through direct network connections between them, the system\u0000achieves 4.7x latency reduction compared to a conventional approach, resulting\u0000in near real-time interactive data analysis. It also utilizes specialized IoT\u0000devices for secure data transfer from equipment computers and connects to\u0000multiple supercomputers via an academic ultra-fast network, achieving 4x faster\u0000data transfer compared to the public Internet. The ARIM-mdx data system, as a\u0000pioneering nationwide data platform, has the potential to contribute to the\u0000creation of new research communities and accelerates innovations.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Preference aggregation is a fundamental problem in voting theory, in which public input rankings of a set of alternatives (called preferences) must be aggregated into a single preference that satisfies certain soundness properties. The celebrated Arrow Impossibility Theorem is equivalent to a distributed task in a synchronous fault-free system that satisfies properties such as respecting unanimous preferences, maintaining independence of irrelevant alternatives (IIA), and non-dictatorship, along with consensus since only one preference can be decided. In this work, we study a weaker distributed task in which crash faults are introduced, IIA is not required, and the consensus property is relaxed to either $k$-set agreement or $\epsilon$-approximate agreement using any metric on the set of preferences. In particular, we prove several novel impossibility results for both of these tasks in both synchronous and asynchronous distributed systems. We additionally show that the impossibility for our $\epsilon$-approximate agreement task using the Kendall tau or Spearman footrule metrics holds under extremely weak assumptions.
{"title":"Distributed Agreement in the Arrovian Framework","authors":"Kenan Wood, Hammurabi Mendes, Jonad Pulaj","doi":"arxiv-2409.04685","DOIUrl":"https://doi.org/arxiv-2409.04685","url":null,"abstract":"Preference aggregation is a fundamental problem in voting theory, in which\u0000public input rankings of a set of alternatives (called preferences) must be\u0000aggregated into a single preference that satisfies certain soundness\u0000properties. The celebrated Arrow Impossibility Theorem is equivalent to a\u0000distributed task in a synchronous fault-free system that satisfies properties\u0000such as respecting unanimous preferences, maintaining independence of\u0000irrelevant alternatives (IIA), and non-dictatorship, along with consensus since\u0000only one preference can be decided. In this work, we study a weaker distributed task in which crash faults are\u0000introduced, IIA is not required, and the consensus property is relaxed to\u0000either $k$-set agreement or $epsilon$-approximate agreement using any metric\u0000on the set of preferences. In particular, we prove several novel impossibility\u0000results for both of these tasks in both synchronous and asynchronous\u0000distributed systems. We additionally show that the impossibility for our\u0000$epsilon$-approximate agreement task using the Kendall tau or Spearman\u0000footrule metrics holds under extremely weak assumptions.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"268 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang
AllReduce is an important and popular collective communication primitive, which has been widely used in areas such as distributed machine learning and high performance computing. To design, analyze, and choose from various algorithms and implementations of AllReduce, the time cost model plays a crucial role, and the predominant one is the $(\alpha,\beta,\gamma)$ model. In this paper, we revisit this model and reveal that it cannot accurately characterize the time cost of AllReduce on modern clusters, and thus must be updated. We perform extensive measurements to identify two additional terms contributing to the time cost: the incast term and the memory access term. We augment the $(\alpha,\beta,\gamma)$ model with these two terms, and present GenModel as a result. Using GenModel, we discover two new optimalities for AllReduce algorithms, and prove that they cannot be achieved simultaneously. Finally, striking a balance between the two new optimalities, we design GenTree, an AllReduce plan generation algorithm specialized for tree-like topologies. Experiments on a real testbed with 64 GPUs show that GenTree can achieve 1.22$\times$ to 1.65$\times$ speed-up against NCCL. Large-scale simulations also confirm that GenTree can improve the state-of-the-art AllReduce algorithm by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.
{"title":"Revisiting the Time Cost Model of AllReduce","authors":"Dian Xiong, Li Chen, Youhe Jiang, Dan Li, Shuai Wang, Songtao Wang","doi":"arxiv-2409.04202","DOIUrl":"https://doi.org/arxiv-2409.04202","url":null,"abstract":"AllReduce is an important and popular collective communication primitive,\u0000which has been widely used in areas such as distributed machine learning and\u0000high performance computing. To design, analyze, and choose from various\u0000algorithms and implementations of AllReduce, the time cost model plays a\u0000crucial role, and the predominant one is the $(alpha,beta,gamma)$ model. In\u0000this paper, we revisit this model, and reveal that it cannot well characterize\u0000the time cost of AllReduce on modern clusters; thus must be updated. We perform\u0000extensive measurements to identify two additional terms contributing to the\u0000time cost: the incast term and the memory access term. We augment the\u0000$(alpha,beta,gamma)$ model with these two terms, and present GenModel as a\u0000result. Using GenModel, we discover two new optimalities for AllReduce\u0000algorithms, and prove that they cannot be achieved simultaneously. Finally,\u0000striking the balance between the two new optimalities, we design GenTree, an\u0000AllReduce plan generation algorithm specialized for tree-like topologies.\u0000Experiments on a real testbed with 64 GPUs show that GenTree can achieve\u00001.22$times$ to 1.65$times$ speed-up against NCCL. Large-scale simulations\u0000also confirm that GenTree can improve the state-of-the-art AllReduce algorithm\u0000by a factor of $1.2$ to $7.4$ in scenarios where the two new terms dominate.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhenxiao Zhang, Zhidong Gao, Yuanxiong Guo, Yanmin Gong
Motivated by the drawbacks of cloud-based federated learning (FL), cooperative federated edge learning (CFEL) has been proposed to improve efficiency for FL over mobile edge networks, where multiple edge servers collaboratively coordinate the distributed model training across a large number of edge devices. However, CFEL faces critical challenges arising from dynamic and heterogeneous device properties, which slow down the convergence and increase resource consumption. This paper proposes a heterogeneity-aware CFEL scheme called Heterogeneity-Aware Cooperative Edge-based Federated Averaging (HCEF) that aims to maximize model accuracy while minimizing the training time and energy consumption via adaptive computation and communication compression in CFEL. By theoretically analyzing how local update frequency and gradient compression affect the convergence error bound in CFEL, we develop an efficient online control algorithm for HCEF to dynamically determine local update frequencies and compression ratios for heterogeneous devices. Experimental results show that compared with prior schemes, the proposed HCEF scheme can maintain higher model accuracy while reducing training latency and improving energy efficiency simultaneously.
{"title":"Heterogeneity-Aware Cooperative Federated Edge Learning with Adaptive Computation and Communication Compression","authors":"Zhenxiao Zhang, Zhidong Gao, Yuanxiong Guo, Yanmin Gong","doi":"arxiv-2409.04022","DOIUrl":"https://doi.org/arxiv-2409.04022","url":null,"abstract":"Motivated by the drawbacks of cloud-based federated learning (FL),\u0000cooperative federated edge learning (CFEL) has been proposed to improve\u0000efficiency for FL over mobile edge networks, where multiple edge servers\u0000collaboratively coordinate the distributed model training across a large number\u0000of edge devices. However, CFEL faces critical challenges arising from dynamic\u0000and heterogeneous device properties, which slow down the convergence and\u0000increase resource consumption. This paper proposes a heterogeneity-aware CFEL\u0000scheme called textit{Heterogeneity-Aware Cooperative Edge-based Federated\u0000Averaging} (HCEF) that aims to maximize the model accuracy while minimizing the\u0000training time and energy consumption via adaptive computation and communication\u0000compression in CFEL. By theoretically analyzing how local update frequency and\u0000gradient compression affect the convergence error bound in CFEL, we develop an\u0000efficient online control algorithm for HCEF to dynamically determine local\u0000update frequencies and compression ratios for heterogeneous devices.\u0000Experimental results show that compared with prior schemes, the proposed HCEF\u0000scheme can maintain higher model accuracy while reducing training latency and\u0000improving energy efficiency simultaneously.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jincheng Zhou, Jin Zhang, Xiang Zhang, Tiaojie Xiao, Di Ma, Chunye Gong
Sorting algorithms are among the most extensively researched topics in computer science and serve numerous practical applications. Although various sorting algorithms have been proposed for efficiency, different architectures offer distinct flavors to the implementation of parallel sorting. In this paper, we propose a hybrid vectorized merge sort on ARM NEON, named NEON Merge Sort (NEON-MS) for short. In detail, based on the available vector registers, we first identify the optimal number of registers to avoid register-to-memory accesses caused by the write-back of intermediate results. More importantly, following the generic merge sort framework that primarily uses a sorting network for column sort and merging networks for three types of vectorized merge, we further improve their structures for high efficiency in a unified, asymmetric way: 1) optimal sorting networks with few comparators become feasible; 2) a hybrid implementation of serial and vectorized merges keeps the pipeline full, with merge instructions highly interleaved. Experiments on a single FT2000+ core show that NEON-MS is, on average, 3.8 and 2.1 times faster than std::sort and boost::block_sort, respectively. Additionally, compared to the parallel version of the latter, NEON-MS achieves an average speedup of 1.25.
{"title":"A Hybrid Vectorized Merge Sort on ARM NEON","authors":"Jincheng Zhou, Jin Zhang, Xiang Zhang, Tiaojie Xiao, Di Ma, Chunye Gong","doi":"arxiv-2409.03970","DOIUrl":"https://doi.org/arxiv-2409.03970","url":null,"abstract":"Sorting algorithms are the most extensively researched topics in computer\u0000science and serve for numerous practical applications. Although various sorts\u0000have been proposed for efficiency, different architectures offer distinct\u0000flavors to the implementation of parallel sorting. In this paper, we propose a\u0000hybrid vectorized merge sort on ARM NEON, named NEON Merge Sort for short\u0000(NEON-MS). In detail, according to the granted register functions, we first\u0000identify the optimal register number to avoid the register-to-memory access,\u0000due to the write-back of intermediate outcomes. More importantly, following the\u0000generic merge sort framework that primarily uses sorting network for column\u0000sort and merging networks for three types of vectorized merge, we further\u0000improve their structures for high efficiency in an unified asymmetry way: 1) it\u0000makes the optimal sorting networks with few comparators become possible; 2)\u0000hybrid implementation of both serial and vectorized merges incurs the pipeline\u0000with merge instructions highly interleaved. Experiments on a single FT2000+\u0000core show that NEON-MS is 3.8 and 2.1 times faster than std::sort and\u0000boost::block_sort, respectively, on average. Additionally, as compared to the\u0000parallel version of the latter, NEON-MS gains an average speedup of 1.25.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Wen, Quanyu Zhu, Weiwei Chu, Wen-Yen Chen, Jiyan Yang
Scaling up deep learning models has proven effective for improving the intelligence of machine learning (ML) models, especially for industry recommendation models and large language models. The co-design of distributed ML systems and algorithms (to maximize training performance) plays a pivotal role in this success. As models scale, the number of co-design hyper-parameters grows rapidly, making it challenging to find the optimal setup for system performance maximization. In this paper, we propose CubicML, which uses ML to automatically optimize the training performance of distributed ML systems. In CubicML, we use an ML model as a proxy to predict training performance, for search efficiency and performance-modeling flexibility. We demonstrate that CubicML can effectively optimize the training speed of in-house ads recommendation models and large language models at Meta.
{"title":"CubicML: Automated ML for Distributed ML Systems Co-design with ML Prediction of Performance","authors":"Wei Wen, Quanyu Zhu, Weiwei Chu, Wen-Yen Chen, Jiyan Yang","doi":"arxiv-2409.04585","DOIUrl":"https://doi.org/arxiv-2409.04585","url":null,"abstract":"Scaling up deep learning models has been proven effective to improve\u0000intelligence of machine learning (ML) models, especially for industry\u0000recommendation models and large language models. The co-design of distributed\u0000ML systems and algorithms (to maximize training performance) plays a pivotal\u0000role for its success. As it scales, the number of co-design hyper-parameters\u0000grows rapidly which brings challenges to feasibly find the optimal setup for\u0000system performance maximization. In this paper, we propose CubicML which uses\u0000ML to automatically optimize training performance of distributed ML systems. In\u0000CubicML, we use a ML model as a proxy to predict the training performance for\u0000search efficiency and performance modeling flexibility. We proved that CubicML\u0000can effectively optimize training speed of in-house ads recommendation models\u0000and large language models at Meta.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui
Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with a unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with a speedup ratio of up to 71% compared to state-of-the-art training systems.
{"title":"Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management","authors":"Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui","doi":"arxiv-2409.03365","DOIUrl":"https://doi.org/arxiv-2409.03365","url":null,"abstract":"Recent foundation models are capable of handling multiple machine learning\u0000(ML) tasks and multiple data modalities with the unified base model structure\u0000and several specialized model components. However, the development of such\u0000multi-task (MT) multi-modal (MM) models poses significant model management\u0000challenges to existing training systems. Due to the sophisticated model\u0000architecture and the heterogeneous workloads of different ML tasks and data\u0000modalities, training these models usually requires massive GPU resources and\u0000suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of\u0000large-scale MT MM models through data heterogeneity-aware model management\u0000optimization. The key idea is to decompose the model execution into stages and\u0000address the joint optimization problem sequentially, including both\u0000heterogeneity-aware workload parallelization and dependency-driven execution\u0000scheduling. Based on this, we build a prototype system and evaluate it on\u0000various large MT MM models. Experiments demonstrate the superior performance\u0000and efficiency of our system, with speedup ratio up to 71% compared to\u0000state-of-the-art training systems.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The well-studied red-blue pebble game models the execution of an arbitrary computational DAG by a single processor over a two-level memory hierarchy. We present a natural generalization to a multiprocessor setting where each processor has its own limited fast memory, and all processors share unlimited slow memory. To our knowledge, this is the first thorough study that combines pebbling and DAG scheduling problems, capturing the computation of general workloads on multiple processors with memory constraints and communication costs. Our pebbling model enables us to analyze trade-offs between workload balancing, communication and memory limitations, and it captures real-world factors such as superlinear speedups due to parallelization. Our results include upper and lower bounds on the pebbling cost, an analysis of a greedy pebbling strategy, and an extension of NP-hardness results for specific DAG classes from simpler models. For our main technical contribution, we show two inapproximability results that already hold for the long-standing problem of standard red-blue pebbling: (i) the optimal I/O cost cannot be approximated to any finite factor, and (ii) the optimal total cost (I/O+computation) can only be approximated to a limited constant factor, i.e., it does not allow for a polynomial-time approximation scheme. These results also carry over naturally to our multiprocessor pebbling model.
{"title":"Red-Blue Pebbling with Multiple Processors: Time, Communication and Memory Trade-offs","authors":"Toni Böhnlein, Pál András Papp, A. N. Yzelman","doi":"arxiv-2409.03898","DOIUrl":"https://doi.org/arxiv-2409.03898","url":null,"abstract":"The well-studied red-blue pebble game models the execution of an arbitrary\u0000computational DAG by a single processor over a two-level memory hierarchy. We\u0000present a natural generalization to a multiprocessor setting where each\u0000processor has its own limited fast memory, and all processors share unlimited\u0000slow memory. To our knowledge, this is the first thorough study that combines\u0000pebbling and DAG scheduling problems, capturing the computation of general\u0000workloads on multiple processors with memory constraints and communication\u0000costs. Our pebbling model enables us to analyze trade-offs between workload\u0000balancing, communication and memory limitations, and it captures real-world\u0000factors such as superlinear speedups due to parallelization. Our results include upper and lower bounds on the pebbling cost, an analysis\u0000of a greedy pebbling strategy, and an extension of NP-hardness results for\u0000specific DAG classes from simpler models. For our main technical contribution,\u0000we show two inapproximability results that already hold for the long-standing\u0000problem of standard red-blue pebbling: (i) the optimal I/O cost cannot be\u0000approximated to any finite factor, and (ii) the optimal total cost\u0000(I/O+computation) can only be approximated to a limited constant factor, i.e.,\u0000it does not allow for a polynomial-time approximation scheme. These results\u0000also carry over naturally to our multiprocessor pebbling model.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"80 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jayden Serenari, Sreekanth Sreekumar, Kaiwen Zhao, Saurabh Sarkar, Stephen Lee
Serverless computing is an emerging cloud computing abstraction wherein the cloud platform transparently manages all resources, including explicit resource provisioning and geographical load balancing when demand for a service spikes. Users provide code as functions, and the cloud platform runs these functions, handling all aspects of function execution. While prior work has primarily focused on optimizing performance, this paper focuses on reducing the carbon footprint of these systems by making variations in grid carbon intensity and intermittency from renewables transparent to the user. We introduce GreenWhisk, a carbon-aware serverless computing platform built upon Apache OpenWhisk, operating in two modes - grid-connected and grid-isolated - to address intermittency challenges arising from renewables and the grid's carbon footprint. Moreover, we develop carbon-aware load balancing algorithms that leverage energy and carbon information to reduce the carbon footprint. Our evaluation results show that GreenWhisk can easily incorporate carbon-aware algorithms, thereby reducing the carbon footprint of functions without significantly impacting the performance of function execution. In doing so, our system design enables the integration of new carbon-aware strategies into a serverless computing platform.
{"title":"GreenWhisk: Emission-Aware Computing for Serverless Platform","authors":"Jayden Serenari, Sreekanth Sreekumar, Kaiwen Zhao, Saurabh Sarkar, Stephen Lee","doi":"arxiv-2409.03029","DOIUrl":"https://doi.org/arxiv-2409.03029","url":null,"abstract":"Serverless computing is an emerging cloud computing abstraction wherein the\u0000cloud platform transparently manages all resources, including explicitly\u0000provisioning resources and geographical load balancing when the demand for\u0000service spikes. Users provide code as functions, and the cloud platform runs\u0000these functions handling all aspects of function execution. While prior work\u0000has primarily focused on optimizing performance, this paper focuses on reducing\u0000the carbon footprint of these systems making variations in grid carbon\u0000intensity and intermittency from renewables transparent to the user. We\u0000introduce GreenWhisk, a carbon-aware serverless computing platform built upon\u0000Apache OpenWhisk, operating in two modes - grid-connected and grid-isolated -\u0000addressing intermittency challenges arising from renewables and the grid's\u0000carbon footprint. Moreover, we develop carbon-aware load balancing algorithms\u0000that leverage energy and carbon information to reduce the carbon footprint. Our\u0000evaluation results show that GreenWhisk can easily incorporate carbon-aware\u0000algorithms, thereby reducing the carbon footprint of functions without\u0000significantly impacting the performance of function execution. In doing so, our\u0000system design enables the integration of new carbon-aware strategies into a\u0000serverless computing platform.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}