Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer (arXiv:2408.16978, 2024-08-30)
Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Large Language Models (LLMs) with long-context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long-context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose the Fully Pipelined Distributed Transformer (FPDT) for training long-context LLMs with extreme hardware efficiency. For GPT and Llama models, FPDT achieves a 16x increase in the sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence-chunk pipeline design, an 8B LLM can now be trained with a 2-million-token sequence length on only 4 GPUs, while also maintaining over 55% MFU. FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models.
{"title":"Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer","authors":"Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda","doi":"arxiv-2408.16978","DOIUrl":"https://doi.org/arxiv-2408.16978","url":null,"abstract":"Large Language Models (LLMs) with long context capabilities are integral to\u0000complex tasks in natural language processing and computational biology, such as\u0000text generation and protein sequence analysis. However, training LLMs directly\u0000on extremely long contexts demands considerable GPU resources and increased\u0000memory, leading to higher costs and greater complexity. Alternative approaches\u0000that introduce long context capabilities via downstream finetuning or\u0000adaptations impose significant design limitations. In this paper, we propose\u0000Fully Pipelined Distributed Transformer (FPDT) for efficiently training\u0000long-context LLMs with extreme hardware efficiency. For GPT and Llama models,\u0000we achieve a 16x increase in sequence length that can be trained on the same\u0000hardware compared to current state-of-the-art solutions. With our dedicated\u0000sequence chunk pipeline design, we can now train 8B LLM with 2 million sequence\u0000length on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed\u0000FPDT is agnostic to existing training techniques and is proven to work\u0000efficiently across different LLM models.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine (arXiv:2409.00287, 2024-08-30)
Zuoning Zhang, Dhruv Parikh, Youning Zhang, Viktor Prasanna
Transformer-based Large Language Models (LLMs) have recently reached state-of-the-art performance in the Natural Language Processing (NLP) and Computer Vision (CV) domains. LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to capture long-range global attention relationships among input words or image patches, drastically improving performance over prior deep learning approaches. In this paper, we evaluate the performance of LLMs on the Cerebras Wafer Scale Engine (WSE), a high-performance computing system with 2.6 trillion transistors, 850,000 cores, and 40 GB of on-chip memory. The WSE's Sparse Linear Algebra Compute (SLAC) cores eliminate multiply-by-zero operations, and its 40 GB of on-chip memory is uniformly distributed among the SLAC cores, enabling fast local access to model parameters. Moreover, Cerebras software configures routing between cores at runtime, optimizing communication overhead among cores. As LLMs become more widely used, new hardware architectures are needed to accelerate their training and inference. We benchmark the effectiveness of this hardware architecture at accelerating LLM training and inference. Additionally, we analyze whether the Cerebras WSE can scale the memory wall associated with traditionally memory-bound compute tasks using its 20 PB/s high-bandwidth memory. Furthermore, we examine the performance scalability of the Cerebras WSE through a roofline model: by plotting performance metrics against computational intensity, we assess its effectiveness at handling highly compute-intensive LLM training and inference tasks.
{"title":"Benchmarking the Performance of Large Language Models on the Cerebras Wafer Scale Engine","authors":"Zuoning Zhang, Dhruv Parikh, Youning Zhang, Viktor Prasanna","doi":"arxiv-2409.00287","DOIUrl":"https://doi.org/arxiv-2409.00287","url":null,"abstract":"Transformer based Large Language Models (LLMs) have recently reached state of\u0000the art performance in Natural Language Processing (NLP) and Computer Vision\u0000(CV) domains. LLMs use the Multi-Headed Self-Attention (MHSA) mechanism to\u0000capture long-range global attention relationships among input words or image\u0000patches, drastically improving its performance over prior deep learning\u0000approaches. In this paper, we evaluate the performance of LLMs on the Cerebras\u0000Wafer Scale Engine (WSE). Cerebras WSE is a high performance computing system\u0000with 2.6 trillion transistors, 850,000 cores and 40 GB on-chip memory. Cerebras\u0000WSE's Sparse Linear Algebra Compute (SLAC) cores eliminates multiply-by-zeros\u0000operations and its 40 GB of on-chip memory is uniformly distributed among SLAC\u0000cores, enabling fast local access to model parameters. Moreover, Cerebras\u0000software configures routing between cores at runtime, optimizing communication\u0000overhead among cores. As LLMs are becoming more commonly used, new hardware\u0000architectures are needed to accelerate LLMs training and inference. We\u0000benchmark the effectiveness of this hardware architecture at accelerating LLMs\u0000training and inference. Additionally, we analyze if Cerebras WSE can scale the\u0000memory-wall associated with traditionally memory-bound compute tasks using its\u000020 PB/s high bandwidth memory. Furthermore, we examine the performance\u0000scalability of Cerebras WSE through a roofline model. By plotting performance\u0000metrics against computational intensity, we aim to assess their effectiveness\u0000at handling high compute-intensive LLMs training and inference tasks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Monadring: A lightweight consensus protocol to offer Validation-as-a-Service to AVS nodes (arXiv:2408.16094, 2024-08-28)
Yu Zhang, Xiao Yan, Gang Tang, Helena Wang
Existing blockchain networks are often large-scale, requiring transactions to be synchronized across the entire network to reach consensus. On-chain computations can be prohibitively expensive, making many CPU-intensive computations infeasible. Inspired by the structure of IBM's Token Ring networks, we propose a lightweight consensus protocol called Monadring to address these issues. Monadring allows nodes within a large blockchain network to form smaller subnetworks, enabling faster and more cost-effective computations while maintaining the security guarantees of the main blockchain network. To further enhance Monadring's security, we introduce a node rotation mechanism based on Verifiable Random Functions (VRF) and blind voting using Fully Homomorphic Encryption (FHE) within the smaller subnetwork. Unlike the common voting-based election of validator nodes, Monadring leverages FHE to conceal voting information, eliminating the advantage of the last mover in the voting process. This paper details the design and implementation of the Monadring protocol and evaluates its performance and feasibility through simulation experiments. Our research contributes to enhancing the practical utility of blockchain technology in large-scale application scenarios.
{"title":"Monadring: A lightweight consensus protocol to offer Validation-as-a-Service to AVS nodes","authors":"Yu Zhang, Xiao Yan, Gang Tang, Helena Wang","doi":"arxiv-2408.16094","DOIUrl":"https://doi.org/arxiv-2408.16094","url":null,"abstract":"Existing blockchain networks are often large-scale, requiring transactions to\u0000be synchronized across the entire network to reach consensus. On-chain\u0000computations can be prohibitively expensive, making many CPU-intensive\u0000computations infeasible. Inspired by the structure of IBM's token ring\u0000networks, we propose a lightweight consensus protocol called Monadring to\u0000address these issues. Monadring allows nodes within a large blockchain network\u0000to form smaller subnetworks, enabling faster and more cost-effective\u0000computations while maintaining the security guarantees of the main blockchain\u0000network. To further enhance Monadring's security, we introduce a node rotation\u0000mechanism based on Verifiable Random Function (VRF) and blind voting using\u0000Fully Homomorphic Encryption (FHE) within the smaller subnetwork. Unlike the\u0000common voting-based election of validator nodes, Monadring leverages FHE to\u0000conceal voting information, eliminating the advantage of the last mover in the\u0000voting process. This paper details the design and implementation of the Monadring protocol\u0000and evaluates its performance and feasibility through simulation experiments.\u0000Our research contributes to enhancing the practical utility of blockchain\u0000technology in large-scale application scenarios.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

LLMSecCode: Evaluating Large Language Models for Secure Coding (arXiv:2408.16100, 2024-08-28)
Anton Rydén, Erik Näslund, Elad Michael Schiller, Magnus Almgren
The rapid deployment of Large Language Models (LLMs) requires careful consideration of their effect on cybersecurity. Our work aims to improve the selection of LLMs suitable for facilitating Secure Coding (SC). This raises challenging research questions: (RQ1) Which functionality can streamline the LLM evaluation? (RQ2) What should the evaluation measure? (RQ3) How can we attest that the evaluation process is impartial? To address these questions, we introduce LLMSecCode, an open-source evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments: varying parameters and prompts yields a 10% and 9% difference in performance, respectively, and comparisons against results from reliable external actors show a 5% difference. We strive to ensure the ease of use of our open-source framework and encourage further development by external actors. With LLMSecCode, we hope to encourage the standardization and benchmarking of LLMs' capabilities in security-oriented code and tasks.
{"title":"LLMSecCode: Evaluating Large Language Models for Secure Coding","authors":"Anton Rydén, Erik Näslund, Elad Michael Schiller, Magnus Almgren","doi":"arxiv-2408.16100","DOIUrl":"https://doi.org/arxiv-2408.16100","url":null,"abstract":"The rapid deployment of Large Language Models (LLMs) requires careful\u0000consideration of their effect on cybersecurity. Our work aims to improve the\u0000selection process of LLMs that are suitable for facilitating Secure Coding\u0000(SC). This raises challenging research questions, such as (RQ1) Which\u0000functionality can streamline the LLM evaluation? (RQ2) What should the\u0000evaluation measure? (RQ3) How to attest that the evaluation process is\u0000impartial? To address these questions, we introduce LLMSecCode, an open-source\u0000evaluation framework designed to assess LLM SC capabilities objectively. We validate the LLMSecCode implementation through experiments. When varying\u0000parameters and prompts, we find a 10% and 9% difference in performance,\u0000respectively. We also compare some results to reliable external actors, where\u0000our results show a 5% difference. We strive to ensure the ease of use of our open-source framework and\u0000encourage further development by external actors. With LLMSecCode, we hope to\u0000encourage the standardization and benchmarking of LLMs' capabilities in\u0000security-oriented code and tasks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Decentralized LLM Inference over Edge Networks with Energy Harvesting (arXiv:2408.15907, 2024-08-28)
Aria Khoshsirat, Giovanni Perin, Michele Rossi
Large language models have significantly transformed multiple fields with their exceptional performance in natural language tasks, but their deployment in resource-constrained environments like edge networks presents an ongoing challenge. Decentralized techniques for inference have emerged, distributing the model blocks among multiple devices to improve flexibility and cost effectiveness. However, energy limitations remain a significant concern for edge devices. We propose a sustainable model for collaborative inference on interconnected, battery-powered edge devices with energy harvesting. A semi-Markov model is developed to describe the states of the devices, considering processing parameters and average green energy arrivals. This informs the design of scheduling algorithms that aim to minimize device downtimes and maximize network throughput. Through empirical evaluations and simulated runs, we validate the effectiveness of our approach, paving the way for energy-efficient decentralized inference over edge networks.
{"title":"Decentralized LLM Inference over Edge Networks with Energy Harvesting","authors":"Aria Khoshsirat, Giovanni Perin, Michele Rossi","doi":"arxiv-2408.15907","DOIUrl":"https://doi.org/arxiv-2408.15907","url":null,"abstract":"Large language models have significantly transformed multiple fields with\u0000their exceptional performance in natural language tasks, but their deployment\u0000in resource-constrained environments like edge networks presents an ongoing\u0000challenge. Decentralized techniques for inference have emerged, distributing\u0000the model blocks among multiple devices to improve flexibility and cost\u0000effectiveness. However, energy limitations remain a significant concern for\u0000edge devices. We propose a sustainable model for collaborative inference on\u0000interconnected, battery-powered edge devices with energy harvesting. A\u0000semi-Markov model is developed to describe the states of the devices,\u0000considering processing parameters and average green energy arrivals. This\u0000informs the design of scheduling algorithms that aim to minimize device\u0000downtimes and maximize network throughput. Through empirical evaluations and\u0000simulated runs, we validate the effectiveness of our approach, paving the way\u0000for energy-efficient decentralized inference over edge networks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards cloud-native scientific workflow management (arXiv:2408.15445, 2024-08-27)
Michal Orzechowski, Bartosz Balis, Krzysztof Janecki
Cloud-native is an approach to building and running scalable applications in modern cloud infrastructures, with the Kubernetes container orchestration platform often considered a fundamental cloud-native building block. In this paper, we evaluate alternative execution models for scientific workflows in Kubernetes. We compare the simplest job-based model and its variant with task clustering, and we propose a cloud-native model based on microservices comprising auto-scalable worker pools. We implement the proposed models in the HyperFlow workflow management system and evaluate them using a large Montage workflow on a Kubernetes cluster. The results indicate that the proposed cloud-native worker-pools execution model achieves the best performance in terms of average cluster utilization, resulting in a nearly 20% improvement in workflow makespan compared to the best-performing job-based model. However, this better performance comes at the cost of significantly higher implementation and maintenance complexity. We believe that our experiments provide valuable insight into the performance, advantages, and disadvantages of alternative cloud-native execution models for scientific workflows.
{"title":"Towards cloud-native scientific workflow management","authors":"Michal Orzechowski, Bartosz Balis, Krzysztof Janecki","doi":"arxiv-2408.15445","DOIUrl":"https://doi.org/arxiv-2408.15445","url":null,"abstract":"Cloud-native is an approach to building and running scalable applications in\u0000modern cloud infrastructures, with the Kubernetes container orchestration\u0000platform being often considered as a fundamental cloud-native building block.\u0000In this paper, we evaluate alternative execution models for scientific\u0000workflows in Kubernetes. We compare the simplest job-based model, its variant\u0000with task clustering, and finally we propose a cloud-native model based on\u0000microservices comprising auto-scalable worker-pools. We implement the proposed\u0000models in the HyperFlow workflow management system, and evaluate them using a\u0000large Montage workflow on a Kubernetes cluster. The results indicate that the\u0000proposed cloud-native worker-pools execution model achieves best performance in\u0000terms of average cluster utilization, resulting in a nearly 20% improvement of\u0000the workflow makespan compared to the best-performing job-based model. However,\u0000better performance comes at the cost of significantly higher complexity of the\u0000implementation and maintenance. We believe that our experiments provide a\u0000valuable insight into the performance, advantages and disadvantages of\u0000alternative cloud-native execution models for scientific workflows.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning (arXiv:2408.14736, 2024-08-27)
Zichen Tang, Junlin Huang, Rudan Yan, Yuxin Wang, Zhenheng Tang, Shaohuai Shi, Amelie Chi Zhou, Xiaowen Chu
Current data compression methods, such as sparsification in Federated Averaging (FedAvg), effectively enhance the communication efficiency of Federated Learning (FL). However, these methods encounter challenges such as the straggler problem and diminished model performance due to heterogeneous bandwidth and non-IID (non-independently and identically distributed) data. To address these issues, we introduce a bandwidth-aware compression framework for FL, aimed at improving communication efficiency while mitigating the problems associated with non-IID data. First, our strategy dynamically adjusts compression ratios according to bandwidth, enabling clients to upload their models at a similar pace and thus exploiting otherwise wasted time to transmit more data. Second, we identify a non-overlap pattern among the parameters retained after compression, which diminishes client update signals when weights are uniformly averaged. Based on this finding, we propose a parameter mask that adjusts the client-averaging coefficients at the parameter level, thereby more closely approximating the original updates and improving training convergence in heterogeneous environments. Our evaluations reveal that our method significantly boosts model accuracy, with a maximum improvement of 13% over uncompressed FedAvg. Moreover, it achieves a 3.37x speedup in reaching the target accuracy compared to FedAvg with a Top-K compressor, demonstrating its effectiveness in accelerating convergence with compression. The integration of common compression techniques into our framework further establishes its potential as a versatile foundation for future cross-device, communication-efficient FL research, addressing critical challenges in FL and advancing the field of distributed machine learning.
{"title":"Bandwidth-Aware and Overlap-Weighted Compression for Communication-Efficient Federated Learning","authors":"Zichen Tang, Junlin Huang, Rudan Yan, Yuxin Wang, Zhenheng Tang, Shaohuai Shi, Amelie Chi Zhou, Xiaowen Chu","doi":"arxiv-2408.14736","DOIUrl":"https://doi.org/arxiv-2408.14736","url":null,"abstract":"Current data compression methods, such as sparsification in Federated\u0000Averaging (FedAvg), effectively enhance the communication efficiency of\u0000Federated Learning (FL). However, these methods encounter challenges such as\u0000the straggler problem and diminished model performance due to heterogeneous\u0000bandwidth and non-IID (Independently and Identically Distributed) data. To\u0000address these issues, we introduce a bandwidth-aware compression framework for\u0000FL, aimed at improving communication efficiency while mitigating the problems\u0000associated with non-IID data. First, our strategy dynamically adjusts\u0000compression ratios according to bandwidth, enabling clients to upload their\u0000models at a close pace, thus exploiting the otherwise wasted time to transmit\u0000more data. Second, we identify the non-overlapped pattern of retained\u0000parameters after compression, which results in diminished client update signals\u0000due to uniformly averaged weights. Based on this finding, we propose a\u0000parameter mask to adjust the client-averaging coefficients at the parameter\u0000level, thereby more closely approximating the original updates, and improving\u0000the training convergence under heterogeneous environments. Our evaluations\u0000reveal that our method significantly boosts model accuracy, with a maximum\u0000improvement of 13% over the uncompressed FedAvg. Moreover, it achieves a\u0000$3.37times$ speedup in reaching the target accuracy compared to FedAvg with a\u0000Top-K compressor, demonstrating its effectiveness in accelerating convergence\u0000with compression. The integration of common compression techniques into our\u0000framework further establishes its potential as a versatile foundation for\u0000future cross-device, communication-efficient FL research, addressing critical\u0000challenges in FL and advancing the field of distributed machine learning.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Faster Cycle Detection in the Congested Clique (arXiv:2408.15132, 2024-08-27)
Keren Censor-Hillel, Tomer Even, Virginia Vassilevska Williams
We provide a fast distributed algorithm for detecting $h$-cycles in the Congested Clique model, whose running time decreases as the number of $h$-cycles in the graph increases. In undirected graphs, constant-round algorithms are known for cycles of even length; our algorithm greatly improves upon the state of the art for odd values of $h$. Moreover, our running time also applies to directed graphs, in which case the improvement holds for all values of $h$. Further, our techniques yield a triangle detection algorithm in the quantum variant of this model that is faster than prior work. A key technical contribution behind our fast cycle detection algorithm is a new algorithm for computing the products of many pairs of small matrices in parallel, which may be of independent interest.
{"title":"Faster Cycle Detection in the Congested Clique","authors":"Keren Censor-Hillel, Tomer Even, Virginia Vassilevska Williams","doi":"arxiv-2408.15132","DOIUrl":"https://doi.org/arxiv-2408.15132","url":null,"abstract":"We provide a fast distributed algorithm for detecting $h$-cycles in the\u0000textsf{Congested Clique} model, whose running time decreases as the number of\u0000$h$-cycles in the graph increases. In undirected graphs, constant-round\u0000algorithms are known for cycles of even length. Our algorithm greatly improves\u0000upon the state of the art for odd values of $h$. Moreover, our running time\u0000applies also to directed graphs, in which case the improvement is for all\u0000values of $h$. Further, our techniques allow us to obtain a triangle detection\u0000algorithm in the quantum variant of this model, which is faster than prior\u0000work. A key technical contribution we develop to obtain our fast cycle detection\u0000algorithm is a new algorithm for computing the product of many pairs of small\u0000matrices in parallel, which may be of independent interest.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Towards observability of scientific applications (arXiv:2408.15439, 2024-08-27)
Bartosz Balis, Konrad Czerepak, Albert Kuzma, Jan Meizner, Lukasz Wronski
As software systems increase in complexity, conventional monitoring methods struggle to provide a comprehensive overview or identify performance issues, often missing unexpected problems. Observability, however, offers a holistic approach, providing methods and tools that gather and analyze detailed telemetry data to uncover hidden issues. Originally developed for cloud-native systems, modern observability is less prevalent in scientific computing, particularly in HPC clusters, due to differences in application architecture, execution environments, and technology stacks. This paper proposes and evaluates an end-to-end observability solution tailored for scientific computing in HPC environments. We address several challenges, including collection of application-level metrics, instrumentation, context propagation, and tracing. We argue that typical dashboards with charts are not sufficient for advanced observability-driven analysis of scientific applications. Consequently, we propose a different approach based on data analysis using DataFrames and a Jupyter environment. The proposed solution is implemented and evaluated on two medical scientific pipelines running on an HPC cluster.
{"title":"Towards observability of scientific applications","authors":"Bartosz Balis, Konrad Czerepak, Albert Kuzma, Jan Meizner, Lukasz Wronski","doi":"arxiv-2408.15439","DOIUrl":"https://doi.org/arxiv-2408.15439","url":null,"abstract":"As software systems increase in complexity, conventional monitoring methods\u0000struggle to provide a comprehensive overview or identify performance issues,\u0000often missing unexpected problems. Observability, however, offers a holistic\u0000approach, providing methods and tools that gather and analyze detailed\u0000telemetry data to uncover hidden issues. Originally developed for cloud-native\u0000systems, modern observability is less prevalent in scientific computing,\u0000particularly in HPC clusters, due to differences in application architecture,\u0000execution environments, and technology stacks. This paper proposes and\u0000evaluates an end-to-end observability solution tailored for scientific\u0000computing in HPC environments. We address several challenges, including\u0000collection of application-level metrics, instrumentation, context propagation,\u0000and tracing. We argue that typical dashboards with charts are not sufficient\u0000for advanced observability-driven analysis of scientific applications.\u0000Consequently, we propose a different approach based on data analysis using\u0000DataFrames and a Jupyter environment. The proposed solution is implemented and\u0000evaluated on two medical scientific pipelines running on an HPC cluster.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"177 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Partition Detection in Byzantine Networks (arXiv:2408.14814, 2024-08-27)
Yérom-David Bromberg (IRISA, UR), Jérémie Decouchant (TU Delft), Manon Sourisseau (IRISA, UR), François Taïani (IRISA, UR)
Detecting and handling network partitions is a fundamental requirement of distributed systems. Although existing partition detection methods in arbitrary graphs tolerate unreliable networks, they either assume that all nodes are correct or that a limited number of nodes might crash. In particular, Byzantine behaviors are out of the scope of these algorithms, despite Byzantine fault tolerance being an active research topic for important problems such as consensus. Moreover, Byzantine-tolerant protocols, such as broadcast or consensus, always rely on the assumption of connected networks. This paper addresses the problem of detecting partitions in Byzantine networks (without connectivity assumptions). We present a novel algorithm, which we call NECTAR, that safely detects partitioned and possibly partitionable networks, and we prove its correctness. NECTAR allows all correct nodes to detect whether a network could suffer from Byzantine nodes. We evaluate NECTAR's performance and compare it to two existing baselines using up to 100 nodes running real code on various realistic topologies. Our results confirm that NECTAR maintains 100% accuracy, while the accuracy of the existing baselines decreases by at least 40% as soon as one participant is Byzantine. Although NECTAR's network cost increases with the number of nodes and decreases with the network's diameter, it remains below roughly 500 KB even in the worst cases.
{"title":"Partition Detection in Byzantine Networks","authors":"Yérom-David BrombergIRISA, UR, Jérémie DecouchantTU Delft, Manon SourisseauIRISA, UR, François TaïaniIRISA, UR","doi":"arxiv-2408.14814","DOIUrl":"https://doi.org/arxiv-2408.14814","url":null,"abstract":"Detecting and handling network partitions is a fundamental requirement of\u0000distributed systems. Although existing partition detection methods in arbitrary\u0000graphs tolerate unreliable networks, they either assume that all nodes are\u0000correct or that a limited number of nodes might crash. In particular, Byzantine\u0000behaviors are out of the scope of these algorithms despite Byzantine fault\u0000tolerance being an active research topic for important problems such as\u0000consensus. Moreover, Byzantinetolerant protocols, such as broadcast or\u0000consensus, always rely on the assumption of connected networks. This paper\u0000addresses the problem of detecting partition in Byzantine networks (without\u0000connectivity assumption). We present a novel algorithm, which we call NECTAR,\u0000that safely detects partitioned and possibly partitionable networks and prove\u0000its correctness. NECTAR allows all correct nodes to detect whether a network\u0000could suffer from Byzantine nodes. We evaluate NECTAR's performance and compare\u0000it to two existing baselines using up to 100 nodes running real code, on\u0000various realistic topologies. Our results confirm that NECTAR maintains a 100%\u0000accuracy while the accuracy of the various existing baselines decreases by at\u0000least 40% as soon as one participant is Byzantine. Although NECTAR's network\u0000cost increases with the number of nodes and decreases with the network's\u0000diameter, it does not go above around 500KB in the worst cases.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}