Lang Xu, Quentin Anthony, Qinghua Zhou, Nawras Alnaasan, Radha R. Gulhane, Aamir Shafi, Hari Subramoni, Dhabaleswar K. Panda
Data Parallelism (DP), Tensor Parallelism (TP), and Pipeline Parallelism (PP) are the three strategies widely adopted to enable fast and efficient Large Language Model (LLM) training. However, these approaches rely on data-intensive communication routines to collect, aggregate, and redistribute gradients, activations, and other important model information, which introduces significant overhead. Co-designed with GPU-based compression libraries, MPI libraries have been proven to reduce message sizes significantly and better leverage interconnect bandwidth, increasing training efficiency while maintaining acceptable accuracy. In this work, we investigate the efficacy of compression-assisted MPI collectives in the context of distributed LLM training with 3D parallelism and ZeRO optimizations, scaling up to 192 V100 GPUs on the Lassen supercomputer. First, we enabled a naïve compression scheme across all collectives and observed a 22.5% increase in TFLOPS per GPU and a 23.6% increase in samples per second for GPT-NeoX-20B training. However, this strategy ignores the sparsity discrepancy among the messages communicated in each parallelism degree, introducing more error and degrading the training loss. We therefore incorporated hybrid compression settings for each parallel dimension, adjusting the compression intensity accordingly. Given their low-rank structure (arXiv:2301.02654), we apply aggressive compression on gradients during the DP All-reduce, and milder compression to preserve precision when communicating activations, optimizer states, and model parameters in TP and PP. With this hybrid compression scheme, we demonstrate a 17.3% increase in TFLOPS per GPU and a 12.7% increase in samples per second while matching baseline loss convergence.
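A minimal sketch of the hybrid scheme's core idea, picking a different compression intensity per parallel dimension. The `CompressionConfig` class, the algorithm label, and the error-bound values are hypothetical placeholders, not the paper's actual settings:

```python
from dataclasses import dataclass

@dataclass
class CompressionConfig:
    algorithm: str      # label for a GPU-side compressor (illustrative)
    error_bound: float  # larger bound = more aggressive compression

def select_compression(parallel_dim: str) -> CompressionConfig:
    """Pick a compression intensity based on which parallel dimension a
    message belongs to: aggressive for the DP gradient All-reduce (gradients
    are low-rank and tolerate more error), milder for TP/PP traffic that
    carries activations, optimizer states, and parameters."""
    if parallel_dim == "DP":
        return CompressionConfig("fixed-rate", error_bound=1e-2)
    elif parallel_dim in ("TP", "PP"):
        return CompressionConfig("fixed-rate", error_bound=1e-4)
    raise ValueError(f"unknown parallel dimension: {parallel_dim}")
```

A runtime built on this idea would consult `select_compression` once per collective call, before handing the buffer to the compression-assisted MPI library.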
"Accelerating Large Language Model Training with Hybrid GPU-based Compression", arXiv:2409.02423, arXiv - CS - Distributed, Parallel, and Cluster Computing, published 2024-09-04.
Computation offloading with low latency and low energy consumption is crucial for resource-limited mobile devices. This paper proposes an offloading decision-making model using federated learning. Based on the task type and the user input, the model predicts whether the task is computationally intensive. If it is, the model then uses network parameters to predict whether to offload the task or execute it locally, and the task is handled accordingly. The proposed method is implemented in a real-time environment, and experimental results show that it achieves above 90% prediction accuracy in offloading decision-making. The results also show that the proposed offloading method reduces the response time and energy consumption of the user device by ~11-31% for computationally intensive tasks. A partial computation offloading method for federated learning is also proposed and implemented, in which devices that are unable to analyse a large number of data samples offload part of their local datasets to the edge server. Cryptography is used for secure data transmission; experiments show that encryption and decryption increase the total time by only 0.05-0.16%. The proposed partial computation offloading method for federated learning achieves a prediction accuracy above 98% for the global model.
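The two-stage decision described above can be sketched as a simple rule. In the paper both predictions come from a federated-learning model; the thresholds and parameter names here are purely illustrative:

```python
def decide_offload(is_compute_intensive: bool,
                   bandwidth_mbps: float,
                   latency_ms: float) -> str:
    """Two-stage offloading decision sketch: lightweight tasks always run
    locally; computationally intensive tasks are offloaded only when the
    network is good enough. Thresholds are illustrative placeholders, not
    values from the paper (which learns the decision instead)."""
    if not is_compute_intensive:          # stage 1: task-type prediction
        return "local"
    if bandwidth_mbps >= 50 and latency_ms <= 20:  # stage 2: network check
        return "offload"
    return "local"
```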
"A Joint Time and Energy-Efficient Federated Learning-based Computation Offloading Method for Mobile Edge Computing" by Anwesha Mukherjee and Rajkumar Buyya, arXiv:2409.02548, published 2024-09-04.
Victor Jarlow, Charalampos Stylianopoulos, Marina Papatriantafilou
The frequent elements problem, a key component in demanding stream-data analytics, involves selecting elements whose occurrence exceeds a user-specified threshold. Fast, memory-efficient $\epsilon$-approximate synopsis algorithms select all frequent elements but may overestimate their counts depending on the user-defined parameter $\epsilon$. Evolving applications demand performance only achievable through parallelization; however, algorithmic guarantees concerning concurrent updates and queries have been overlooked. We propose Query and Parallelism Optimized Space-Saving (QPOPSS), which provides concurrency guarantees. The design includes an implementation of the Space-Saving algorithm supporting fast queries with minimal overlap with concurrent updates. QPOPSS integrates this with distribution of work and fine-grained synchronization among threads, balancing high throughput, high accuracy, and low memory consumption. Our analysis, under various concurrency and data distribution conditions, establishes space and approximation bounds. Our empirical evaluation against representative state-of-the-art methods reveals that QPOPSS's multi-threaded throughput scales linearly while maintaining the highest accuracy, with an orders-of-magnitude smaller memory footprint.
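For context, the sequential Space-Saving synopsis that QPOPSS parallelizes can be sketched in a few lines. This is the textbook algorithm, not QPOPSS's concurrent implementation:

```python
class SpaceSaving:
    """Sequential Space-Saving synopsis: tracks at most k counters; when a
    new element arrives and the table is full, the minimum counter is
    evicted and its count inherited. This bounds overestimation by n/k,
    i.e. epsilon * n for k = 1/epsilon after n updates."""

    def __init__(self, k: int):
        self.k = k
        self.counters: dict = {}

    def update(self, item) -> None:
        if item in self.counters:
            self.counters[item] += 1
        elif len(self.counters) < self.k:
            self.counters[item] = 1
        else:
            # Evict the minimum counter; the newcomer inherits its count.
            victim = min(self.counters, key=self.counters.get)
            count = self.counters.pop(victim)
            self.counters[item] = count + 1

    def query(self, threshold: int):
        """Return all tracked elements whose (over)estimate meets the
        threshold; guaranteed to include every truly frequent element."""
        return [x for x, c in self.counters.items() if c >= threshold]
```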
"QPOPSS: Query and Parallelism Optimized Space-Saving for Finding Frequent Stream Elements", arXiv:2409.01749, published 2024-09-03.
Marina Moran, Javier Balladini, Dolores Rexachs, Emilio Luque
The fault tolerance method currently used in High Performance Computing (HPC) is rollback-recovery based on checkpoints. Like any other fault tolerance method, it adds energy consumption on top of that of the application's execution. The objective of this work is to determine the factors that affect the energy consumption of compute nodes in a homogeneous cluster when performing checkpoint and restart operations on SPMD (Single Program Multiple Data) applications. We focus on the energy behavior of compute nodes under different configurations of hardware and software parameters, studying the effect of processor performance states (P-states) and power states (C-states), application problem size, checkpoint software (DMTCP), and distributed file system (NFS) configuration. The analysis of the results identifies opportunities to reduce the energy consumption of checkpoint and restart operations.
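As a rough illustration of the trade-off being measured: checkpoint energy is average node power times checkpoint duration, so a lower P-state that reduces power but lengthens the checkpoint does not necessarily save energy. The power and duration figures below are hypothetical, not measurements from the paper:

```python
def checkpoint_energy(avg_power_watts: float, duration_s: float) -> float:
    """Energy of one checkpoint operation in joules, modeled as average
    node power times duration. Illustrates the accounting used when
    comparing P-state/C-state configurations; inputs are hypothetical."""
    return avg_power_watts * duration_s

# A lower-frequency P-state draws less power but can stretch the
# checkpoint enough that total energy goes up:
high_freq = checkpoint_energy(220.0, 30.0)  # faster checkpoint, more power
low_freq = checkpoint_energy(150.0, 50.0)   # slower checkpoint, less power
```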
"Checkpoint and Restart: An Energy Consumption Characterization in Clusters", arXiv:2409.02214, published 2024-09-03.
Large Language Models (LLMs) have revolutionized natural language processing by achieving state-of-the-art results across a variety of tasks. However, the computational demands of LLM inference, including high memory consumption and slow processing speeds, pose significant challenges for real-world applications, particularly on resource-constrained devices. Efficient inference is crucial for scaling the deployment of LLMs to a broader range of platforms, including mobile and edge devices. This survey explores contemporary techniques in model compression that address these challenges by reducing the size and computational requirements of LLMs while maintaining their performance. We focus on model-level compression methods, including quantization, knowledge distillation, and pruning, as well as system-level optimizations such as efficient KV-cache design. Each of these methodologies offers a unique approach to optimizing LLMs, from reducing numerical precision to transferring knowledge between models and structurally simplifying neural networks. Additionally, we discuss emerging trends in system-level design that further enhance the efficiency of LLM inference. This survey aims to provide a comprehensive overview of current advancements in model compression and their potential to make LLMs more accessible and practical for diverse applications.
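As a concrete example of one of the surveyed model-level methods, symmetric per-tensor int8 quantization can be sketched as follows. This is a minimal pure-Python illustration of the technique, not any specific library's API:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]
    using a single scale derived from the largest magnitude, so each weight
    is stored in 8 bits instead of 32."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is at most ~scale/2."""
    return [v * scale for v in q]
```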
"Contemporary Model Compression on Large Language Models Inference" by Dong Liu, arXiv:2409.01990, published 2024-09-03.
Niousha Nazemi, Omid Tavallaie, Shuaijun Chen, Anna Maria Mandalario, Kanchana Thilakarathna, Ralph Holz, Hamed Haddadi, Albert Y. Zomaya
Federated Learning (FL) is a promising distributed learning framework designed for privacy-aware applications. FL trains models on client devices without sharing the clients' data and generates a global model on a server by aggregating model updates. Traditional FL approaches risk exposing sensitive client data when plain model updates are transmitted to the server, making them vulnerable to security threats such as model inversion attacks, where the server can infer a client's original training data by monitoring the changes of the trained model across rounds. Google's Secure Aggregation (SecAgg) protocol addresses this threat by employing a double-masking technique, secret sharing, and cryptographic computations in honest-but-curious and adversarial scenarios with client dropouts. However, in scenarios without an active adversary, the computational and communication cost of SecAgg increases significantly as the number of clients grows. To address this issue, we propose ACCESS-FL, a communication- and computation-efficient secure aggregation method designed for honest-but-curious scenarios in stable FL networks with a limited rate of client dropout. ACCESS-FL reduces the computation/communication cost to a constant level (independent of the network size) by generating shared secrets between only two clients and eliminating the need for double masking, secret sharing, and cryptographic computations. We evaluate ACCESS-FL with experiments on the MNIST, FMNIST, and CIFAR datasets. The results demonstrate that our method significantly reduces computation and communication overhead compared to the state-of-the-art methods SecAgg and SecAgg+.
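The pairwise-secret idea can be illustrated with a toy masking scheme in which a pair of clients shares one seed and the resulting masks cancel in the server's sum. This is a sketch of the general cancellation principle, not the ACCESS-FL protocol itself; the modulus, seeds, and function names are illustrative:

```python
import random

P = 2**31 - 1  # illustrative prime modulus for the update field

def masked_update(update, peer_seeds, client_id):
    """Add/subtract a pseudorandom mask per peer, derived from a shared
    seed. The lower-id client of each pair adds the mask and the higher-id
    client subtracts it, so every mask cancels in the server-side sum and
    the server only learns the aggregate."""
    masked = update % P
    for peer_id, seed in peer_seeds.items():
        mask = random.Random(seed).randrange(P)
        if client_id < peer_id:
            masked = (masked + mask) % P
        else:
            masked = (masked - mask) % P
    return masked

# Two clients sharing seed 1234: individually masked, jointly recoverable.
m1 = masked_update(42, {2: 1234}, client_id=1)
m2 = masked_update(58, {1: 1234}, client_id=2)
```

Summing `m1 + m2` modulo `P` recovers `42 + 58 = 100` even though neither masked value reveals its plain update.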
"ACCESS-FL: Agile Communication and Computation for Efficient Secure Aggregation in Stable Federated Learning Networks", arXiv:2409.01722, published 2024-09-03.
Quentin Kniep, Maxime Laval, Jakub Sliwinski, Roger Wattenhofer
This work examines the resilience properties of the Snowball and Avalanche protocols that underlie the popular Avalanche blockchain. We experimentally quantify the resilience of Snowball using a simulation implemented in Rust, where the adversary strategically rebalances the network to delay termination. We show that in a network of $n$ nodes of equal stake, the adversary is able to break liveness when controlling $\Omega(\sqrt{n})$ nodes. Specifically, for $n = 2000$, a simple adversary controlling 5.2% of the stake can successfully attack liveness. When the adversary is given additional information about the state of the network (without any communication or other advantages), the stake needed for a successful attack is as little as 2.8%. We show that the adversary can break safety in time exponentially dependent on its stake and inversely linearly related to the size of the network, e.g. in 265 rounds in expectation when the adversary controls 25% of the stake in a network of 3000 nodes. We conclude that Snowball and Avalanche are akin to Byzantine reliable broadcast protocols rather than consensus protocols.
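For readers unfamiliar with the protocol family, a simplified single-node Snowball loop looks roughly like the following. Here `k`, `alpha`, and `beta` are the usual sample size, quorum, and confidence thresholds; the sketch omits the networking and stake weighting of the real protocol, and `sample_peer` is a caller-supplied stand-in for querying a random peer:

```python
from collections import Counter

def snowball(my_color, sample_peer, k=10, alpha=7, beta=15):
    """Simplified Snowball: repeatedly sample k peers; if at least alpha
    agree on a color, bump that color's confidence and possibly switch
    preference; finalize after beta consecutive agreeing rounds."""
    confidence = Counter()
    preference = my_color
    last = None
    streak = 0
    while streak < beta:
        votes = Counter(sample_peer() for _ in range(k))
        color, n = votes.most_common(1)[0]
        if n >= alpha:
            confidence[color] += 1
            if confidence[color] > confidence[preference]:
                preference = color
            if color == last:
                streak += 1
            else:
                last, streak = color, 1
        else:
            streak = 0  # no quorum this round; reset the streak
    return preference
```

The adversarial strategy studied in the paper works by keeping the network balanced so that the `streak < beta` condition keeps failing, delaying finalization.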
"Quantifying Liveness and Safety of Avalanche's Snowball", arXiv:2409.02217, published 2024-09-03.
Guanzhou Hu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
We present a summary of non-transactional consistency levels in the context of distributed data replication protocols. The levels are built upon a practical object pool model and are defined in a unified framework centered around the concept of ordering. We show that each consistency level can be intuitively defined by specifying two types of constraints that determine the validity of the orderings allowed by the level: convergence, which bounds the lineage shape of the ordering, and relationship, which bounds the relative positions of operations in the ordering. We give examples of representative protocols and systems that implement each consistency level. Furthermore, we discuss the availability upper bound of the presented consistency levels.
"A Unified, Practical, and Understandable Summary of Non-transactional Consistency Levels in Distributed Replication", arXiv:2409.01576, published 2024-09-03.
This work introduces ECOLIFE, the first carbon-aware serverless function scheduler to co-optimize carbon footprint and performance. ECOLIFE builds on the key insight that intelligently exploiting multi-generation hardware can achieve high performance with a lower carbon footprint. It introduces multiple novel extensions to Particle Swarm Optimization (PSO) in the context of serverless execution environments to achieve high performance while effectively reducing the carbon footprint.
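For reference, the base PSO loop that ECOLIFE extends can be sketched as follows; ECOLIFE's serverless-specific extensions are not reproduced here, and the hyperparameters are conventional defaults rather than the paper's:

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Plain Particle Swarm Optimization on a continuous objective f.
    Each particle's velocity blends inertia (w), attraction to its own
    best position (c1), and attraction to the swarm's best (c2)."""
    pos = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

ECOLIFE applies this kind of search to the joint space of hardware generation and keep-alive decisions, where each candidate position scores a carbon/performance trade-off instead of a test function.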
"EcoLife: Carbon-Aware Serverless Function Scheduling for Sustainable Computing" by Yankai Jiang, Rohan Basu Roy, Baolin Li, and Devesh Tiwari, arXiv:2409.02085, published 2024-09-03.
Jiajie Li, Jan-Niklas Schmelzle, Yixiao Du, Simon Heumos, Andrea Guarracino, Giulia Guidi, Pjotr Prins, Erik Garrison, Zhiru Zhang
Computational Pangenomics is an emerging field that studies genetic variation using a graph structure encompassing multiple genomes. Visualizing pangenome graphs is vital for understanding genome diversity. Yet, handling large graphs can be challenging due to the high computational demands of the graph layout process. In this work, we conduct a thorough performance characterization of a state-of-the-art pangenome graph layout algorithm, revealing significant data-level parallelism, which makes GPUs a promising option for compute acceleration. However, irregular data access and the algorithm's memory-bound nature present significant hurdles. To overcome these challenges, we develop a solution implementing three key optimizations: a cache-friendly data layout, coalesced random states, and warp merging. Additionally, we propose a quantitative metric for scalable evaluation of pangenome layout quality. Evaluated on 24 human whole-chromosome pangenomes, our GPU-based solution achieves a 57.3x speedup over the state-of-the-art multithreaded CPU baseline without layout quality loss, reducing execution time from hours to minutes.
"Rapid GPU-Based Pangenome Graph Layout", arXiv:2409.00876, published 2024-09-02.