Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization
Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Cong Guo, Siyuan Feng, Jintao Meng, Haidong Lan, Jingwen Leng, Wenxi Zhu, Minwen Deng
arXiv:2409.01075 (2024-09-02)

Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting attention for their ability to handle variable input sizes in real-time applications. However, existing compilation optimization methods for such networks often rely heavily on predefined samples to guide the compilation process, which restricts their adaptability and efficiency. These sample-driven methods struggle to efficiently manage the diverse and unpredictable shapes encountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. Vortex capitalizes on detailed hardware information and hierarchizes the strategy space to facilitate high-performance code generation without relying on runtime shape samples. It features a unique bidirectional compilation workflow, combining top-down abstraction, which aligns tensor program execution with hardware hierarchies, and bottom-up kernel construction, which narrows the search space, enabling Vortex to achieve remarkable efficiency. Comprehensive evaluations confirm that Vortex reduces compilation time by $176\times$ compared to the existing dynamic-shape compiler. Additionally, it substantially outperforms existing vendor-provided libraries and dynamic-shape compilers on both CPU and GPU platforms, delivering speedups of $2.53\times$ and $3.01\times$, respectively.
How local constraints influence network diameter and applications to LCL generalizations
Nicolas Bousquet, Laurent Feuilloley, Théo Pierron
arXiv:2409.01305 (2024-09-02)

In this paper, we investigate how local rules enforced at every node can influence the topology of a network. More precisely, we establish several results on the diameter of trees as a function of the number of nodes, as listed below. These results have important consequences for the landscape of locally checkable labelings (LCL) on unbounded-degree graphs, a case in which our lack of knowledge is in striking contrast with that of bounded-degree graphs, which have been intensively studied recently. [See paper for full abstract.]
LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang
arXiv:2409.00918 (2024-09-02)

The recent progress made in large language models (LLMs) has brought tremendous application prospects to the world. Growing model sizes demand LLM training on multiple GPUs, and data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt model-sharded data parallelism to enable memory-efficient training; however, existing model-sharded data-parallel systems fail to efficiently utilize GPUs on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth due to (1) severe interference between collective operations and GPU computation and (2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve the network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to their network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of a 100B-scale model on distributed GPUs. This new data-parallel paradigm keeps a communication pattern similar to model-sharded data parallelism, but with centralized in-network optimizer execution. The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node, and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. The experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training a 175B model on an 8-worker cluster.
FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan
arXiv:2409.01143 (2024-09-02)

Training large language models (LLMs) is a computationally intensive task, typically conducted in data centers with homogeneous high-performance GPUs. This paper explores an alternative approach: deploying the training computation across heterogeneous GPUs to enable better flexibility and efficiency in heterogeneous resource utilization. To achieve this goal, we propose a novel system, FlashFlex, that can flexibly support an asymmetric partition of the parallel training computations across the scope of data, pipeline, and tensor model parallelism. We further formalize the allocation of asymmetrically partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient solution based on a hierarchical graph partitioning algorithm. Our approach adaptively allocates asymmetric training computations across GPUs, fully leveraging the available computational power. We conduct extensive empirical studies to evaluate the performance of FlashFlex, finding that when training LLMs at different scales (from 7B to 30B), FlashFlex achieves training MFU over a set of heterogeneous GPUs comparable to that of state-of-the-art training systems running over a set of homogeneous high-performance GPUs with the same total peak FLOPS. The smallest achieved MFU gaps are 11.61% and 0.30%, depending on whether the homogeneous setting is equipped with RDMA. Our implementation is available at https://github.com/Relaxed-System-Lab/FlashFlex.
Federated Aggregation of Mallows Rankings: A Comparative Analysis of Borda and Lehmer Coding
Jin Sima, Vishal Rana, Olgica Milenkovic
arXiv:2409.00848 (2024-09-01)

Rank aggregation combines multiple ranked lists into a consensus ranking. In fields like biomedical data sharing, rankings may be distributed and require privacy. This motivates the need for federated rank aggregation protocols, which support distributed, private, and communication-efficient learning across multiple clients with local data. We present the first known federated rank aggregation methods using Borda scoring and Lehmer codes, focusing on the sample complexity of federated algorithms on Mallows distributions with a known scaling factor $\phi$ and an unknown centroid permutation $\sigma_0$. The federated Borda approach involves local client scoring, nontrivial quantization, and privacy-preserving protocols. We show that for $\phi \in [0,1)$ and arbitrary $\sigma_0$ of length $N$, it suffices for each of the $L$ clients to locally aggregate $\max\{C_1(\phi), C_2(\phi)\frac{1}{L}\log \frac{N}{\delta}\}$ rankings, where $C_1(\phi)$ and $C_2(\phi)$ are constants, quantize the result, and send it to the server, which can then recover $\sigma_0$ with probability $\geq 1-\delta$. The communication complexity scales as $NL \log N$. Our results represent the first rigorous analysis of Borda's method in centralized and distributed settings under the Mallows model. The federated Lehmer coding approach creates a local Lehmer code for each client, using a coordinate-majority aggregation approach with specialized quantization methods for efficiency and privacy. We show that for $\phi+\phi^2<1+\phi^N$ and arbitrary $\sigma_0$ of length $N$, it suffices for each of the $L$ clients to locally aggregate $\max\{C_3(\phi), C_4(\phi)\frac{1}{L}\log \frac{N}{\delta}\}$ rankings, where $C_3(\phi)$ and $C_4(\phi)$ are constants. Clients send truncated Lehmer coordinate histograms to the server, which can recover $\sigma_0$ with probability $\geq 1-\delta$. The communication complexity is $\sim O(N\log N L\log L)$.
RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks
Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding
arXiv:2409.00822 (2024-09-01)

Top-k algorithms are essential in various applications, from high-performance computing and information retrieval to big data and neural network model training. This paper introduces RTop-K, a highly efficient parallel row-wise top-k selection algorithm designed for GPUs. RTop-K employs a binary-search-based approach to optimize resource allocation and provides a scalable solution that significantly accelerates top-k operations. We perform a theoretical analysis of the effects of early stopping in our algorithm, demonstrating that it maintains the accuracy of neural network models while enhancing performance. Comprehensive tests show that our GPU implementation of RTop-K outperforms other row-wise top-k GPU implementations, with minimal impact on testing accuracy when early stopping is applied. Notably, RTop-K achieves speedups ranging from $4.245\times$ to $9.506\times$ with early stopping, and $3.936\times$ without early stopping, compared to state-of-the-art implementations. The proposed methods offer significant improvements in the training and inference of graph neural networks (GNNs), addressing critical challenges in latency and throughput on GPU platforms.
Container Data Item: An Abstract Datatype for Efficient Container-based Edge Computing
Md Rezwanur Rahman, Tarun Annapareddy, Shirin Ebadi, Varsha Natarajan, Adarsh Srinivasan, Eric Keller, Shivakant Mishra
arXiv:2409.00801 (2024-09-01)

We present Container Data Item (CDI), an abstract datatype that allows multiple containers to efficiently operate on a common data item while preserving their strong security and isolation semantics. Application developers can use CDIs to enable multiple containers to operate on the same data, synchronize execution among themselves, and control the ownership of the shared data item during runtime. These containers may reside on the same server or on different servers. CDI is designed to support microservice-based applications comprising a set of interconnected microservices, each implemented by a separate dedicated container. CDI preserves the important isolation semantics of containers by ensuring that exactly one container owns a CDI object at any instant, and that ownership of a CDI object may be transferred from one container to another only by the current owner. We present three different implementations of CDI that allow containers residing on the same server, as well as containers residing on different servers, to use CDI to operate efficiently on a common data item. The paper provides an extensive performance evaluation of CDI along with two representative applications: an augmented reality application and a decentralized workflow orchestrator.
HopGNN: Boosting Distributed GNN Training Efficiency via Feature-Centric Model Migration
Weijian Chen, Shuibing He, Haoyang Qu, Xuechen Zhang, Dan Feng
arXiv:2409.00657 (2024-09-01)

Distributed training of graph neural networks (GNNs) has become a crucial technique for processing large graphs. Prevalent GNN frameworks are model-centric, necessitating the transfer of massive graph vertex features to GNN models, which leads to a significant communication bottleneck. Recognizing that the model size is often significantly smaller than the feature size, we propose LeapGNN, a feature-centric framework that reverses this paradigm by bringing GNN models to vertex features. To make this truly effective, we first propose a micrograph-based training strategy that trains the model using a refined structure with superior locality to reduce remote feature retrieval. Then, we devise a feature pre-gathering approach that merges multiple fetch operations into a single one to eliminate redundant feature transmissions. Finally, we employ a micrograph-based merging method that adjusts the number of micrographs for each worker to minimize kernel switches and synchronization overhead. Our experimental results demonstrate that LeapGNN achieves a performance speedup of up to 4.2x compared to the state-of-the-art method, namely P3.
Universal Finite-State and Self-Stabilizing Computation in Anonymous Dynamic Networks
Giuseppe A. Di Luna, Giovanni Viglietta
arXiv:2409.00688 (2024-09-01)

A network is said to be "anonymous" if its agents are indistinguishable from each other; it is "dynamic" if its communication links may appear or disappear unpredictably over time. Assuming that an anonymous dynamic network is always connected and each of its $n$ agents is initially given an input, it takes $2n$ communication rounds for the agents to compute an arbitrary (frequency-based) function of such inputs (Di Luna-Viglietta, DISC 2023).

It is known that, without making additional assumptions on the network and without knowing the number of agents $n$, it is impossible to compute most functions and explicitly terminate. In fact, current state-of-the-art algorithms only achieve stabilization, i.e., they allow each agent to return an output after every communication round; outputs can be changed, and are guaranteed to all be correct after $2n$ rounds. Such algorithms rely on the incremental construction of a data structure called a "history tree", which is augmented at every round. Thus, they end up consuming an unlimited amount of memory, and are also prone to errors in case of memory loss or corruption.

In this paper, we provide a general self-stabilizing algorithm for anonymous dynamic networks that stabilizes in $\max\{4n-2h, 2h\}$ rounds (where $h$ measures the amount of corrupted data initially present in the memory of each agent), as well as a general finite-state algorithm that stabilizes in $3n^2$ rounds. Our work improves upon previously known methods that only apply to static networks (Boldi-Vigna, Dist. Comp. 2002). In addition, we develop new fundamental techniques and operations involving history trees, which are of independent interest.
Demo: FedCampus: A Real-world Privacy-preserving Mobile Application for Smart Campus via Federated Learning & Analytics
Jiaxiang Geng, Beilong Tang, Boyan Zhang, Jiaqi Shao, Bing Luo
arXiv:2409.00327 (2024-08-31)

In this demo, we introduce FedCampus, a privacy-preserving mobile application for smart campus with federated learning (FL) and federated analytics (FA). FedCampus enables cross-platform on-device FL/FA for both iOS and Android, supporting continuous model and algorithm deployment (MLOps). Our app integrates data from smartwatches, processed in a privacy-preserving manner via differential privacy (DP); the processed parameters are used for FL/FA through the FedCampus backend platform. We distributed 100 smartwatches to volunteers at Duke Kunshan University and have successfully completed a series of smart campus tasks featuring capabilities such as sleep tracking, physical activity monitoring, personalized recommendations, and heavy hitters. Our project is open-sourced at https://github.com/FedCampus/FedCampus_Flutter. See the FedCampus video at https://youtu.be/k5iu46IjA38.