Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization
Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Cong Guo, Siyuan Feng, Jintao Meng, Haidong Lan, Jingwen Leng, Wenxi Zhu, Minwen Deng
arXiv:2409.01075 (2024-09-02)

Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting attention for their ability to handle variable input sizes in real-time applications. However, existing compilation optimization methods for such networks often rely heavily on predefined samples to guide the compilation process, which restricts their adaptability and efficiency. These sample-driven methods struggle to efficiently manage the diverse and unpredictable shapes encountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. Vortex capitalizes on detailed hardware information and hierarchizes the strategy space to facilitate high-performance code generation without relying on runtime shape samples. It features a unique bidirectional compilation workflow, combining top-down abstraction, which aligns tensor program execution with hardware hierarchies, and bottom-up kernel construction, which narrows the search space, enabling Vortex to achieve remarkable efficiency. Comprehensive evaluations confirm that Vortex reduces compilation time by $176\times$ compared to the existing dynamic-shape compiler. Additionally, it substantially outperforms existing vendor-provided libraries and dynamic-shape compilers on both CPU and GPU platforms, delivering speedups of $2.53\times$ and $3.01\times$, respectively.
How local constraints influence network diameter and applications to LCL generalizations
Nicolas Bousquet, Laurent Feuilloley, Théo Pierron
arXiv:2409.01305 (2024-09-02)

In this paper, we investigate how local rules enforced at every node can influence the topology of a network. More precisely, we establish several results on the diameter of trees as a function of the number of nodes, as listed below. These results have important consequences for the landscape of locally checkable labelings (LCL) on unbounded-degree graphs, a case in which our lack of knowledge is in striking contrast with that of bounded-degree graphs, which have been intensively studied recently. [See paper for full abstract.]
LuWu: An End-to-End In-Network Out-of-Core Optimizer for 100B-Scale Model-in-Network Data-Parallel Training on Distributed GPUs
Mo Sun, Zihan Yang, Changyue Liao, Yingtao Li, Fei Wu, Zeke Wang
arXiv:2409.00918 (2024-09-02)

The recent progress made in large language models (LLMs) has brought tremendous application prospects to the world. Growing model sizes demand LLM training on multiple GPUs, and data parallelism is the most popular distributed training strategy due to its simplicity, efficiency, and scalability. Current systems adopt model-sharded data parallelism to enable memory-efficient training; however, existing model-sharded data-parallel systems fail to efficiently utilize GPUs on a commodity GPU cluster with 100 Gbps (or 200 Gbps) inter-GPU bandwidth due to (1) severe interference between collective operations and GPU computation and (2) heavy CPU optimizer overhead. Recent works propose in-network aggregation (INA) to relieve the network bandwidth pressure in data-parallel training, but they are incompatible with model sharding due to their network design. To this end, we propose LuWu, a novel in-network optimizer that enables efficient model-in-network data-parallel training of a 100B-scale model on distributed GPUs. This new data-parallel paradigm keeps a communication pattern similar to model-sharded data parallelism, but with centralized in-network optimizer execution. The key idea is to offload the entire optimizer states and parameters from GPU workers onto an in-network optimizer node, and to offload the entire collective communication from GPU-implemented NCCL to SmartNIC-SmartSwitch co-optimization. The experimental results show that LuWu outperforms the state-of-the-art training system by 3.98x when training a 175B model on an 8-worker cluster.
FlashFlex: Accommodating Large Language Model Training over Heterogeneous Environment
Ran Yan, Youhe Jiang, Wangcheng Tao, Xiaonan Nie, Bin Cui, Binhang Yuan
arXiv:2409.01143 (2024-09-02)

Training large language models (LLMs) is a computationally intensive task, typically conducted in data centers with homogeneous high-performance GPUs. This paper explores an alternative approach: deploying the training computation across heterogeneous GPUs to enable better flexibility and efficiency in heterogeneous resource utilization. To achieve this goal, we propose a novel system, FlashFlex, that can flexibly support an asymmetric partition of the parallel training computations across the scope of data, pipeline, and tensor model parallelism. We further formalize the allocation of asymmetrically partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient solution based on a hierarchical graph partitioning algorithm. Our approach adaptively allocates asymmetric training computations across GPUs, fully leveraging the available computational power. We conduct extensive empirical studies to evaluate the performance of FlashFlex, finding that when training LLMs at different scales (from 7B to 30B), FlashFlex achieves training MFU over a set of heterogeneous GPUs comparable to that of state-of-the-art training systems running over a set of homogeneous high-performance GPUs with the same total peak FLOPS. The smallest achieved MFU gaps are 11.61% and 0.30%, depending on whether the homogeneous setting is equipped with RDMA. Our implementation is available at https://github.com/Relaxed-System-Lab/FlashFlex.
Federated Aggregation of Mallows Rankings: A Comparative Analysis of Borda and Lehmer Coding
Jin Sima, Vishal Rana, Olgica Milenkovic
arXiv:2409.00848 (2024-09-01)

Rank aggregation combines multiple ranked lists into a consensus ranking. In fields like biomedical data sharing, rankings may be distributed and require privacy. This motivates the need for federated rank aggregation protocols, which support distributed, private, and communication-efficient learning across multiple clients with local data. We present the first known federated rank aggregation methods using Borda scoring and Lehmer codes, focusing on the sample complexity of federated algorithms on Mallows distributions with a known scaling factor $\phi$ and an unknown centroid permutation $\sigma_0$. The federated Borda approach involves local client scoring, nontrivial quantization, and privacy-preserving protocols. We show that for $\phi \in [0,1)$ and arbitrary $\sigma_0$ of length $N$, it suffices for each of the $L$ clients to locally aggregate $\max\{C_1(\phi), C_2(\phi)\frac{1}{L}\log \frac{N}{\delta}\}$ rankings, where $C_1(\phi)$ and $C_2(\phi)$ are constants, quantize the result, and send it to the server, which can then recover $\sigma_0$ with probability $\geq 1-\delta$. The communication complexity scales as $NL \log N$. Our results represent the first rigorous analysis of Borda's method in centralized and distributed settings under the Mallows model. The federated Lehmer coding approach creates a local Lehmer code for each client, using a coordinate-majority aggregation approach with specialized quantization methods for efficiency and privacy. We show that for $\phi+\phi^2<1+\phi^N$ and arbitrary $\sigma_0$ of length $N$, it suffices for each of the $L$ clients to locally aggregate $\max\{C_3(\phi), C_4(\phi)\frac{1}{L}\log \frac{N}{\delta}\}$ rankings, where $C_3(\phi)$ and $C_4(\phi)$ are constants. Clients send truncated Lehmer coordinate histograms to the server, which can recover $\sigma_0$ with probability $\geq 1-\delta$. The communication complexity is $\sim O(N\log N L\log L)$.
RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks
Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding
arXiv:2409.00822 (2024-09-01)

Top-k algorithms are essential in various applications, from high-performance computing and information retrieval to big data and neural network model training. This paper introduces RTop-K, a highly efficient parallel row-wise top-k selection algorithm designed for GPUs. RTop-K employs a binary-search-based approach to optimize resource allocation and provides a scalable solution that significantly accelerates top-k operations. We perform a theoretical analysis of the effects of early stopping in our algorithm, demonstrating that it maintains the accuracy of neural network models while enhancing performance. Comprehensive tests show that our GPU implementation of RTop-K outperforms other row-wise top-k GPU implementations, with minimal impact on testing accuracy when early stopping is applied. Notably, RTop-K achieves speedups ranging from $4.245\times$ to $9.506\times$ with early stopping, and $3.936\times$ without early stopping, compared to state-of-the-art implementations. The proposed methods offer significant improvements in the training and inference of graph neural networks (GNNs), addressing critical challenges in latency and throughput on GPU platforms.
Container Data Item: An Abstract Datatype for Efficient Container-based Edge Computing
Md Rezwanur Rahman, Tarun Annapareddy, Shirin Ebadi, Varsha Natarajan, Adarsh Srinivasan, Eric Keller, Shivakant Mishra
arXiv:2409.00801 (2024-09-01)

We present Container Data Item (CDI), an abstract datatype that allows multiple containers to efficiently operate on a common data item while preserving their strong security and isolation semantics. Application developers can use CDIs to enable multiple containers to operate on the same data, synchronize execution among themselves, and control the ownership of the shared data item during runtime. These containers may reside on the same server or on different servers. CDI is designed to support microservice-based applications comprising a set of interconnected microservices, each implemented by a separate dedicated container. CDI preserves the important isolation semantics of containers by ensuring that exactly one container owns a CDI object at any instant, and that ownership of a CDI object may be transferred from one container to another only by the current owner. We present three different implementations of CDI that allow containers residing on the same server, as well as containers residing on different servers, to use CDI to operate efficiently on a common data item. The paper provides an extensive performance evaluation of CDI along with two representative applications: an augmented reality application and a decentralized workflow orchestrator.
HopGNN: Boosting Distributed GNN Training Efficiency via Feature-Centric Model Migration
Weijian Chen, Shuibing He, Haoyang Qu, Xuechen Zhang, Dan Feng
arXiv:2409.00657 (2024-09-01)

Distributed training of graph neural networks (GNNs) has become a crucial technique for processing large graphs. Prevalent GNN frameworks are model-centric, necessitating the transfer of massive graph vertex features to GNN models, which leads to a significant communication bottleneck. Recognizing that the model size is often significantly smaller than the feature size, we propose LeapGNN, a feature-centric framework that reverses this paradigm by bringing GNN models to vertex features. To make this truly effective, we first propose a micrograph-based training strategy that trains the model using a refined structure with superior locality to reduce remote feature retrieval. Then, we devise a feature pre-gathering approach that merges multiple fetch operations into a single one to eliminate redundant feature transmissions. Finally, we employ a micrograph-based merging method that adjusts the number of micrographs for each worker to minimize kernel switches and synchronization overhead. Our experimental results demonstrate that LeapGNN achieves a performance speedup of up to 4.2x compared to the state-of-the-art method, namely P3.
Universal Finite-State and Self-Stabilizing Computation in Anonymous Dynamic Networks
Giuseppe A. Di Luna, Giovanni Viglietta
arXiv:2409.00688 (2024-09-01)

A network is said to be "anonymous" if its agents are indistinguishable from each other; it is "dynamic" if its communication links may appear or disappear unpredictably over time. Assuming that an anonymous dynamic network is always connected and each of its $n$ agents is initially given an input, it takes $2n$ communication rounds for the agents to compute an arbitrary (frequency-based) function of such inputs (Di Luna-Viglietta, DISC 2023).

It is known that, without making additional assumptions on the network and without knowing the number of agents $n$, it is impossible to compute most functions and explicitly terminate. In fact, current state-of-the-art algorithms only achieve stabilization, i.e., they allow each agent to return an output after every communication round; outputs can be changed, and are guaranteed to all be correct after $2n$ rounds. Such algorithms rely on the incremental construction of a data structure called a "history tree", which is augmented at every round. Thus, they end up consuming an unlimited amount of memory, and are also prone to errors in case of memory loss or corruption.

In this paper, we provide a general self-stabilizing algorithm for anonymous dynamic networks that stabilizes in $\max\{4n-2h, 2h\}$ rounds (where $h$ measures the amount of corrupted data initially present in the memory of each agent), as well as a general finite-state algorithm that stabilizes in $3n^2$ rounds. Our work improves upon previously known methods that only apply to static networks (Boldi-Vigna, Dist. Comp. 2002). In addition, we develop new fundamental techniques and operations involving history trees, which are of independent interest.
Demo: FedCampus: A Real-world Privacy-preserving Mobile Application for Smart Campus via Federated Learning & Analytics
Jiaxiang Geng, Beilong Tang, Boyan Zhang, Jiaqi Shao, Bing Luo
arXiv:2409.00327 (2024-08-31)

In this demo, we introduce FedCampus, a privacy-preserving mobile application for smart campus with federated learning (FL) and federated analytics (FA). FedCampus enables cross-platform on-device FL/FA for both iOS and Android, supporting continuous model and algorithm deployment (MLOps). Our app integrates data from smartwatches, processed in a privacy-preserving manner via differential privacy (DP); the processed parameters are used for FL/FA through the FedCampus backend platform. We distributed 100 smartwatches to volunteers at Duke Kunshan University and have successfully completed a series of smart campus tasks featuring capabilities such as sleep tracking, physical activity monitoring, personalized recommendations, and heavy hitters. Our project is open-sourced at https://github.com/FedCampus/FedCampus_Flutter. See the FedCampus video at https://youtu.be/k5iu46IjA38.