Typically, significant efficiency can be achieved by deploying different edge AI models in various real-world scenarios while a few large models manage those edge AI models remotely from cloud servers. However, customizing edge AI models for each user's specific application, or extending current models to new application scenarios, remains a challenge. Inappropriate local training or fine-tuning of edge AI models by users can lead to model malfunction, potentially resulting in legal issues for the manufacturer. To address the aforementioned issues, this paper proposes an innovative framework called "DiReDi", which combines knowledge DIstillation and REverse DIstillation. In the initial step, an edge AI model is trained with presumed data and a knowledge distillation (KD) process using the cloud AI model in the upper management cloud server. This edge AI model is then dispatched to edge AI devices solely for inference in the user's application scenario. When the user needs to update the edge AI model to better fit the actual scenario, the reverse distillation (RD) process is employed to extract from the edge AI model, using the user's exclusive data, the knowledge that captures the difference between user preferences and the manufacturer's presumptions. Only the extracted knowledge is reported back to the upper management cloud server to update the cloud AI model, thus protecting user privacy since no exclusive data is shared. The updated cloud AI model can then update the edge AI model with the extended knowledge. Simulation results demonstrate that the proposed "DiReDi" framework allows the manufacturer to update the user model by learning new knowledge from the user's actual scenario with private data. The initial redundant knowledge is reduced since the retraining emphasizes user private data.
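As a concrete illustration of the cloud-to-edge knowledge distillation (KD) step described above, here is a minimal PyTorch-style sketch of a standard distillation loss; the temperature, loss weighting, and variable names are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a cloud-to-edge KD step: the edge (student) model is trained
# on presumed data to match the cloud (teacher) model's softened outputs.
# Temperature T and weight alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend a hard-label loss with a softened teacher-matching loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage on manufacturer-side (presumed) data:
#   teacher = cloud_model.eval(); student = edge_model.train()
#   loss = kd_loss(student(x), teacher(x).detach(), y)
```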
{"title":"DiReDi: Distillation and Reverse Distillation for AIoT Applications","authors":"Chen Sun, Qing Tong, Wenshuang Yang, Wenqi Zhang","doi":"arxiv-2409.08308","DOIUrl":"https://doi.org/arxiv-2409.08308","url":null,"abstract":"Typically, the significant efficiency can be achieved by deploying different\u0000edge AI models in various real world scenarios while a few large models manage\u0000those edge AI models remotely from cloud servers. However, customizing edge AI\u0000models for each user's specific application or extending current models to new\u0000application scenarios remains a challenge. Inappropriate local training or fine\u0000tuning of edge AI models by users can lead to model malfunction, potentially\u0000resulting in legal issues for the manufacturer. To address aforementioned\u0000issues, this paper proposes an innovative framework called \"DiReD\", which\u0000involves knowledge DIstillation & REverse DIstillation. In the initial step, an\u0000edge AI model is trained with presumed data and a KD process using the cloud AI\u0000model in the upper management cloud server. This edge AI model is then\u0000dispatched to edge AI devices solely for inference in the user's application\u0000scenario. When the user needs to update the edge AI model to better fit the\u0000actual scenario, the reverse distillation (RD) process is employed to extract\u0000the knowledge: the difference between user preferences and the manufacturer's\u0000presumptions from the edge AI model using the user's exclusive data. Only the\u0000extracted knowledge is reported back to the upper management cloud server to\u0000update the cloud AI model, thus protecting user privacy by not using any\u0000exclusive data. The updated cloud AI can then update the edge AI model with the\u0000extended knowledge. Simulation results demonstrate that the proposed \"DiReDi\"\u0000framework allows the manufacturer to update the user model by learning new\u0000knowledge from the user's actual scenario with private data. The initial\u0000redundant knowledge is reduced since the retraining emphasizes user private\u0000data.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kangyang Luo, Shuai Wang, Yexuan Fu, Renrong Shao, Xiang Li, Yunshi Lan, Ming Gao, Jinlong Shu
Federated Learning (FL) is a distributed machine learning scheme in which clients jointly participate in the collaborative training of a global model by sharing model information rather than their private datasets. In light of concerns associated with communication and privacy, one-shot FL with a single communication round has emerged as a promising de facto solution. However, existing one-shot FL methods either require public datasets, focus on model-homogeneous settings, or distill limited knowledge from local models, making it difficult or even impractical to train a robust global model. To address these limitations, we propose a new data-free dual-generator adversarial distillation method (namely DFDG) for one-shot FL, which can explore a broader training space of the local models by training dual generators. DFDG is executed in an adversarial manner and comprises two parts: dual-generator training and dual-model distillation. In dual-generator training, we examine each generator with respect to fidelity, transferability, and diversity to ensure its utility, and additionally tailor the cross-divergence loss to lessen the overlap of the dual generators' output spaces. In dual-model distillation, the trained dual generators work together to provide training data for updates of the global model. Finally, our extensive experiments on various image classification tasks show that DFDG achieves significant accuracy gains compared to SOTA baselines.
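To make the two-part structure concrete, the following is a loose PyTorch sketch of one data-free distillation round with dual generators; the fidelity term and the stand-in for the cross-divergence loss are simplified assumptions, not the paper's exact objectives.

```python
# Illustrative sketch of one data-free dual-generator round: each generator is
# pushed toward samples the local teachers classify confidently while staying
# away from the other generator's outputs, then the global student distills
# from the teacher ensemble on the generated data. Losses are placeholders.
import torch
import torch.nn.functional as F

def generator_step(gen, other_gen, teachers, opt, z_dim=100, batch=64):
    z = torch.randn(batch, z_dim)
    x = gen(z)
    t_logits = torch.stack([t(x) for t in teachers]).mean(0)
    fidelity = F.cross_entropy(t_logits, t_logits.argmax(1))   # teacher-confident samples
    with torch.no_grad():
        x_other = other_gen(torch.randn(batch, z_dim))
    overlap = -F.mse_loss(x, x_other)                          # crude stand-in for cross-divergence
    loss = fidelity + 0.1 * overlap
    opt.zero_grad(); loss.backward(); opt.step()

def distill_step(student, gens, teachers, opt, z_dim=100, batch=64):
    x = torch.cat([g(torch.randn(batch, z_dim)).detach() for g in gens])
    with torch.no_grad():
        target = torch.stack([t(x) for t in teachers]).mean(0)
    loss = F.kl_div(F.log_softmax(student(x), 1), F.softmax(target, 1), reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```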
{"title":"DFDG: Data-Free Dual-Generator Adversarial Distillation for One-Shot Federated Learning","authors":"Kangyang Luo, Shuai Wang, Yexuan Fu, Renrong Shao, Xiang Li, Yunshi Lan, Ming Gao, Jinlong Shu","doi":"arxiv-2409.07734","DOIUrl":"https://doi.org/arxiv-2409.07734","url":null,"abstract":"Federated Learning (FL) is a distributed machine learning scheme in which\u0000clients jointly participate in the collaborative training of a global model by\u0000sharing model information rather than their private datasets. In light of\u0000concerns associated with communication and privacy, one-shot FL with a single\u0000communication round has emerged as a de facto promising solution. However,\u0000existing one-shot FL methods either require public datasets, focus on model\u0000homogeneous settings, or distill limited knowledge from local models, making it\u0000difficult or even impractical to train a robust global model. To address these\u0000limitations, we propose a new data-free dual-generator adversarial distillation\u0000method (namely DFDG) for one-shot FL, which can explore a broader local models'\u0000training space via training dual generators. DFDG is executed in an adversarial\u0000manner and comprises two parts: dual-generator training and dual-model\u0000distillation. In dual-generator training, we delve into each generator\u0000concerning fidelity, transferability and diversity to ensure its utility, and\u0000additionally tailor the cross-divergence loss to lessen the overlap of dual\u0000generators' output spaces. In dual-model distillation, the trained dual\u0000generators work together to provide the training data for updates of the global\u0000model. At last, our extensive experiments on various image classification tasks\u0000show that DFDG achieves significant performance gains in accuracy compared to\u0000SOTA baselines.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yibin Xu, Jianhua Shao, Tijs Slaats, Boris Düdder, Yongluan Zhou
Vote-based blockchains construct a state machine replication (SMR) system among participating nodes, using Byzantine Fault Tolerance (BFT) consensus protocols to transition from one state to another. Currently, they rely either on synchronous or partially synchronous networks with leader-based coordination, or on costly Asynchronous Common Subset (ACS) protocols in asynchronous settings, making them impractical for large-scale asynchronous applications. To make asynchronous SMR scalable, this paper proposes a "validated strong" BFT consensus model that allows leader-based coordination in asynchronous settings. Our BFT consensus model offers the same level of tolerance as binary Byzantine agreement but does not demand consistency among honest nodes before they vote. An SMR using our model allows nodes to operate in different, tentative, but mutually exclusive states until they eventually converge on the same state. We propose an asynchronous BFT protocol for vote-based blockchains employing our consensus model to address several critical challenges: how to ensure that nodes eventually converge on the same state across voting rounds, how to assure that a blockchain will steadily progress through epochs while reaching consensus for previous epochs, and how to maintain robust Byzantine fault tolerance. Our protocol greatly reduces message complexity and is the first to achieve linear view changes without relying on threshold signatures. We prove that an asynchronous blockchain built on our protocol can operate with the same simplicity and efficiency as partially synchronous blockchains built on, e.g., HotStuff-2. This facilitates deploying asynchronous blockchains across large-scale networks.
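For background on the voting rule such protocols build on, the sketch below shows the standard Byzantine quorum check (n = 3f + 1 replicas, 2f + 1 matching votes to commit); this is generic BFT machinery, not the validated strong consensus model proposed in the paper.

```python
# Standard Byzantine quorum rule: with n = 3f + 1 replicas, a value is accepted
# once 2f + 1 votes for it are gathered. Generic background, not the paper's protocol.
from collections import Counter

def quorum_size(n: int) -> int:
    f = (n - 1) // 3            # maximum number of tolerated Byzantine nodes
    return 2 * f + 1

def tally(votes: dict[str, str], n: int) -> str | None:
    """votes maps node id -> voted block hash; return a committed hash or None."""
    block, count = Counter(votes.values()).most_common(1)[0]
    return block if count >= quorum_size(n) else None

# Example: 4 replicas tolerate f = 1 fault, so the quorum is 3 matching votes.
print(tally({"n1": "B", "n2": "B", "n3": "B", "n4": "A"}, n=4))  # -> "B"
```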
{"title":"A Study on Asynchronous Vote-based Blockchains","authors":"Yibin Xu, Jianhua Shao, Tijs Slaats, Boris Düdder, Yongluan Zhou","doi":"arxiv-2409.08161","DOIUrl":"https://doi.org/arxiv-2409.08161","url":null,"abstract":"Vote-based blockchains construct a state machine replication (SMR) system\u0000among participating nodes, using Byzantine Fault Tolerance (BFT) consensus\u0000protocols to transition from one state to another. Currently, they rely on\u0000either synchronous or partially synchronous networks with leader-based\u0000coordination or costly Asynchronous Common Subset (ACS) protocols in\u0000asynchronous settings, making them impractical for large-scale asynchronous\u0000applications. To make Asynchronous SMR scalable, this paper proposes a emph{validated\u0000strong} BFT consensus model that allows leader-based coordination in\u0000asynchronous settings. Our BFT consensus model offers the same level of\u0000tolerance as binary byzantine agreement but does not demand consistency among\u0000honest nodes before they vote. An SMR using our model allows nodes to operate\u0000in different, tentative, but mutually exclusive states until they eventually\u0000converge on the same state. We propose an asynchronous BFT protocol for\u0000vote-based blockchains employing our consensus model to address several\u0000critical challenges: how to ensure that nodes eventually converge on the same\u0000state across voting rounds, how to assure that a blockchain will steadily\u0000progress through epochs while reaching consensus for previous epochs, and how\u0000to maintain robust byzantine fault tolerance. Our protocol greatly reduces message complexity and is the first one to\u0000achieve linear view changes without relying on threshold signatures. We prove\u0000that an asynchronous blockchain built on our protocol can operate with the\u0000emph{same} simplicity and efficiency as partially synchronous blockchains\u0000built on, e.g. HotStuff-2. This facilitates deploying asynchronous blockchains\u0000across large-scale networks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deploying deep learning models on Internet of Things (IoT) devices often faces challenges due to limited memory resources and computing capabilities. Cooperative inference is an important method for addressing this issue, requiring the partitioning and distributed deployment of an intelligent model. To perform horizontal partitions, existing cooperative inference methods take either the output channel of operators or the height and width of feature maps as the partition dimensions. In this manner, since the activation of an operator is distributed, the pieces have to be concatenated before being fed to the next operator, which incurs delay in cooperative inference. In this paper, we propose the Interleaved Operator Partitioning (IOP) strategy for CNN models. By partitioning an operator based on the output channel dimension and its successive operator based on the input channel dimension, activation concatenation becomes unnecessary, thereby reducing the number of communication connections and, consequently, cooperative inference delay. Based on IOP, we further present a model segmentation algorithm for minimizing cooperative inference time, which greedily selects operators for IOP pairing based on the harvested inference delay benefit. Experimental results demonstrate that, compared with the state-of-the-art partition approaches used in CoEdge, the IOP strategy achieves 6.39%-16.83% faster inference and reduces peak memory footprint by 21.22%-49.98% for three classical image classification models.
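The interleaving trick can be verified in a few lines: if the first convolution is split along its output channels and the next one along its input channels, each device produces a partial result of the second convolution and only an element-wise sum is needed, with no activation concatenation. The layer shapes and two-device split below are illustrative assumptions.

```python
# Two consecutive convolutions partitioned in an interleaved fashion:
# conv1 is split by output channels, conv2 by input channels, so each "device"
# computes a partial conv2 result locally and the partials are simply summed.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 8, 3, padding=1)
conv2 = nn.Conv2d(8, 16, 3, padding=1)
x = torch.randn(1, 3, 32, 32)
reference = conv2(conv1(x))

partials = []
for dev, (lo, hi) in enumerate([(0, 4), (4, 8)]):   # device 0: channels 0-3, device 1: 4-7
    c1 = nn.Conv2d(3, hi - lo, 3, padding=1)
    c1.weight.data = conv1.weight.data[lo:hi]
    c1.bias.data = conv1.bias.data[lo:hi]
    c2 = nn.Conv2d(hi - lo, 16, 3, padding=1, bias=(dev == 0))  # add conv2's bias only once
    c2.weight.data = conv2.weight.data[:, lo:hi]
    if dev == 0:
        c2.bias.data = conv2.bias.data
    partials.append(c2(c1(x)))

assert torch.allclose(sum(partials), reference, atol=1e-5)
```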
{"title":"Cooperative Inference with Interleaved Operator Partitioning for CNNs","authors":"Zhibang Liu, Chaonong Xu, Zhizhuo Liu, Lekai Huang, Jiachen Wei, Chao Li","doi":"arxiv-2409.07693","DOIUrl":"https://doi.org/arxiv-2409.07693","url":null,"abstract":"Deploying deep learning models on Internet of Things (IoT) devices often\u0000faces challenges due to limited memory resources and computing capabilities.\u0000Cooperative inference is an important method for addressing this issue,\u0000requiring the partitioning and distributive deployment of an intelligent model.\u0000To perform horizontal partitions, existing cooperative inference methods take\u0000either the output channel of operators or the height and width of feature maps\u0000as the partition dimensions. In this manner, since the activation of operators\u0000is distributed, they have to be concatenated together before being fed to the\u0000next operator, which incurs the delay for cooperative inference. In this paper,\u0000we propose the Interleaved Operator Partitioning (IOP) strategy for CNN models.\u0000By partitioning an operator based on the output channel dimension and its\u0000successive operator based on the input channel dimension, activation\u0000concatenation becomes unnecessary, thereby reducing the number of communication\u0000connections, which consequently reduces cooperative inference de-lay. Based on\u0000IOP, we further present a model segmentation algorithm for minimizing\u0000cooperative inference time, which greedily selects operators for IOP pairing\u0000based on the inference delay benefit harvested. Experimental results\u0000demonstrate that compared with the state-of-the-art partition approaches used\u0000in CoEdge, the IOP strategy achieves 6.39% ~ 16.83% faster acceleration and\u0000reduces peak memory footprint by 21.22% ~ 49.98% for three classical image\u0000classification models.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data backup is a core technology for improving system resilience to failures. Data backup in enterprise systems is required to minimize the impact on business processing, which can be categorized into two factors: system slowdown and downtime. To eliminate system slowdown, asynchronous data copy (ADC) technology is prevalent, which copies data asynchronously with respect to the original data updates. However, ADC can corrupt backup data when applied to enterprise systems with multiple resources. Our demonstration system therefore employs consistency group technology, which keeps the order of data updates the same between the original and backup data. In addition, we developed a container platform operator to unravel the complicated correspondence between storage volumes and applications. The operator automates the configuration of the ADC together with the setting of consistency groups. We integrated the storage and container technologies into the demonstration system, which eliminates both system slowdown and downtime.
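The role of the consistency group can be illustrated with a toy sketch (not the demonstration system's implementation): all volumes in the group share one ordered update log, so the asynchronously maintained backup never reflects a later write without the earlier writes it depends on.

```python
# Toy illustration of a consistency group for asynchronous data copy (ADC):
# writes to any volume in the group go through a single ordered log, and the
# backup applies them in that same order.
import queue
import threading

class ConsistencyGroup:
    def __init__(self, volumes):
        self.log = queue.Queue()                  # one ordered log for the whole group
        self.backup = {v: {} for v in volumes}
        threading.Thread(target=self._apply, daemon=True).start()

    def write(self, volume, key, value):
        # The primary write returns immediately; the copy happens asynchronously.
        self.log.put((volume, key, value))

    def _apply(self):
        while True:
            volume, key, value = self.log.get()   # preserves the original update order
            self.backup[volume][key] = value

group = ConsistencyGroup(["db-data", "db-wal"])
group.write("db-wal", "txn42", "BEGIN")
group.write("db-data", "row7", "updated")
group.write("db-wal", "txn42", "COMMIT")
```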
{"title":"Data Backup System with No Impact on Business Processing Utilizing Storage and Container Technologies","authors":"Satoru Watanabe","doi":"arxiv-2409.07081","DOIUrl":"https://doi.org/arxiv-2409.07081","url":null,"abstract":"Data backup is a core technology for improving system resilience to system\u0000failures. Data backup in enterprise systems is required to minimize the impacts\u0000on business processing, which can be categorized into two factors: system\u0000slowdown and downtime. To eliminate system slowdown, asynchronous data copy\u0000(ADC) technology is prevalent, which copies data asynchronously with original\u0000data updates. However, the ADC can collapse backup data when applied to\u0000enterprise systems with multiple resources. Then, the demonstration system\u0000employed consistency group technology, which makes the order of data updates\u0000the same between the original and backup data. In addition, we developed a\u0000container platform operator to unravel the complicated correspondence between\u0000storage volumes and applications. The operator automates the configuration of\u0000the ADC with the setting of consistency groups. We integrated the storage and\u0000container technologies into the demonstration system, which can eliminate both\u0000system slowdown and downtime.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiashu Zhang, Zihan Pan, Molly Xu, Khuzaima Daudjee, Sihang Liu
The occurrence of bubbles in pipeline parallelism is an inherent limitation that can account for more than 40% of the large language model (LLM) training time and is one of the main reasons for the underutilization of GPU resources in LLM training. Harvesting these bubbles for GPU side tasks can increase resource utilization and reduce training costs but comes with challenges. First, because bubbles are discontinuous with various shapes, programming side tasks becomes difficult while requiring excessive engineering effort. Second, a side task can compete with pipeline training for GPU resources and incur significant overhead. To address these challenges, we propose FreeRide, a system designed to harvest bubbles in pipeline parallelism for side tasks. FreeRide provides programmers with interfaces to implement side tasks easily, manages bubbles and side tasks during pipeline training, and controls access to GPU resources by side tasks to reduce overhead. We demonstrate that FreeRide achieves 7.8% average cost savings with a negligible overhead of about 1% in training LLMs while serving model training, graph analytics, and image processing side tasks.
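The following sketch illustrates the kind of iterative side-task interface a bubble manager could drive, starting work when a bubble opens and stopping before the pipeline needs the GPU again; the class and method names are assumptions for illustration, not FreeRide's actual API.

```python
# Illustrative side-task interface driven by a bubble manager: the manager runs
# as many short, resumable side-task steps as fit inside a pipeline bubble.
# Names and the 10% headroom are assumptions, not FreeRide's real interfaces.
class SideTask:
    def setup(self):            # allocate model/buffers once
        raise NotImplementedError
    def step(self):             # one short, resumable unit of work
        raise NotImplementedError
    def teardown(self):
        raise NotImplementedError

class BubbleManager:
    def __init__(self, task: SideTask):
        self.task = task
        self.task.setup()

    def on_bubble(self, bubble_ms: float, step_cost_ms: float):
        # Run as many side-task steps as fit inside the bubble, leaving headroom
        # so the side task never delays the next pipeline stage.
        steps = int(bubble_ms * 0.9 // step_cost_ms)
        for _ in range(steps):
            self.task.step()
```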
{"title":"FreeRide: Harvesting Bubbles in Pipeline Parallelism","authors":"Jiashu ZhangYiming, Zihan PanYiming, MollyYiming, Xu, Khuzaima Daudjee, Sihang Liu","doi":"arxiv-2409.06941","DOIUrl":"https://doi.org/arxiv-2409.06941","url":null,"abstract":"The occurrence of bubbles in pipeline parallelism is an inherent limitation\u0000that can account for more than 40% of the large language model (LLM) training\u0000time and is one of the main reasons for the underutilization of GPU resources\u0000in LLM training. Harvesting these bubbles for GPU side tasks can increase\u0000resource utilization and reduce training costs but comes with challenges.\u0000First, because bubbles are discontinuous with various shapes, programming side\u0000tasks becomes difficult while requiring excessive engineering effort. Second, a\u0000side task can compete with pipeline training for GPU resources and incur\u0000significant overhead. To address these challenges, we propose FreeRide, a\u0000system designed to harvest bubbles in pipeline parallelism for side tasks.\u0000FreeRide provides programmers with interfaces to implement side tasks easily,\u0000manages bubbles and side tasks during pipeline training, and controls access to\u0000GPU resources by side tasks to reduce overhead. We demonstrate that FreeRide\u0000achieves 7.8% average cost savings with a negligible overhead of about 1% in\u0000training LLMs while serving model training, graph analytics, and image\u0000processing side tasks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiaxang Tang, Zeshan Fayyaz, Mohammad A. Salahuddin, Raouf Boutaba, Zhi-Li Zhang, Ali Anwar
Federated Learning is a well-researched approach for collaboratively training machine learning models across decentralized data while preserving privacy. However, integrating Homomorphic Encryption to ensure data confidentiality introduces significant computational and communication overheads, particularly in heterogeneous environments where clients have varying computational capacities and security needs. In this paper, we propose HERL, a Reinforcement Learning-based approach that uses Q-Learning to dynamically optimize encryption parameters, specifically the polynomial modulus degree, $N$, and the coefficient modulus, $q$, across different client tiers. Our proposed method involves first profiling and tiering clients according to the chosen clustering approach, followed by dynamically selecting the most suitable encryption parameters using an RL-agent. Experimental results demonstrate that our approach significantly reduces the computational overhead while maintaining utility and a high level of security. Empirical results show that HERL improves utility by 17%, reduces the convergence time by up to 24%, and increases convergence efficiency by up to 30%, with minimal security loss.
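A minimal sketch of the tier-level Q-learning loop is given below; the discrete grid of (N, q) choices, the reward shaping, and the hyperparameters are illustrative assumptions, not HERL's actual configuration.

```python
# Epsilon-greedy Q-learning over a discrete grid of homomorphic-encryption
# parameters (polynomial modulus degree N, coefficient modulus bits q), with
# one agent per client tier. The grid and hyperparameters are illustrative.
import random
from collections import defaultdict

ACTIONS = [(8192, 200), (8192, 300), (16384, 300), (16384, 400), (32768, 600)]

class TierAgent:
    def __init__(self, eps=0.1, lr=0.5, gamma=0.9):
        self.q = defaultdict(float)           # Q-values keyed by (state, action index)
        self.eps, self.lr, self.gamma = eps, lr, gamma

    def act(self, state):
        if random.random() < self.eps:
            return random.randrange(len(ACTIONS))
        return max(range(len(ACTIONS)), key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in range(len(ACTIONS)))
        td = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.lr * td

# The reward could trade utility against overhead, e.g.
#   reward = utility_delta - lambda_ * round_latency
```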
{"title":"HERL: Tiered Federated Learning with Adaptive Homomorphic Encryption using Reinforcement Learning","authors":"Jiaxang Tang, Zeshan Fayyaz, Mohammad A. Salahuddin, Raouf Boutaba, Zhi-Li Zhang, Ali Anwar","doi":"arxiv-2409.07631","DOIUrl":"https://doi.org/arxiv-2409.07631","url":null,"abstract":"Federated Learning is a well-researched approach for collaboratively training\u0000machine learning models across decentralized data while preserving privacy.\u0000However, integrating Homomorphic Encryption to ensure data confidentiality\u0000introduces significant computational and communication overheads, particularly\u0000in heterogeneous environments where clients have varying computational\u0000capacities and security needs. In this paper, we propose HERL, a Reinforcement\u0000Learning-based approach that uses Q-Learning to dynamically optimize encryption\u0000parameters, specifically the polynomial modulus degree, $N$, and the\u0000coefficient modulus, $q$, across different client tiers. Our proposed method\u0000involves first profiling and tiering clients according to the chosen clustering\u0000approach, followed by dynamically selecting the most suitable encryption\u0000parameters using an RL-agent. Experimental results demonstrate that our\u0000approach significantly reduces the computational overhead while maintaining\u0000utility and a high level of security. Empirical results show that HERL improves\u0000utility by 17%, reduces the convergence time by up to 24%, and increases\u0000convergence efficiency by up to 30%, with minimal security loss.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chayanon (Namo) Wichitrnithed, Woo-Sun Yang, Yun (Helen) He, Brad Richardson, Koichi Sakaguchi, Manuel Arenaz, William I. Gustafson Jr., Jacob Shpund, Ulises Costi Blanco, Alvaro Goldar Dieste
Currently, the Weather Research and Forecasting model (WRF) utilizes shared memory (OpenMP) and distributed memory (MPI) parallelisms. To take advantage of GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM) microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives. To facilitate this process, we explore a workflow for optimization which uses both runtime profilers and a static code inspection tool Codee to refactor the subroutine. We observe a 2.08x overall speedup for the CONUS-12km thunderstorm test case.
{"title":"Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee","authors":"ChayanonNamo, WichitrnithedHelen, Woo-Sun-YangHelen, YunHelen, He, Brad Richardson, Koichi Sakaguchi, Manuel Arenaz, William I. Gustafson Jr., Jacob Shpund, Ulises Costi Blanco, Alvaro Goldar Dieste","doi":"arxiv-2409.07232","DOIUrl":"https://doi.org/arxiv-2409.07232","url":null,"abstract":"Currently, the Weather Research and Forecasting model (WRF) utilizes shared\u0000memory (OpenMP) and distributed memory (MPI) parallelisms. To take advantage of\u0000GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the\u0000computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM)\u0000microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives.\u0000To facilitate this process, we explore a workflow for optimization which uses\u0000both runtime profilers and a static code inspection tool Codee to refactor the\u0000subroutine. We observe a 2.08x overall speedup for the CONUS-12km thunderstorm\u0000test case.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pranav Rama, Madison Threadgill, Andreas Gerstlauer
The training of deep and/or convolutional neural networks (DNNs/CNNs) is traditionally done on servers with powerful CPUs and GPUs. Recent efforts have emerged to localize machine learning tasks fully on the edge. This brings advantages in reduced latency and increased privacy, but necessitates working with resource-constrained devices. Approaches for inference and training on mobile and edge devices based on pruning, quantization, or incremental and transfer learning require trading off accuracy. Several works have instead explored distributing inference operations on mobile and edge clusters. However, there is limited literature on distributed training on the edge. Existing approaches all require a central, potentially powerful edge or cloud server for coordination or offloading. In this paper, we describe an approach for distributed CNN training exclusively on mobile and edge devices. Our approach is beneficial for the initial CNN layers, which are feature-map dominated. It is based on partitioning forward-inference and back-propagation operations among devices through tiling and fusing to maximize locality and expose communication- and memory-aware parallelism. We also introduce the concept of layer grouping to further fine-tune performance based on the computation-communication trade-off. Results show that for a cluster of 2-6 quad-core Raspberry Pi 3 devices, training an object-detection CNN provides a 2x-15x speedup with respect to a single core and up to an 8x reduction in memory usage per device, all without sacrificing accuracy. Grouping offers up to 1.5x speedup depending on the reference profile and batch size.
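The tiling idea for the feature-map-dominated early layers can be checked directly: each device receives a horizontal tile of the input plus a one-row halo so a 3x3 convolution is computed entirely locally, and the output tiles are stacked back together. The layer, input size, and two-device split below are illustrative assumptions.

```python
# Spatial tiling of a convolution across two "devices": each device convolves
# its input tile (plus a one-row halo) locally; the output tiles concatenate to
# the full result. Zero-padding is applied only at the true image border.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(3, 16, 3, padding=1)
x = torch.randn(1, 3, 64, 64)
reference = conv(x)

halo = 1                                    # one extra input row per side for a 3x3 kernel
tiles = []
for lo, hi in [(0, 32), (32, 64)]:          # two devices, each owning 32 output rows
    in_lo, in_hi = max(lo - halo, 0), min(hi + halo, 64)
    pad_top = halo if lo == 0 else 0
    pad_bottom = halo if hi == 64 else 0
    tile = F.pad(x[:, :, in_lo:in_hi], (1, 1, pad_top, pad_bottom))
    tiles.append(F.conv2d(tile, conv.weight, conv.bias))

out = torch.cat(tiles, dim=2)
assert out.shape == reference.shape and torch.allclose(out, reference, atol=1e-5)
```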
{"title":"Distributed Convolutional Neural Network Training on Mobile and Edge Clusters","authors":"Pranav Rama, Madison Threadgill, Andreas Gerstlauer","doi":"arxiv-2409.09083","DOIUrl":"https://doi.org/arxiv-2409.09083","url":null,"abstract":"The training of deep and/or convolutional neural networks (DNNs/CNNs) is\u0000traditionally done on servers with powerful CPUs and GPUs. Recent efforts have\u0000emerged to localize machine learning tasks fully on the edge. This brings\u0000advantages in reduced latency and increased privacy, but necessitates working\u0000with resource-constrained devices. Approaches for inference and training in\u0000mobile and edge devices based on pruning, quantization or incremental and\u0000transfer learning require trading off accuracy. Several works have explored\u0000distributing inference operations on mobile and edge clusters instead. However,\u0000there is limited literature on distributed training on the edge. Existing\u0000approaches all require a central, potentially powerful edge or cloud server for\u0000coordination or offloading. In this paper, we describe an approach for\u0000distributed CNN training exclusively on mobile and edge devices. Our approach\u0000is beneficial for the initial CNN layers that are feature map dominated. It is\u0000based on partitioning forward inference and back-propagation operations among\u0000devices through tiling and fusing to maximize locality and expose communication\u0000and memory-aware parallelism. We also introduce the concept of layer grouping\u0000to further fine-tune performance based on computation and communication\u0000trade-off. Results show that for a cluster of 2-6 quad-core Raspberry Pi3\u0000devices, training of an object-detection CNN provides a 2x-15x speedup with\u0000respect to a single core and up to 8x reduction in memory usage per device, all\u0000without sacrificing accuracy. Grouping offers up to 1.5x speedup depending on\u0000the reference profile and batch size.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Neural Network (GNN) models on streaming graphs entail algorithmic challenges to continuously capture the graph's dynamic state, as well as systems challenges to optimize latency, memory, and throughput during both inference and training. We present D3-GNN, the first distributed, hybrid-parallel, streaming GNN system designed to handle real-time graph updates under an online query setting. Our system addresses data management, algorithmic, and systems challenges, enabling continuous capture of the dynamic state of the graph and updating node representations with fault tolerance and optimal latency, load balance, and throughput. D3-GNN utilizes streaming GNN aggregators and an unrolled, distributed computation graph architecture to handle cascading graph updates. To counteract data skew and neighborhood explosion issues, we introduce inter-layer and intra-layer windowed forward pass solutions. Experiments on large-scale graph streams demonstrate that D3-GNN achieves high efficiency and scalability. Compared to DGL, D3-GNN achieves a significant throughput improvement of about 76x for streaming workloads. The windowed enhancement further reduces running times by around 10x and message volumes by up to 15x at higher parallelism.
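As a simplified illustration of streaming aggregation, the sketch below maintains an incremental mean aggregate per node and updates it in O(1) per arriving edge instead of recomputing neighborhoods; the mean aggregator and tensor shapes are illustrative choices, not D3-GNN's actual operators.

```python
# Incremental streaming aggregation: when edge (u, v) arrives, node v's running
# neighborhood aggregate is updated in place rather than recomputed.
import torch

class StreamingMeanAggregator:
    def __init__(self, num_nodes, dim):
        self.sum = torch.zeros(num_nodes, dim)
        self.count = torch.zeros(num_nodes, 1)

    def add_edge(self, u, v, features):
        # features[u] flows into v's aggregate; O(1) work per edge event.
        self.sum[v] += features[u]
        self.count[v] += 1

    def aggregate(self, v):
        return self.sum[v] / self.count[v].clamp(min=1)

feats = torch.randn(100, 16)
agg = StreamingMeanAggregator(100, 16)
agg.add_edge(3, 7, feats)
agg.add_edge(5, 7, feats)
print(agg.aggregate(7).shape)   # torch.Size([16])
```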
{"title":"D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks","authors":"Rustam Guliyev, Aparajita Haldar, Hakan Ferhatosmanoglu","doi":"arxiv-2409.09079","DOIUrl":"https://doi.org/arxiv-2409.09079","url":null,"abstract":"Graph Neural Network (GNN) models on streaming graphs entail algorithmic\u0000challenges to continuously capture its dynamic state, as well as systems\u0000challenges to optimize latency, memory, and throughput during both inference\u0000and training. We present D3-GNN, the first distributed, hybrid-parallel,\u0000streaming GNN system designed to handle real-time graph updates under online\u0000query setting. Our system addresses data management, algorithmic, and systems\u0000challenges, enabling continuous capturing of the dynamic state of the graph and\u0000updating node representations with fault-tolerance and optimal latency,\u0000load-balance, and throughput. D3-GNN utilizes streaming GNN aggregators and an\u0000unrolled, distributed computation graph architecture to handle cascading graph\u0000updates. To counteract data skew and neighborhood explosion issues, we\u0000introduce inter-layer and intra-layer windowed forward pass solutions.\u0000Experiments on large-scale graph streams demonstrate that D3-GNN achieves high\u0000efficiency and scalability. Compared to DGL, D3-GNN achieves a significant\u0000throughput improvement of about 76x for streaming workloads. The windowed\u0000enhancement further reduces running times by around 10x and message volumes by\u0000up to 15x at higher parallelism.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}