Typically, significant efficiency can be achieved by deploying different edge AI models in various real-world scenarios while a few large models manage those edge AI models remotely from cloud servers. However, customizing edge AI models for each user's specific application, or extending current models to new application scenarios, remains a challenge. Inappropriate local training or fine-tuning of edge AI models by users can lead to model malfunction, potentially resulting in legal issues for the manufacturer. To address the aforementioned issues, this paper proposes an innovative framework called "DiReDi", which combines knowledge DIstillation and REverse DIstillation. In the initial step, an edge AI model is trained with presumed data and a knowledge distillation (KD) process using the cloud AI model in the upper management cloud server. This edge AI model is then dispatched to edge AI devices solely for inference in the user's application scenario. When the user needs to update the edge AI model to better fit the actual scenario, the reverse distillation (RD) process is employed to extract from the edge AI model, using the user's exclusive data, the knowledge that captures the difference between user preferences and the manufacturer's presumptions. Only the extracted knowledge is reported back to the upper management cloud server to update the cloud AI model, thus protecting user privacy since no exclusive data is shared. The updated cloud AI model can then update the edge AI model with the extended knowledge. Simulation results demonstrate that the proposed "DiReDi" framework allows the manufacturer to update the user model by learning new knowledge from the user's actual scenario with private data. The initial redundant knowledge is reduced since the retraining emphasizes user private data.
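As a concrete illustration of the cloud-to-edge knowledge distillation (KD) step described above, here is a minimal PyTorch-style sketch of a standard distillation loss; the temperature, loss weighting, and variable names are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a cloud-to-edge KD step: the edge (student) model is trained
# on presumed data to match the cloud (teacher) model's softened outputs.
# Temperature T and weight alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend a hard-label loss with a softened teacher-matching loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage on manufacturer-side (presumed) data:
#   teacher = cloud_model.eval(); student = edge_model.train()
#   loss = kd_loss(student(x), teacher(x).detach(), y)
```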
{"title":"DiReDi: Distillation and Reverse Distillation for AIoT Applications","authors":"Chen Sun, Qing Tong, Wenshuang Yang, Wenqi Zhang","doi":"arxiv-2409.08308","DOIUrl":"https://doi.org/arxiv-2409.08308","url":null,"abstract":"Typically, the significant efficiency can be achieved by deploying different\u0000edge AI models in various real world scenarios while a few large models manage\u0000those edge AI models remotely from cloud servers. However, customizing edge AI\u0000models for each user's specific application or extending current models to new\u0000application scenarios remains a challenge. Inappropriate local training or fine\u0000tuning of edge AI models by users can lead to model malfunction, potentially\u0000resulting in legal issues for the manufacturer. To address aforementioned\u0000issues, this paper proposes an innovative framework called \"DiReD\", which\u0000involves knowledge DIstillation & REverse DIstillation. In the initial step, an\u0000edge AI model is trained with presumed data and a KD process using the cloud AI\u0000model in the upper management cloud server. This edge AI model is then\u0000dispatched to edge AI devices solely for inference in the user's application\u0000scenario. When the user needs to update the edge AI model to better fit the\u0000actual scenario, the reverse distillation (RD) process is employed to extract\u0000the knowledge: the difference between user preferences and the manufacturer's\u0000presumptions from the edge AI model using the user's exclusive data. Only the\u0000extracted knowledge is reported back to the upper management cloud server to\u0000update the cloud AI model, thus protecting user privacy by not using any\u0000exclusive data. The updated cloud AI can then update the edge AI model with the\u0000extended knowledge. Simulation results demonstrate that the proposed \"DiReDi\"\u0000framework allows the manufacturer to update the user model by learning new\u0000knowledge from the user's actual scenario with private data. The initial\u0000redundant knowledge is reduced since the retraining emphasizes user private\u0000data.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kangyang Luo, Shuai Wang, Yexuan Fu, Renrong Shao, Xiang Li, Yunshi Lan, Ming Gao, Jinlong Shu
Federated Learning (FL) is a distributed machine learning scheme in which clients jointly participate in the collaborative training of a global model by sharing model information rather than their private datasets. In light of concerns associated with communication and privacy, one-shot FL with a single communication round has emerged as a promising de facto solution. However, existing one-shot FL methods either require public datasets, focus on model-homogeneous settings, or distill limited knowledge from local models, making it difficult or even impractical to train a robust global model. To address these limitations, we propose a new data-free dual-generator adversarial distillation method (namely DFDG) for one-shot FL, which can explore a broader training space of the local models by training dual generators. DFDG is executed in an adversarial manner and comprises two parts: dual-generator training and dual-model distillation. In dual-generator training, we examine each generator with respect to fidelity, transferability, and diversity to ensure its utility, and additionally tailor the cross-divergence loss to lessen the overlap of the dual generators' output spaces. In dual-model distillation, the trained dual generators work together to provide training data for updates of the global model. Finally, our extensive experiments on various image classification tasks show that DFDG achieves significant accuracy gains compared to SOTA baselines.
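To make the two-part structure concrete, the following is a loose PyTorch sketch of one data-free distillation round with dual generators; the fidelity term and the stand-in for the cross-divergence loss are simplified assumptions, not the paper's exact objectives.

```python
# Illustrative sketch of one data-free dual-generator round: each generator is
# pushed toward samples the local teachers classify confidently while staying
# away from the other generator's outputs, then the global student distills
# from the teacher ensemble on the generated data. Losses are placeholders.
import torch
import torch.nn.functional as F

def generator_step(gen, other_gen, teachers, opt, z_dim=100, batch=64):
    z = torch.randn(batch, z_dim)
    x = gen(z)
    t_logits = torch.stack([t(x) for t in teachers]).mean(0)
    fidelity = F.cross_entropy(t_logits, t_logits.argmax(1))   # teacher-confident samples
    with torch.no_grad():
        x_other = other_gen(torch.randn(batch, z_dim))
    overlap = -F.mse_loss(x, x_other)                          # crude stand-in for cross-divergence
    loss = fidelity + 0.1 * overlap
    opt.zero_grad(); loss.backward(); opt.step()

def distill_step(student, gens, teachers, opt, z_dim=100, batch=64):
    x = torch.cat([g(torch.randn(batch, z_dim)).detach() for g in gens])
    with torch.no_grad():
        target = torch.stack([t(x) for t in teachers]).mean(0)
    loss = F.kl_div(F.log_softmax(student(x), 1), F.softmax(target, 1), reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
```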
{"title":"DFDG: Data-Free Dual-Generator Adversarial Distillation for One-Shot Federated Learning","authors":"Kangyang Luo, Shuai Wang, Yexuan Fu, Renrong Shao, Xiang Li, Yunshi Lan, Ming Gao, Jinlong Shu","doi":"arxiv-2409.07734","DOIUrl":"https://doi.org/arxiv-2409.07734","url":null,"abstract":"Federated Learning (FL) is a distributed machine learning scheme in which\u0000clients jointly participate in the collaborative training of a global model by\u0000sharing model information rather than their private datasets. In light of\u0000concerns associated with communication and privacy, one-shot FL with a single\u0000communication round has emerged as a de facto promising solution. However,\u0000existing one-shot FL methods either require public datasets, focus on model\u0000homogeneous settings, or distill limited knowledge from local models, making it\u0000difficult or even impractical to train a robust global model. To address these\u0000limitations, we propose a new data-free dual-generator adversarial distillation\u0000method (namely DFDG) for one-shot FL, which can explore a broader local models'\u0000training space via training dual generators. DFDG is executed in an adversarial\u0000manner and comprises two parts: dual-generator training and dual-model\u0000distillation. In dual-generator training, we delve into each generator\u0000concerning fidelity, transferability and diversity to ensure its utility, and\u0000additionally tailor the cross-divergence loss to lessen the overlap of dual\u0000generators' output spaces. In dual-model distillation, the trained dual\u0000generators work together to provide the training data for updates of the global\u0000model. At last, our extensive experiments on various image classification tasks\u0000show that DFDG achieves significant performance gains in accuracy compared to\u0000SOTA baselines.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yibin Xu, Jianhua Shao, Tijs Slaats, Boris Düdder, Yongluan Zhou
Vote-based blockchains construct a state machine replication (SMR) system among participating nodes, using Byzantine Fault Tolerance (BFT) consensus protocols to transition from one state to another. Currently, they rely either on synchronous or partially synchronous networks with leader-based coordination, or on costly Asynchronous Common Subset (ACS) protocols in asynchronous settings, making them impractical for large-scale asynchronous applications. To make asynchronous SMR scalable, this paper proposes a "validated strong" BFT consensus model that allows leader-based coordination in asynchronous settings. Our BFT consensus model offers the same level of tolerance as binary Byzantine agreement but does not demand consistency among honest nodes before they vote. An SMR using our model allows nodes to operate in different, tentative, but mutually exclusive states until they eventually converge on the same state. We propose an asynchronous BFT protocol for vote-based blockchains employing our consensus model to address several critical challenges: how to ensure that nodes eventually converge on the same state across voting rounds, how to assure that a blockchain will steadily progress through epochs while reaching consensus for previous epochs, and how to maintain robust Byzantine fault tolerance. Our protocol greatly reduces message complexity and is the first to achieve linear view changes without relying on threshold signatures. We prove that an asynchronous blockchain built on our protocol can operate with the same simplicity and efficiency as partially synchronous blockchains built on, e.g., HotStuff-2. This facilitates deploying asynchronous blockchains across large-scale networks.
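For background on the voting rule such protocols build on, the sketch below shows the standard Byzantine quorum check (n = 3f + 1 replicas, 2f + 1 matching votes to commit); this is generic BFT machinery, not the validated strong consensus model proposed in the paper.

```python
# Standard Byzantine quorum rule: with n = 3f + 1 replicas, a value is accepted
# once 2f + 1 votes for it are gathered. Generic background, not the paper's protocol.
from collections import Counter

def quorum_size(n: int) -> int:
    f = (n - 1) // 3            # maximum number of tolerated Byzantine nodes
    return 2 * f + 1

def tally(votes: dict[str, str], n: int) -> str | None:
    """votes maps node id -> voted block hash; return a committed hash or None."""
    block, count = Counter(votes.values()).most_common(1)[0]
    return block if count >= quorum_size(n) else None

# Example: 4 replicas tolerate f = 1 fault, so the quorum is 3 matching votes.
print(tally({"n1": "B", "n2": "B", "n3": "B", "n4": "A"}, n=4))  # -> "B"
```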
{"title":"A Study on Asynchronous Vote-based Blockchains","authors":"Yibin Xu, Jianhua Shao, Tijs Slaats, Boris Düdder, Yongluan Zhou","doi":"arxiv-2409.08161","DOIUrl":"https://doi.org/arxiv-2409.08161","url":null,"abstract":"Vote-based blockchains construct a state machine replication (SMR) system\u0000among participating nodes, using Byzantine Fault Tolerance (BFT) consensus\u0000protocols to transition from one state to another. Currently, they rely on\u0000either synchronous or partially synchronous networks with leader-based\u0000coordination or costly Asynchronous Common Subset (ACS) protocols in\u0000asynchronous settings, making them impractical for large-scale asynchronous\u0000applications. To make Asynchronous SMR scalable, this paper proposes a emph{validated\u0000strong} BFT consensus model that allows leader-based coordination in\u0000asynchronous settings. Our BFT consensus model offers the same level of\u0000tolerance as binary byzantine agreement but does not demand consistency among\u0000honest nodes before they vote. An SMR using our model allows nodes to operate\u0000in different, tentative, but mutually exclusive states until they eventually\u0000converge on the same state. We propose an asynchronous BFT protocol for\u0000vote-based blockchains employing our consensus model to address several\u0000critical challenges: how to ensure that nodes eventually converge on the same\u0000state across voting rounds, how to assure that a blockchain will steadily\u0000progress through epochs while reaching consensus for previous epochs, and how\u0000to maintain robust byzantine fault tolerance. Our protocol greatly reduces message complexity and is the first one to\u0000achieve linear view changes without relying on threshold signatures. We prove\u0000that an asynchronous blockchain built on our protocol can operate with the\u0000emph{same} simplicity and efficiency as partially synchronous blockchains\u0000built on, e.g. HotStuff-2. This facilitates deploying asynchronous blockchains\u0000across large-scale networks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deploying deep learning models on Internet of Things (IoT) devices often faces challenges due to limited memory resources and computing capabilities. Cooperative inference is an important method for addressing this issue, requiring the partitioning and distributed deployment of an intelligent model. To perform horizontal partitions, existing cooperative inference methods take either the output channel of operators or the height and width of feature maps as the partition dimensions. In this manner, since the activation of an operator is distributed, the pieces have to be concatenated before being fed to the next operator, which incurs delay in cooperative inference. In this paper, we propose the Interleaved Operator Partitioning (IOP) strategy for CNN models. By partitioning an operator based on the output channel dimension and its successive operator based on the input channel dimension, activation concatenation becomes unnecessary, thereby reducing the number of communication connections and, consequently, cooperative inference delay. Based on IOP, we further present a model segmentation algorithm for minimizing cooperative inference time, which greedily selects operators for IOP pairing based on the harvested inference delay benefit. Experimental results demonstrate that, compared with the state-of-the-art partition approaches used in CoEdge, the IOP strategy achieves 6.39%-16.83% faster inference and reduces peak memory footprint by 21.22%-49.98% for three classical image classification models.
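The interleaving trick can be verified in a few lines: if the first convolution is split along its output channels and the next one along its input channels, each device produces a partial result of the second convolution and only an element-wise sum is needed, with no activation concatenation. The layer shapes and two-device split below are illustrative assumptions.

```python
# Two consecutive convolutions partitioned in an interleaved fashion:
# conv1 is split by output channels, conv2 by input channels, so each "device"
# computes a partial conv2 result locally and the partials are simply summed.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 8, 3, padding=1)
conv2 = nn.Conv2d(8, 16, 3, padding=1)
x = torch.randn(1, 3, 32, 32)
reference = conv2(conv1(x))

partials = []
for dev, (lo, hi) in enumerate([(0, 4), (4, 8)]):   # device 0: channels 0-3, device 1: 4-7
    c1 = nn.Conv2d(3, hi - lo, 3, padding=1)
    c1.weight.data = conv1.weight.data[lo:hi]
    c1.bias.data = conv1.bias.data[lo:hi]
    c2 = nn.Conv2d(hi - lo, 16, 3, padding=1, bias=(dev == 0))  # add conv2's bias only once
    c2.weight.data = conv2.weight.data[:, lo:hi]
    if dev == 0:
        c2.bias.data = conv2.bias.data
    partials.append(c2(c1(x)))

assert torch.allclose(sum(partials), reference, atol=1e-5)
```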
{"title":"Cooperative Inference with Interleaved Operator Partitioning for CNNs","authors":"Zhibang Liu, Chaonong Xu, Zhizhuo Liu, Lekai Huang, Jiachen Wei, Chao Li","doi":"arxiv-2409.07693","DOIUrl":"https://doi.org/arxiv-2409.07693","url":null,"abstract":"Deploying deep learning models on Internet of Things (IoT) devices often\u0000faces challenges due to limited memory resources and computing capabilities.\u0000Cooperative inference is an important method for addressing this issue,\u0000requiring the partitioning and distributive deployment of an intelligent model.\u0000To perform horizontal partitions, existing cooperative inference methods take\u0000either the output channel of operators or the height and width of feature maps\u0000as the partition dimensions. In this manner, since the activation of operators\u0000is distributed, they have to be concatenated together before being fed to the\u0000next operator, which incurs the delay for cooperative inference. In this paper,\u0000we propose the Interleaved Operator Partitioning (IOP) strategy for CNN models.\u0000By partitioning an operator based on the output channel dimension and its\u0000successive operator based on the input channel dimension, activation\u0000concatenation becomes unnecessary, thereby reducing the number of communication\u0000connections, which consequently reduces cooperative inference de-lay. Based on\u0000IOP, we further present a model segmentation algorithm for minimizing\u0000cooperative inference time, which greedily selects operators for IOP pairing\u0000based on the inference delay benefit harvested. Experimental results\u0000demonstrate that compared with the state-of-the-art partition approaches used\u0000in CoEdge, the IOP strategy achieves 6.39% ~ 16.83% faster acceleration and\u0000reduces peak memory footprint by 21.22% ~ 49.98% for three classical image\u0000classification models.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"45 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data backup is a core technology for improving system resilience to failures. Data backup in enterprise systems is required to minimize the impact on business processing, which can be categorized into two factors: system slowdown and downtime. To eliminate system slowdown, asynchronous data copy (ADC) technology is prevalent, which copies data asynchronously with respect to the original data updates. However, ADC can corrupt backup data when applied to enterprise systems with multiple resources. Our demonstration system therefore employs consistency group technology, which keeps the order of data updates the same between the original and backup data. In addition, we developed a container platform operator to unravel the complicated correspondence between storage volumes and applications. The operator automates the configuration of the ADC together with the setting of consistency groups. We integrated the storage and container technologies into the demonstration system, which eliminates both system slowdown and downtime.
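The role of the consistency group can be illustrated with a toy sketch (not the demonstration system's implementation): all volumes in the group share one ordered update log, so the asynchronously maintained backup never reflects a later write without the earlier writes it depends on.

```python
# Toy illustration of a consistency group for asynchronous data copy (ADC):
# writes to any volume in the group go through a single ordered log, and the
# backup applies them in that same order.
import queue
import threading

class ConsistencyGroup:
    def __init__(self, volumes):
        self.log = queue.Queue()                  # one ordered log for the whole group
        self.backup = {v: {} for v in volumes}
        threading.Thread(target=self._apply, daemon=True).start()

    def write(self, volume, key, value):
        # The primary write returns immediately; the copy happens asynchronously.
        self.log.put((volume, key, value))

    def _apply(self):
        while True:
            volume, key, value = self.log.get()   # preserves the original update order
            self.backup[volume][key] = value

group = ConsistencyGroup(["db-data", "db-wal"])
group.write("db-wal", "txn42", "BEGIN")
group.write("db-data", "row7", "updated")
group.write("db-wal", "txn42", "COMMIT")
```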
{"title":"Data Backup System with No Impact on Business Processing Utilizing Storage and Container Technologies","authors":"Satoru Watanabe","doi":"arxiv-2409.07081","DOIUrl":"https://doi.org/arxiv-2409.07081","url":null,"abstract":"Data backup is a core technology for improving system resilience to system\u0000failures. Data backup in enterprise systems is required to minimize the impacts\u0000on business processing, which can be categorized into two factors: system\u0000slowdown and downtime. To eliminate system slowdown, asynchronous data copy\u0000(ADC) technology is prevalent, which copies data asynchronously with original\u0000data updates. However, the ADC can collapse backup data when applied to\u0000enterprise systems with multiple resources. Then, the demonstration system\u0000employed consistency group technology, which makes the order of data updates\u0000the same between the original and backup data. In addition, we developed a\u0000container platform operator to unravel the complicated correspondence between\u0000storage volumes and applications. The operator automates the configuration of\u0000the ADC with the setting of consistency groups. We integrated the storage and\u0000container technologies into the demonstration system, which can eliminate both\u0000system slowdown and downtime.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiashu Zhang, Zihan Pan, Molly Xu, Khuzaima Daudjee, Sihang Liu
The occurrence of bubbles in pipeline parallelism is an inherent limitation that can account for more than 40% of the large language model (LLM) training time and is one of the main reasons for the underutilization of GPU resources in LLM training. Harvesting these bubbles for GPU side tasks can increase resource utilization and reduce training costs but comes with challenges. First, because bubbles are discontinuous with various shapes, programming side tasks becomes difficult while requiring excessive engineering effort. Second, a side task can compete with pipeline training for GPU resources and incur significant overhead. To address these challenges, we propose FreeRide, a system designed to harvest bubbles in pipeline parallelism for side tasks. FreeRide provides programmers with interfaces to implement side tasks easily, manages bubbles and side tasks during pipeline training, and controls access to GPU resources by side tasks to reduce overhead. We demonstrate that FreeRide achieves 7.8% average cost savings with a negligible overhead of about 1% in training LLMs while serving model training, graph analytics, and image processing side tasks.
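The following sketch illustrates the kind of iterative side-task interface a bubble manager could drive, starting work when a bubble opens and stopping before the pipeline needs the GPU again; the class and method names are assumptions for illustration, not FreeRide's actual API.

```python
# Illustrative side-task interface driven by a bubble manager: the manager runs
# as many short, resumable side-task steps as fit inside a pipeline bubble.
# Names and the 10% headroom are assumptions, not FreeRide's real interfaces.
class SideTask:
    def setup(self):            # allocate model/buffers once
        raise NotImplementedError
    def step(self):             # one short, resumable unit of work
        raise NotImplementedError
    def teardown(self):
        raise NotImplementedError

class BubbleManager:
    def __init__(self, task: SideTask):
        self.task = task
        self.task.setup()

    def on_bubble(self, bubble_ms: float, step_cost_ms: float):
        # Run as many side-task steps as fit inside the bubble, leaving headroom
        # so the side task never delays the next pipeline stage.
        steps = int(bubble_ms * 0.9 // step_cost_ms)
        for _ in range(steps):
            self.task.step()
```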
{"title":"FreeRide: Harvesting Bubbles in Pipeline Parallelism","authors":"Jiashu ZhangYiming, Zihan PanYiming, MollyYiming, Xu, Khuzaima Daudjee, Sihang Liu","doi":"arxiv-2409.06941","DOIUrl":"https://doi.org/arxiv-2409.06941","url":null,"abstract":"The occurrence of bubbles in pipeline parallelism is an inherent limitation\u0000that can account for more than 40% of the large language model (LLM) training\u0000time and is one of the main reasons for the underutilization of GPU resources\u0000in LLM training. Harvesting these bubbles for GPU side tasks can increase\u0000resource utilization and reduce training costs but comes with challenges.\u0000First, because bubbles are discontinuous with various shapes, programming side\u0000tasks becomes difficult while requiring excessive engineering effort. Second, a\u0000side task can compete with pipeline training for GPU resources and incur\u0000significant overhead. To address these challenges, we propose FreeRide, a\u0000system designed to harvest bubbles in pipeline parallelism for side tasks.\u0000FreeRide provides programmers with interfaces to implement side tasks easily,\u0000manages bubbles and side tasks during pipeline training, and controls access to\u0000GPU resources by side tasks to reduce overhead. We demonstrate that FreeRide\u0000achieves 7.8% average cost savings with a negligible overhead of about 1% in\u0000training LLMs while serving model training, graph analytics, and image\u0000processing side tasks.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiaxang Tang, Zeshan Fayyaz, Mohammad A. Salahuddin, Raouf Boutaba, Zhi-Li Zhang, Ali Anwar
Federated Learning is a well-researched approach for collaboratively training machine learning models across decentralized data while preserving privacy. However, integrating Homomorphic Encryption to ensure data confidentiality introduces significant computational and communication overheads, particularly in heterogeneous environments where clients have varying computational capacities and security needs. In this paper, we propose HERL, a Reinforcement Learning-based approach that uses Q-Learning to dynamically optimize encryption parameters, specifically the polynomial modulus degree, $N$, and the coefficient modulus, $q$, across different client tiers. Our proposed method involves first profiling and tiering clients according to the chosen clustering approach, followed by dynamically selecting the most suitable encryption parameters using an RL-agent. Experimental results demonstrate that our approach significantly reduces the computational overhead while maintaining utility and a high level of security. Empirical results show that HERL improves utility by 17%, reduces the convergence time by up to 24%, and increases convergence efficiency by up to 30%, with minimal security loss.
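A minimal sketch of the tier-level Q-learning loop is given below; the discrete grid of (N, q) choices, the reward shaping, and the hyperparameters are illustrative assumptions, not HERL's actual configuration.

```python
# Epsilon-greedy Q-learning over a discrete grid of homomorphic-encryption
# parameters (polynomial modulus degree N, coefficient modulus bits q), with
# one agent per client tier. The grid and hyperparameters are illustrative.
import random
from collections import defaultdict

ACTIONS = [(8192, 200), (8192, 300), (16384, 300), (16384, 400), (32768, 600)]

class TierAgent:
    def __init__(self, eps=0.1, lr=0.5, gamma=0.9):
        self.q = defaultdict(float)           # Q-values keyed by (state, action index)
        self.eps, self.lr, self.gamma = eps, lr, gamma

    def act(self, state):
        if random.random() < self.eps:
            return random.randrange(len(ACTIONS))
        return max(range(len(ACTIONS)), key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in range(len(ACTIONS)))
        td = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.lr * td

# The reward could trade utility against overhead, e.g.
#   reward = utility_delta - lambda_ * round_latency
```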
{"title":"HERL: Tiered Federated Learning with Adaptive Homomorphic Encryption using Reinforcement Learning","authors":"Jiaxang Tang, Zeshan Fayyaz, Mohammad A. Salahuddin, Raouf Boutaba, Zhi-Li Zhang, Ali Anwar","doi":"arxiv-2409.07631","DOIUrl":"https://doi.org/arxiv-2409.07631","url":null,"abstract":"Federated Learning is a well-researched approach for collaboratively training\u0000machine learning models across decentralized data while preserving privacy.\u0000However, integrating Homomorphic Encryption to ensure data confidentiality\u0000introduces significant computational and communication overheads, particularly\u0000in heterogeneous environments where clients have varying computational\u0000capacities and security needs. In this paper, we propose HERL, a Reinforcement\u0000Learning-based approach that uses Q-Learning to dynamically optimize encryption\u0000parameters, specifically the polynomial modulus degree, $N$, and the\u0000coefficient modulus, $q$, across different client tiers. Our proposed method\u0000involves first profiling and tiering clients according to the chosen clustering\u0000approach, followed by dynamically selecting the most suitable encryption\u0000parameters using an RL-agent. Experimental results demonstrate that our\u0000approach significantly reduces the computational overhead while maintaining\u0000utility and a high level of security. Empirical results show that HERL improves\u0000utility by 17%, reduces the convergence time by up to 24%, and increases\u0000convergence efficiency by up to 30%, with minimal security loss.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chayanon (Namo) Wichitrnithed, Woo-Sun Yang, Yun (Helen) He, Brad Richardson, Koichi Sakaguchi, Manuel Arenaz, William I. Gustafson Jr., Jacob Shpund, Ulises Costi Blanco, Alvaro Goldar Dieste
Currently, the Weather Research and Forecasting model (WRF) utilizes shared memory (OpenMP) and distributed memory (MPI) parallelisms. To take advantage of GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM) microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives. To facilitate this process, we explore a workflow for optimization which uses both runtime profilers and a static code inspection tool Codee to refactor the subroutine. We observe a 2.08x overall speedup for the CONUS-12km thunderstorm test case.
{"title":"Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee","authors":"ChayanonNamo, WichitrnithedHelen, Woo-Sun-YangHelen, YunHelen, He, Brad Richardson, Koichi Sakaguchi, Manuel Arenaz, William I. Gustafson Jr., Jacob Shpund, Ulises Costi Blanco, Alvaro Goldar Dieste","doi":"arxiv-2409.07232","DOIUrl":"https://doi.org/arxiv-2409.07232","url":null,"abstract":"Currently, the Weather Research and Forecasting model (WRF) utilizes shared\u0000memory (OpenMP) and distributed memory (MPI) parallelisms. To take advantage of\u0000GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the\u0000computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM)\u0000microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives.\u0000To facilitate this process, we explore a workflow for optimization which uses\u0000both runtime profilers and a static code inspection tool Codee to refactor the\u0000subroutine. We observe a 2.08x overall speedup for the CONUS-12km thunderstorm\u0000test case.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142211061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pranav Rama, Madison Threadgill, Andreas Gerstlauer
The training of deep and/or convolutional neural networks (DNNs/CNNs) is traditionally done on servers with powerful CPUs and GPUs. Recent efforts have emerged to localize machine learning tasks fully on the edge. This brings advantages in reduced latency and increased privacy, but necessitates working with resource-constrained devices. Approaches for inference and training on mobile and edge devices based on pruning, quantization, or incremental and transfer learning require trading off accuracy. Several works have instead explored distributing inference operations on mobile and edge clusters. However, there is limited literature on distributed training on the edge. Existing approaches all require a central, potentially powerful edge or cloud server for coordination or offloading. In this paper, we describe an approach for distributed CNN training exclusively on mobile and edge devices. Our approach is beneficial for the initial CNN layers, which are feature-map dominated. It is based on partitioning forward-inference and back-propagation operations among devices through tiling and fusing to maximize locality and expose communication- and memory-aware parallelism. We also introduce the concept of layer grouping to further fine-tune performance based on the computation-communication trade-off. Results show that for a cluster of 2-6 quad-core Raspberry Pi 3 devices, training an object-detection CNN provides a 2x-15x speedup with respect to a single core and up to an 8x reduction in memory usage per device, all without sacrificing accuracy. Grouping offers up to 1.5x speedup depending on the reference profile and batch size.
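The tiling idea for the feature-map-dominated early layers can be checked directly: each device receives a horizontal tile of the input plus a one-row halo so a 3x3 convolution is computed entirely locally, and the output tiles are stacked back together. The layer, input size, and two-device split below are illustrative assumptions.

```python
# Spatial tiling of a convolution across two "devices": each device convolves
# its input tile (plus a one-row halo) locally; the output tiles concatenate to
# the full result. Zero-padding is applied only at the true image border.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(3, 16, 3, padding=1)
x = torch.randn(1, 3, 64, 64)
reference = conv(x)

halo = 1                                    # one extra input row per side for a 3x3 kernel
tiles = []
for lo, hi in [(0, 32), (32, 64)]:          # two devices, each owning 32 output rows
    in_lo, in_hi = max(lo - halo, 0), min(hi + halo, 64)
    pad_top = halo if lo == 0 else 0
    pad_bottom = halo if hi == 64 else 0
    tile = F.pad(x[:, :, in_lo:in_hi], (1, 1, pad_top, pad_bottom))
    tiles.append(F.conv2d(tile, conv.weight, conv.bias))

out = torch.cat(tiles, dim=2)
assert out.shape == reference.shape and torch.allclose(out, reference, atol=1e-5)
```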
{"title":"Distributed Convolutional Neural Network Training on Mobile and Edge Clusters","authors":"Pranav Rama, Madison Threadgill, Andreas Gerstlauer","doi":"arxiv-2409.09083","DOIUrl":"https://doi.org/arxiv-2409.09083","url":null,"abstract":"The training of deep and/or convolutional neural networks (DNNs/CNNs) is\u0000traditionally done on servers with powerful CPUs and GPUs. Recent efforts have\u0000emerged to localize machine learning tasks fully on the edge. This brings\u0000advantages in reduced latency and increased privacy, but necessitates working\u0000with resource-constrained devices. Approaches for inference and training in\u0000mobile and edge devices based on pruning, quantization or incremental and\u0000transfer learning require trading off accuracy. Several works have explored\u0000distributing inference operations on mobile and edge clusters instead. However,\u0000there is limited literature on distributed training on the edge. Existing\u0000approaches all require a central, potentially powerful edge or cloud server for\u0000coordination or offloading. In this paper, we describe an approach for\u0000distributed CNN training exclusively on mobile and edge devices. Our approach\u0000is beneficial for the initial CNN layers that are feature map dominated. It is\u0000based on partitioning forward inference and back-propagation operations among\u0000devices through tiling and fusing to maximize locality and expose communication\u0000and memory-aware parallelism. We also introduce the concept of layer grouping\u0000to further fine-tune performance based on computation and communication\u0000trade-off. Results show that for a cluster of 2-6 quad-core Raspberry Pi3\u0000devices, training of an object-detection CNN provides a 2x-15x speedup with\u0000respect to a single core and up to 8x reduction in memory usage per device, all\u0000without sacrificing accuracy. Grouping offers up to 1.5x speedup depending on\u0000the reference profile and batch size.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph Neural Network (GNN) models on streaming graphs entail algorithmic challenges to continuously capture the graph's dynamic state, as well as systems challenges to optimize latency, memory, and throughput during both inference and training. We present D3-GNN, the first distributed, hybrid-parallel, streaming GNN system designed to handle real-time graph updates under an online query setting. Our system addresses data management, algorithmic, and systems challenges, enabling continuous capture of the dynamic state of the graph and updating node representations with fault tolerance and optimal latency, load balance, and throughput. D3-GNN utilizes streaming GNN aggregators and an unrolled, distributed computation graph architecture to handle cascading graph updates. To counteract data skew and neighborhood explosion issues, we introduce inter-layer and intra-layer windowed forward pass solutions. Experiments on large-scale graph streams demonstrate that D3-GNN achieves high efficiency and scalability. Compared to DGL, D3-GNN achieves a significant throughput improvement of about 76x for streaming workloads. The windowed enhancement further reduces running times by around 10x and message volumes by up to 15x at higher parallelism.
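As a simplified illustration of streaming aggregation, the sketch below maintains an incremental mean aggregate per node and updates it in O(1) per arriving edge instead of recomputing neighborhoods; the mean aggregator and tensor shapes are illustrative choices, not D3-GNN's actual operators.

```python
# Incremental streaming aggregation: when edge (u, v) arrives, node v's running
# neighborhood aggregate is updated in place rather than recomputed.
import torch

class StreamingMeanAggregator:
    def __init__(self, num_nodes, dim):
        self.sum = torch.zeros(num_nodes, dim)
        self.count = torch.zeros(num_nodes, 1)

    def add_edge(self, u, v, features):
        # features[u] flows into v's aggregate; O(1) work per edge event.
        self.sum[v] += features[u]
        self.count[v] += 1

    def aggregate(self, v):
        return self.sum[v] / self.count[v].clamp(min=1)

feats = torch.randn(100, 16)
agg = StreamingMeanAggregator(100, 16)
agg.add_edge(3, 7, feats)
agg.add_edge(5, 7, feats)
print(agg.aggregate(7).shape)   # torch.Size([16])
```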
{"title":"D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks","authors":"Rustam Guliyev, Aparajita Haldar, Hakan Ferhatosmanoglu","doi":"arxiv-2409.09079","DOIUrl":"https://doi.org/arxiv-2409.09079","url":null,"abstract":"Graph Neural Network (GNN) models on streaming graphs entail algorithmic\u0000challenges to continuously capture its dynamic state, as well as systems\u0000challenges to optimize latency, memory, and throughput during both inference\u0000and training. We present D3-GNN, the first distributed, hybrid-parallel,\u0000streaming GNN system designed to handle real-time graph updates under online\u0000query setting. Our system addresses data management, algorithmic, and systems\u0000challenges, enabling continuous capturing of the dynamic state of the graph and\u0000updating node representations with fault-tolerance and optimal latency,\u0000load-balance, and throughput. D3-GNN utilizes streaming GNN aggregators and an\u0000unrolled, distributed computation graph architecture to handle cascading graph\u0000updates. To counteract data skew and neighborhood explosion issues, we\u0000introduce inter-layer and intra-layer windowed forward pass solutions.\u0000Experiments on large-scale graph streams demonstrate that D3-GNN achieves high\u0000efficiency and scalability. Compared to DGL, D3-GNN achieves a significant\u0000throughput improvement of about 76x for streaming workloads. The windowed\u0000enhancement further reduces running times by around 10x and message volumes by\u0000up to 15x at higher parallelism.","PeriodicalId":501422,"journal":{"name":"arXiv - CS - Distributed, Parallel, and Cluster Computing","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142263653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}