
Latest Publications in IEEE Transactions on Parallel and Distributed Systems

Efficient Schedule Construction for Distributed Execution of Large DNN Models
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-24 | DOI: 10.1109/TPDS.2024.3466913
Zhiqi Lin;Youshan Miao;Guanbin Xu;Cheng Li;Olli Saarikivi;Saeed Maleki;Fan Yang
Increasingly complex and diverse deep neural network (DNN) models necessitate distributing the execution across multiple devices for training and inference tasks, and also require carefully planned schedules for performance. However, existing practices often rely on predefined schedules that may not fully exploit the benefits of emerging diverse model-aware operator placement strategies. Handcrafting high-efficiency schedules can be challenging due to the large and varying schedule space. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference under diverse operator placement strategies. To reduce search costs, Tessel leverages the insight that the most efficient schedules often exhibit a repetitive pattern (repetend) across different data inputs. This leads to a two-phase approach: repetend construction and schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. Experiments with representative DNN models demonstrate that Tessel achieves up to 5.5× training speedup and up to 38% inference latency reduction.
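To make the two-phase idea concrete, here is a toy sketch (not Tessel's actual algorithm; the two-stage pipeline, device names, and step times are all hypothetical): phase 1 brute-forces a repeating block (repetend) over two micro-batches by searching their step interleavings, and phase 2 would tile that block across the remaining inputs.

```python
# Toy sketch of repetend search (assumed structure, not Tessel's algorithm).
# Two micro-batches flow through a hypothetical 2-stage pipeline (d0, d1);
# we search all order-preserving interleavings of their steps for the one
# with the smallest makespan, then that block would be tiled across inputs.
TIME = {"fwd": 1.0, "bwd": 2.0}  # hypothetical per-step latencies

def makespan(seq):
    """Simulate (device, kind, micro-batch) steps: steps on one device
    serialize, and each micro-batch's own steps form a dependency chain."""
    dev_free, chain_done = {}, {}
    for dev, kind, mb in seq:
        start = max(dev_free.get(dev, 0.0), chain_done.get(mb, 0.0))
        dev_free[dev] = chain_done[mb] = start + TIME[kind]
    return max(dev_free.values())

def interleavings(a, b):
    """Yield every merge of two step chains that keeps each chain's order."""
    if not a or not b:
        yield tuple(a) + tuple(b)
        return
    for rest in interleavings(a[1:], b):
        yield (a[0],) + rest
    for rest in interleavings(a, b[1:]):
        yield (b[0],) + rest

def chain(mb):  # one micro-batch on the 2-stage pipeline
    return [("d0", "fwd", mb), ("d1", "fwd", mb),
            ("d1", "bwd", mb), ("d0", "bwd", mb)]

# Phase 1: repetend construction over a window of two micro-batches.
repetend = min(interleavings(chain(0), chain(1)), key=makespan)
# Phase 2 (schedule completion) would tile `repetend` over later inputs.
print(makespan(repetend), repetend[:4])
```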
{"title":"Efficient Schedule Construction for Distributed Execution of Large DNN Models","authors":"Zhiqi Lin;Youshan Miao;Guanbin Xu;Cheng Li;Olli Saarikivi;Saeed Maleki;Fan Yang","doi":"10.1109/TPDS.2024.3466913","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3466913","url":null,"abstract":"Increasingly complex and diverse deep neural network (DNN) models necessitate distributing the execution across multiple devices for training and inference tasks, and also require carefully planned schedules for performance. However, existing practices often rely on predefined schedules that may not fully exploit the benefits of emerging diverse model-aware operator placement strategies. Handcrafting high-efficiency schedules can be challenging due to the large and varying schedule space. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference for diverse operator placement strategies. To reduce search costs, Tessel leverages the insight that the most efficient schedules often exhibit repetitive pattern (\u0000<italic>repetend</i>\u0000) across different data inputs. This leads to a two-phase approach: repetend construction and schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. Experiments with representative DNN models demonstrate that Tessel achieves up to 5.5× training performance speedup and up to 38% inference latency reduction.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2375-2391"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Two-Timescale Joint Optimization of Task Scheduling and Resource Scaling in Multi-Data Center System Based on Multi-Agent Deep Reinforcement Learning
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-24 | DOI: 10.1109/TPDS.2024.3467212
Shuangwu Chen;Jiangming Li;Qifeng Yuan;Huasen He;Sen Li;Jian Yang
As a new computing paradigm, multi-data center computing enables service providers to deploy their applications close to the users. However, due to the spatio-temporal changes in workloads, it is challenging to coordinate multiple distributed data centers to provide high-quality services while reducing service operation costs. To address this challenge, this article studies the joint optimization problem of task scheduling and resource scaling in multi-data center systems. Since the task scheduling and the resource scaling are usually performed in different timescales, we decompose the joint optimization problem into two sub-problems and propose a two-timescale optimization framework. The short-timescale task scheduling can promptly relieve the bursty arrivals of computing tasks, and the long-timescale resource scaling can adapt well to the long-term changes in workloads. To address the distributed optimization problem, we propose a two-timescale multi-agent deep reinforcement learning algorithm. In order to characterize the graph-structured states of connected data centers, we develop a directed graph convolutional network based global state representation model. The evaluation indicates that the proposed algorithm is able to reduce both the task makespan and the task timeout while maintaining a reasonable cost.
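As a structural illustration of the two-timescale decomposition (the paper's multi-agent DRL policies are replaced by stand-in heuristics; the period K, utilization thresholds, and data-center names below are assumptions), this sketch runs task scheduling every slot and resource scaling only every K slots:

```python
# Two-timescale loop with stand-in heuristics (not the paper's MARL agents).
K = 10  # hypothetical resource-scaling period, in scheduling slots

def fast_schedule(load, capacity):
    """Short timescale: route one task to the least-utilized data center."""
    return min(load, key=lambda dc: load[dc] / capacity[dc])

def slow_scale(load, capacity):
    """Long timescale: grow hot data centers, shrink idle ones."""
    for dc in capacity:
        util = load[dc] / capacity[dc]
        if util > 0.8:
            capacity[dc] += 1          # scale out under sustained pressure
        elif util < 0.3 and capacity[dc] > 1:
            capacity[dc] -= 1          # scale in to cut operation cost

load = {"dc1": 0, "dc2": 0, "dc3": 0}
capacity = {"dc1": 4, "dc2": 4, "dc3": 4}
for t in range(100):
    load[fast_schedule(load, capacity)] += 2              # bursty arrivals
    load = {dc: max(0, n - 1) for dc, n in load.items()}  # completions
    if (t + 1) % K == 0:
        slow_scale(load, capacity)
print(capacity)
```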
{"title":"Two-Timescale Joint Optimization of Task Scheduling and Resource Scaling in Multi-Data Center System Based on Multi-Agent Deep Reinforcement Learning","authors":"Shuangwu Chen;Jiangming Li;Qifeng Yuan;Huasen He;Sen Li;Jian Yang","doi":"10.1109/TPDS.2024.3467212","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3467212","url":null,"abstract":"As a new computing paradigm, multi-data center computing enables service providers to deploy their applications close to the users. However, due to the spatio-temporal changes in workloads, it is challenging to coordinate multiple distributed data centers to provide high-quality services while reducing service operation costs. To address this challenge, this article studies the joint optimization problem of task scheduling and resource scaling in multi-data center systems. Since the task scheduling and the resource scaling are usually performed in different timescales, we decompose the joint optimization problem into two sub-problems and propose a two-timescale optimization framework. The short-timescale task scheduling can promptly relieve the bursty arrivals of computing tasks, and the long-timescale resource scaling can adapt well to the long-term changes in workloads. To address the distributed optimization problem, we propose a two-timescale multi-agent deep reinforcement learning algorithm. In order to characterize the graph-structured states of connected data centers, we develop a directed graph convolutional network based global state representation model. The evaluation indicates that the proposed algorithm is able to reduce both the task makespan and the task timeout while maintaining a reasonable cost.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2331-2346"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-24 | DOI: 10.1109/TPDS.2024.3466891
Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna
The emergence of diverse machine learning (ML) models has led to groundbreaking revolutions in computer vision (CV). These ML models include convolutional neural networks (CNNs), graph neural networks (GNNs), and vision transformers (ViTs). However, existing hardware accelerators designed for CV lack the versatility to support various ML models, potentially limiting their applicability to real-world scenarios. To address this limitation, we introduce VisionAGILE, a domain-specific accelerator designed to be versatile and capable of accommodating a range of ML models, including CNNs, GNNs, and ViTs. VisionAGILE comprises a compiler, a runtime system, and a hardware accelerator. For the hardware accelerator, we develop a novel unified architecture with a flexible data path and memory organization to support the computation primitives in various ML models. Regarding the compiler design, we develop a unified compilation workflow that maps various ML models to the proposed hardware accelerator. The runtime system performs dynamic sparsity exploitation to reduce inference latency and dynamic task scheduling for workload balance. The compiler, the runtime system, and the hardware accelerator work synergistically to support a variety of ML models in CV, enabling low-latency inference. We deploy the hardware accelerator on a state-of-the-art data center FPGA (Xilinx Alveo U250). We evaluate VisionAGILE on diverse ML models for CV, including CNNs, GNNs, hybrid models (comprising both CNN and GNN), and ViTs. The experimental results indicate that, compared with state-of-the-art CPU (GPU) implementations, VisionAGILE achieves a latency speedup of 81.7× (4.8×). Evaluated on standalone CNNs, GNNs, and ViTs, VisionAGILE demonstrates performance comparable to or higher than state-of-the-art CNN accelerators, GNN accelerators, and ViT accelerators, respectively.
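The compiler's role can be illustrated with a minimal lowering table (entirely assumed for illustration; VisionAGILE's real primitive set and mapping are more involved): CNN, GNN, and ViT layers are lowered to a small set of shared primitives that one unified datapath can execute.

```python
# Assumed lowering table for illustration only (not VisionAGILE's compiler):
# heterogeneous layers map onto a few primitives a unified datapath runs.
LOWERING = {
    "conv2d":    ["dense_mm"],         # im2col turns conv into matmul
    "gnn_aggr":  ["sparse_mm"],        # neighbor aggregation is SpMM
    "attention": ["dense_mm", "softmax", "dense_mm"],  # QK^T, softmax, *V
}

def compile_model(layers):
    """Lower each layer to the primitive ops the accelerator executes."""
    program = []
    for layer in layers:
        ops = LOWERING.get(layer)
        if ops is None:
            raise ValueError(f"unsupported layer: {layer}")
        program.extend(ops)
    return program

# A hybrid CNN+GNN+ViT-style model compiles to one primitive stream.
print(compile_model(["conv2d", "gnn_aggr", "attention"]))
```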
{"title":"VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks","authors":"Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna","doi":"10.1109/TPDS.2024.3466891","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3466891","url":null,"abstract":"The emergence of diverse machine learning (ML) models has led to groundbreaking revolutions in computer vision (CV). These ML models include convolutional neural networks (CNNs), graph neural networks (GNNs), and vision transformers (ViTs). However, existing hardware accelerators designed for CV lack the versatility to support various ML models, potentially limiting their applicability to real-world scenarios. To address this limitation, we introduce VisionAGILE, a domain-specific accelerator designed to be versatile and capable of accommodating a range of ML models, including CNNs, GNNs, and ViTs. VisionAGILE comprises a compiler, a runtime system, and a hardware accelerator. For the hardware accelerator, we develop a novel unified architecture with a flexible data path and memory organization to support the computation primitives in various ML models. Regarding the compiler design, we develop a unified compilation workflow that maps various ML models to the proposed hardware accelerator. The runtime system executes dynamic sparsity exploitation to reduce inference latency and dynamic task scheduling for workload balance. The compiler, the runtime system, and the hardware accelerator work synergistically to support a variety of ML models in CV, enabling low-latency inference. We deploy the hardware accelerator on a state-of-the-art data center FPGA (Xilinx Alveo U250). We evaluate VisionAGILE on diverse ML models for CV, including CNNs, GNNs, hybrid models (comprising both CNN and GNN), and ViTs. The experimental results indicate that, compared with state-of-the-art CPU (GPU) implementations, VisionAGILE achieves a speedup of \u0000<inline-formula><tex-math>$81.7times$</tex-math></inline-formula>\u0000 (\u0000<inline-formula><tex-math>$4.8times$</tex-math></inline-formula>\u0000) in terms of latency. Evaluated on standalone CNNs, GNNs, and ViTs, VisionAGILE demonstrates comparable or higher performance with state-of-the-art CNN accelerators, GNN accelerators, and ViT accelerators, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2405-2422"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142447219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-17 | DOI: 10.1109/TPDS.2024.3462092
Yuyang Jin;Runxin Zhong;Saiqin Long;Jidong Zhai
Many artificial intelligence applications based on convolutional neural networks are directly deployed on mobile devices to avoid network unavailability and user privacy leakage. However, the significant increase in model parameter volumes makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compress models by reducing model parameters and computational operations, which also introduces irregular sparsity of neural networks, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, namely MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning the sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations for the alignment of non-zero weight count and vector width, to improve computational efficiency while ensuring accuracy. During the code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. Besides, convolution computations are reordered for alignment, and then mapped to parallel threads on accelerated units to achieve high parallelism. Experimental results using several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency, with no accuracy degradation.
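The alignment idea can be sketched as a toy (the vector width of 4 and the 2-nonzeros-per-group rule are assumptions; MCPruner's actual co-designed pruning is accuracy-aware): pruning so that every group of vector-width weights keeps the same non-zero count makes memory access and SIMD lane utilization regular.

```python
# Toy vector-width-aligned pruning (assumed parameters, not MCPruner's
# accuracy-aware method): every group of `vw` consecutive weights keeps
# exactly `k` non-zeros, so SIMD lanes see a regular sparsity pattern.
import numpy as np

def aligned_prune(weights, vw=4, k=2):
    groups = weights.copy().reshape(-1, vw)
    for row in groups:
        row[np.argsort(np.abs(row))[:vw - k]] = 0.0  # zero the smallest lanes
    return groups.reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=16).astype(np.float32)
print(aligned_prune(w))  # exactly 2 non-zeros in every 4-lane group
```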
{"title":"Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment","authors":"Yuyang Jin;Runxin Zhong;Saiqin Long;Jidong Zhai","doi":"10.1109/TPDS.2024.3462092","DOIUrl":"10.1109/TPDS.2024.3462092","url":null,"abstract":"Many artificial intelligence applications based on convolutional neural networks are directly deployed on mobile devices to avoid network unavailability and user privacy leakage. However, the significant increase in model parameter volumes makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compress models by reducing model parameters and computational operations, which also introduces irregular sparsity of neural networks, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, namely MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning the sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations for the alignment of non-zero weight count and vector width, to improve computational efficiency while ensuring accuracy. During the code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. Besides, convolution computations are reordered for alignment, and then mapped to parallel threads on accelerated units to achieve high parallelism. Experimental results using several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency, with no accuracy degradation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2208-2223"},"PeriodicalIF":5.6,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Freyr+: Harvesting Idle Resources in Serverless Computing via Deep Reinforcement Learning
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-17 | DOI: 10.1109/TPDS.2024.3462294
Hanfei Yu;Hao Wang;Jian Li;Xu Yuan;Seung-Jong Park
Serverless computing has revolutionized online service development and deployment with easy-to-use operations, auto-scaling, fine-grained resource allocation, and pay-as-you-go pricing. However, a gap remains in configuring serverless functions: the actual resource consumption may vary with function types, dependencies, and input data sizes, thus mismatching the static resource configuration set by users. Dynamic resource consumption under a static configuration may lead to either poor function execution performance or low utilization. This paper proposes Freyr+, a novel resource manager (RM) that dynamically harvests idle resources from over-provisioned functions to accelerate under-provisioned functions on serverless platforms. Freyr+ monitors each function's resource utilization in real time and detects mismatches between user configuration and actual resource consumption. We design deep reinforcement learning (DRL) algorithms with attention-enhanced embedding, incremental learning, and a safeguard mechanism for Freyr+ to harvest idle resources safely and accelerate functions efficiently. We have implemented and deployed a Freyr+ prototype in a 13-node Apache OpenWhisk cluster on AWS EC2. Freyr+ is evaluated on both large-scale simulation and a real-world testbed. Experimental results show that Freyr+ harvests 38% of function invocations' idle resources and accelerates 39% of invocations using the harvested resources. Freyr+ reduces the 99th-percentile function response latency by 26% compared to the baseline RMs.
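The harvest-and-accelerate mechanism can be sketched without the DRL policy (the margin, pool, and function names below are assumptions; Freyr+'s real decisions come from its learned agents plus the safeguard): surplus resources from over-provisioned invocations feed a pool that tops up under-provisioned ones, never cutting an invocation below its predicted need plus headroom.

```python
# Assumed harvest/accelerate sketch (not Freyr+'s DRL policy): a shared
# pool collects surplus CPU and tops up under-provisioned invocations.
pool = 0.0  # harvested-but-unused CPU cores

def adjust(configured, predicted_usage, margin=0.1):
    """Return the CPU actually granted to one invocation."""
    global pool
    safe_need = predicted_usage * (1 + margin)  # safeguard headroom
    if configured > safe_need:                  # over-provisioned: harvest
        pool += configured - safe_need
        return safe_need
    grant = min(safe_need - configured, pool)   # under-provisioned: boost
    pool -= grant
    return configured + grant

print(adjust(4.0, 1.0))  # user over-configured: ~2.9 cores harvested
print(adjust(1.0, 2.0))  # user under-configured: topped up from the pool
```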
{"title":"Freyr $^+$+: Harvesting Idle Resources in Serverless Computing via Deep Reinforcement Learning","authors":"Hanfei Yu;Hao Wang;Jian Li;Xu Yuan;Seung-Jong Park","doi":"10.1109/TPDS.2024.3462294","DOIUrl":"10.1109/TPDS.2024.3462294","url":null,"abstract":"Serverless computing has revolutionized online service development and deployment with ease-to-use operations, auto-scaling, fine-grained resource allocation, and pay-as-you-go pricing. However, a gap remains in configuring serverless functions—the actual resource consumption may vary due to function types, dependencies, and input data sizes, thus mismatching the static resource configuration by users. Dynamic resource consumption against static configuration may lead to either poor function execution performance or low utilization. This paper proposes \u0000<i>Freyr</i>\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000, a novel resource manager (RM) that dynamically harvests idle resources from over-provisioned functions to accelerate under-provisioned functions for serverless platforms. \u0000<i>Freyr</i>\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 monitors each function's resource utilization in real-time and detects the mismatches between user configuration and actual resource consumption. We design deep reinforcement learning (DRL) algorithms with attention-enhanced embedding, incremental learning, and safeguard mechanism for \u0000<i>Freyr</i>\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 to harvest idle resources safely and accelerate functions efficiently. We have implemented and deployed a \u0000<i>Freyr</i>\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 prototype in a 13-node Apache OpenWhisk cluster using AWS EC2. \u0000<i>Freyr</i>\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 is evaluated on both large-scale simulation and real-world testbed. Experimental results show that \u0000<i>Freyr</i>\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 harvests 38% of function invocations’ idle resources and accelerates 39% of invocations using harvested resources. \u0000<i>Freyr</i>\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 reduces the 99th-percentile function response latency by 26% compared to the baseline RMs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2254-2269"},"PeriodicalIF":5.6,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient Cross-Cloud Partial Reduce With CREW
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-13 | DOI: 10.1109/TPDS.2024.3460185
Shouxi Luo;Renyi Wang;Ke Li;Huanlai Xing
By allowing p out of n workers to conduct all-reduce operations among themselves for a round of synchronization, partial reduce, a promising partially-asynchronous variant of all-reduce, has shown its power in alleviating the impact of stragglers in iterative distributed machine learning (DML). Current partial reduce solutions are mainly designed for intra-cluster DML, in which workers are networked with high-bandwidth LAN links. Yet no prior work has looked into how to achieve efficient partial reduce for cross-cloud DML, where inter-worker connections have scarce capacity. To fill the gap, in this paper we propose CREW, a flexible and efficient implementation of partial reduce for cross-cloud DML. At the high level, CREW is built upon the novel design of employing all active workers along with their internal connection capacities to execute the involved communication and computation tasks; at the low level, CREW employs a suite of algorithms to distribute the tasks among workers in a load-balanced way and to deal with possible outages of workers/connections and bandwidth contention. Detailed performance studies confirm that CREW not only shortens the execution of each partial reduce operation, greatly outperforming existing communication schemes such as PS, Ring, TopoAdopt, and BLINK, but also significantly accelerates the training of large models, by up to 15× and 9× compared with the all-to-all direct communication scheme and the original partial reduce design, respectively.
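The p-out-of-n semantics themselves are easy to state (this sketch covers only the synchronization rule; CREW's contribution is how the reduce tasks are spread over workers and scarce cross-cloud links): each round reduces the updates of whichever p workers finish first, so stragglers do not block the round.

```python
# Sketch of p-out-of-n partial reduce semantics only (CREW's task
# placement and bandwidth-aware scheduling are not modeled here).
import random

def partial_reduce(updates, p):
    """updates: (finish_time, grad) per worker; average the p earliest."""
    fastest = sorted(updates, key=lambda u: u[0])[:p]
    return sum(g for _, g in fastest) / p

random.seed(1)
round_updates = [(random.random(), random.gauss(0.0, 1.0)) for _ in range(8)]
print(partial_reduce(round_updates, p=5))  # stragglers beyond p are ignored
```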
{"title":"Efficient Cross-Cloud Partial Reduce With CREW","authors":"Shouxi Luo;Renyi Wang;Ke Li;Huanlai Xing","doi":"10.1109/TPDS.2024.3460185","DOIUrl":"10.1109/TPDS.2024.3460185","url":null,"abstract":"By allowing \u0000<inline-formula><tex-math>$p$</tex-math></inline-formula>\u0000 out of \u0000<inline-formula><tex-math>$n$</tex-math></inline-formula>\u0000 workers to conduct \u0000<i>all reduce</i>\u0000 operations among them for a round of synchronization, \u0000<i>partial reduce</i>\u0000, a promising partially-asynchronous variant of \u0000<i>all reduce</i>\u0000, has shown its power in alleviating the impacts of stragglers for iterative distributed machine learning (DML). Current \u0000<i>partial reduce</i>\u0000 solutions are mainly designed for intra-cluster DML, in which workers are networked with high-bandwidth LAN links. Yet no prior work has looked into the problem of how to achieve efficient \u0000<i>partial reduce</i>\u0000 for cross-cloud DML, where inter-worker connections are with scarcely-available capacities. To fill the gap, in this paper, we propose \u0000<small>CREW</small>\u0000, a flexible and efficient implementation of \u0000<i>partial reduce</i>\u0000 for cross-cloud DML. At the high level, \u0000<small>CREW</small>\u0000 is built upon the novel design of employing all active workers along with their internal connection capacities to execute the involved communication and computation tasks; and at the low level, \u0000<small>CREW</small>\u0000 employs a suite of algorithms to distribute the tasks among workers in a load-balanced way, and deal with possible outages of workers/connections, and bandwidth contention. Detailed performance studies confirm that, \u0000<small>CREW</small>\u0000 not only shortens the execution of each \u0000<i>partial reduce</i>\u0000 operation, outperforming existing communication schemes such as PS, Ring, \u0000<small>TopoAdopt</small>\u0000, and BLINK greatly, but also significantly accelerates the training of large models, up to \u0000<inline-formula><tex-math>$15times$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$9times$</tex-math></inline-formula>\u0000, respectively, when compared with the all-to-all direct communication scheme and \u0000<i>original partial reduce</i>\u0000 design.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2224-2238"},"PeriodicalIF":5.6,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An Evaluation Framework for Dynamic Thermal Management Strategies in 3D MultiProcessor System-on-Chip Co-Design
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-12 | DOI: 10.1109/TPDS.2024.3459414
Darong Huang;Luis Costero;David Atienza
Dynamic thermal management (DTM) has been widely adopted to improve the energy efficiency, reliability, and performance of modern Multi-Processor SoCs (MPSoCs). However, evolving industry trends and heterogeneous architecture designs have introduced significant challenges for state-of-the-art DTM methods. Specifically, the emergence of heterogeneous designs has led to more localized and non-uniform hotspots, necessitating accurate and responsive DTM strategies. Additionally, the increased number of cores to be managed requires the DTM to optimize and coordinate the whole system. However, existing methodologies fall short in both precise thermal modeling of localized hotspots and fast architecture simulation. To tackle these challenges, we first introduce the latest version of 3D-ICE, 3.1, with a novel non-uniform thermal modeling technique that supports customized discretization levels of the thermal grids. 3D-ICE 3.1 improves the accuracy of thermal analysis and reduces simulation overhead. Then, in conjunction with an efficient and fast offline application profiling strategy utilizing the architecture simulator gem5-X, we propose a novel DTM evaluation framework. This framework enables us to explore novel DTM methods that optimize the energy efficiency, reliability, and performance of contemporary 3D MPSoCs. The experimental results demonstrate that 3D-ICE 3.1 achieves high accuracy, with a mean temperature error of only 0.3 K. We then evaluate various DTM methods and propose a Multi-Agent Reinforcement Learning (MARL) control to address the demanding thermal challenges of 3D MPSoCs. Our experimental results show that the proposed MARL-based DTM method can reduce power consumption by 13% while maintaining a performance level similar to the comparison methods.
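The non-uniform discretization idea can be shown with a 1-D toy (an explicit finite-difference step under assumed units and coefficients, far simpler than 3D-ICE's 3-D solver): cells near a hotspot are made finer than the rest of the die, so the local gradient is resolved without refining the whole grid.

```python
# Toy 1-D heat step on a non-uniform grid (assumed units/coefficients;
# 3D-ICE solves the full 3-D problem): finer cells resolve the hotspot.
def heat_step(T, dx, power, k=1.0, dt=1e-3):
    """One explicit finite-difference step; dx[i] is cell i's width."""
    Tn = T[:]
    for i in range(1, len(T) - 1):
        left = (T[i - 1] - T[i]) / ((dx[i - 1] + dx[i]) / 2)
        right = (T[i + 1] - T[i]) / ((dx[i] + dx[i + 1]) / 2)
        Tn[i] = T[i] + dt * (k * (left + right) / dx[i] + power[i])
    return Tn

dx = [1.0] * 4 + [0.25] * 8 + [1.0] * 4     # fine cells around the hotspot
T = [300.0] * len(dx)                       # ambient start, in kelvin
power = [0.0] * 6 + [50.0] * 4 + [0.0] * 6  # localized heat source
for _ in range(1000):
    T = heat_step(T, dx, power)
print(f"peak temperature: {max(T):.1f} K")
```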
{"title":"An Evaluation Framework for Dynamic Thermal Management Strategies in 3D MultiProcessor System-on-Chip Co-Design","authors":"Darong Huang;Luis Costero;David Atienza","doi":"10.1109/TPDS.2024.3459414","DOIUrl":"10.1109/TPDS.2024.3459414","url":null,"abstract":"Dynamic thermal management (DTM) has been widely adopted to improve the energy efficiency, reliability, and performance of modern Multi-Processor SoCs (MPSoCs). However, the evolving industry trends and heterogeneous architecture designs have introduced significant challenges in state-of-the-art DTM methods. Specifically, the emergence of heterogeneous design has led to increased localized and non-uniform hotspots, necessitating accurate and responsive DTM strategies. Additionally, the increased number of cores to be managed requires the DTM to optimize and coordinate the whole system. However, existing methodologies fail in both precise thermal modeling in localized hotspots and fast architecture simulation. To tackle these existing challenges, we first introduce the latest version of 3D-ICE 3.1, with a novel non-uniform thermal modeling technique to support customized discretization levels of thermal grids. 3D-ICE 3.1 improves the accuracy of thermal analysis and reduces simulation overhead. Then, in conjunction with an efficient and fast offline application profiling strategy utilizing the architecture simulator gem5-X, we propose a novel DTM evaluation framework. This framework enables us to explore novel DTM methods to optimize the energy efficiency, reliability, and performance of contemporary 3D MPSoCs. The experimental results demonstrate that 3D-ICE 3.1 achieves high accuracy, with only 0.3K mean temperature error. Subsequently, we evaluate various DTM methods and propose a Multi-Agent Reinforcement Learning (MARL) control to address the demanding thermal challenges of 3D MPSoCs. Our experimental results show that the proposed DTM method based on MARL can reduce power consumption by 13% while maintaining a similar performance level to the comparison methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2161-2176"},"PeriodicalIF":5.6,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-12 | DOI: 10.1109/TPDS.2024.3459889
Hui Dou;Yilun Wang;Yiwen Zhang;Pengfei Chen;Zibin Zheng
Big data frameworks usually expose a large number of performance-related parameters. Online auto-tuning of these parameters based on deep reinforcement learning (DRL) has shown advantages over search-based and machine learning-based approaches. Unfortunately, the time cost of the online tuning phase of conventional DRL-based methods is still heavy, especially for Big Data applications. Therefore, in this paper, we propose DeepCAT+, a low-cost and transferrable deep reinforcement learning-based approach to online configuration auto-tuning for Big Data frameworks. To reduce the total online tuning cost and increase adaptability: 1) DeepCAT+ utilizes the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) DeepCAT+ modifies conventional experience replay to fully utilize rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) DeepCAT+ designs a Twin-Q Optimizer to estimate the execution time of each action without costly configuration evaluations and to optimize the sub-optimal ones, achieving a low-cost exploration-exploitation tradeoff; 4) furthermore, DeepCAT+ implements an Online Continual Learner module based on Progressive Neural Networks to transfer knowledge from historical tuning experiences. Experimental results based on a lab Spark cluster with HiBench benchmark applications show that DeepCAT+ is able to speed up the best execution time by factors of 1.49×, 1.63×, and 1.65× on average over the baselines, while consuming up to 50.08%, 53.39%, and 70.79% less total tuning time. In addition, DeepCAT+ adapts well to the time-varying environments of Big Data frameworks.
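Of the listed ingredients, the reward-driven prioritized replay is the simplest to sketch (the priority shaping below is an assumption; DeepCAT+'s TD3 agent, Twin-Q Optimizer, and continual learner are omitted): transitions with rarer, higher rewards get proportionally higher sampling probability.

```python
# Assumed sketch of reward-driven prioritized replay (TD3, the Twin-Q
# Optimizer, and the continual learner are omitted): sampling probability
# grows with a transition's reward, so rare good configs are replayed more.
import random

class RewardPrioritizedReplay:
    def __init__(self, eps=1e-2):
        self.buffer, self.eps = [], eps  # eps keeps zero-reward items alive

    def add(self, transition, reward):
        self.buffer.append((transition, max(reward, 0.0) + self.eps))

    def sample(self, k):
        transitions, priorities = zip(*self.buffer)
        return random.choices(transitions, weights=priorities, k=k)

random.seed(0)
rb = RewardPrioritizedReplay()
for i in range(100):
    rb.add((f"config{i}", f"perf{i}"), reward=random.random() ** 3)
print(rb.sample(3))  # skewed toward the rare high-reward transitions
```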
{"title":"DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks","authors":"Hui Dou;Yilun Wang;Yiwen Zhang;Pengfei Chen;Zibin Zheng","doi":"10.1109/TPDS.2024.3459889","DOIUrl":"10.1109/TPDS.2024.3459889","url":null,"abstract":"Big data frameworks usually provide a large number of performance-related parameters. Online auto-tuning these parameters based on deep reinforcement learning (DRL) to achieve a better performance has shown their advantages over search-based and machine learning-based approaches. Unfortunately, the time cost during the online tuning phase of conventional DRL-based methods is still heavy, especially for Big Data applications. Therefore, in this paper, we propose DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000, a low-cost and transferrable deep reinforcement learning-based approach to achieve online configuration auto-tuning for Big Data frameworks. To reduce the total online tuning cost and increase the adaptability: 1) DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 utilizes the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 modifies the conventional experience replay to fully utilize the rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 designs a Twin-Q Optimizer to estimate the execution time of each action without the costly configuration evaluation and optimize the sub-optimal ones to achieve a low-cost exploration-exploitation tradeoff; 4) Furthermore, DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 also implements an Online Continual Learner module based on Progressive Neural Networks to transfer knowledge from historical tuning experiences. Experimental results based on a lab Spark cluster with HiBench benchmark applications show that DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 is able to speed up the best execution time by a factor of 1.49×, 1.63× and 1.65× on average respectively over the baselines, while consuming up to 50.08%, 53.39% and 70.79% less total tuning time. In addition, DeepCAT\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 also has a strong adaptability to the time-varying environment of Big Data frameworks.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2114-2131"},"PeriodicalIF":5.6,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Gamora: Learning-Based Buffer-Aware Preloading for Adaptive Short Video Streaming
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-09 | DOI: 10.1109/TPDS.2024.3456567
Biao Hou;Song Yang;Fan Li;Liehuang Zhu;Lei Jiao;Xu Chen;Xiaoming Fu
Emerging short video streaming applications have gained substantial attention. With the rapidly burgeoning demand for short video streaming services, maximizing their Quality of Experience (QoE) is an onerous challenge. Current video preloading algorithms cannot make appropriate preloading-sequence decisions under users' swipes and bandwidth fluctuations. As a result, it remains unclear how to improve the overall QoE while mitigating bandwidth wastage in short video streaming services. In this article, we devise Gamora, a buffer-aware short video streaming system that provides users with a high QoE. In Gamora, we first propose an unordered preloading algorithm that utilizes a Deep Reinforcement Learning (DRL) algorithm to make video preloading decisions. Then, we further devise an Asymmetric Imitation Learning (AIL) algorithm to guide the DRL-based preloading algorithm, which enables the agent to learn from expert demonstrations for fast convergence. Finally, we implement our proposed short video streaming system prototype and evaluate the performance of Gamora on various real-world network datasets. Our results demonstrate that Gamora achieves a QoE improvement of 28.7%-51.4% compared to state-of-the-art algorithms, while mitigating bandwidth wastage by 40.7%-83.2% without sacrificing video quality.
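A stand-in heuristic shows the shape of the preloading decision (Gamora learns this with DRL guided by AIL; the scoring rule, target buffer, and probabilities below are assumptions): candidates are scored by the chance the user swipes to them times their buffer deficit, which is what makes the preloading order "unordered" rather than strictly sequential.

```python
# Stand-in scoring heuristic (not Gamora's learned policy): preload the
# video whose expected-view probability times buffer deficit is largest.
def pick_preload(buffers, view_prob, target=5.0):
    """buffers: seconds buffered per video; view_prob: P(user reaches it)."""
    def score(v):
        return view_prob[v] * max(0.0, target - buffers[v])
    best = max(buffers, key=score)
    return best if score(best) > 0 else None  # None: every buffer is full

buffers = {"current": 4.0, "next1": 1.0, "next2": 0.0}
view_prob = {"current": 1.0, "next1": 0.7, "next2": 0.4}
print(pick_preload(buffers, view_prob))  # -> "next1"
```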
{"title":"Gamora: Learning-Based Buffer-Aware Preloading for Adaptive Short Video Streaming","authors":"Biao Hou;Song Yang;Fan Li;Liehuang Zhu;Lei Jiao;Xu Chen;Xiaoming Fu","doi":"10.1109/TPDS.2024.3456567","DOIUrl":"10.1109/TPDS.2024.3456567","url":null,"abstract":"Nowadays, the emerging short video streaming applications have gained substantial attention. With the rapidly burgeoning demand for short video streaming services, maximizing their Quality of Experience (QoE) is an onerous challenge. Current video preloading algorithms cannot determine video preloading sequence decisions appropriately due to the impact of users’ swipes and bandwidth fluctuations. As a result, it is still ambiguous how to improve the overall QoE while mitigating bandwidth wastage to optimize short video streaming services. In this article, we devise Gamora, a buffer-aware short video streaming system to provide a high QoE of users. In Gamora, we first propose an unordered preloading algorithm that utilizes a Deep Reinforcement Learning (DRL) algorithm to make video preloading decisions. Then, we further devise an Asymmetric Imitation Learning (AIL) algorithm to guide the DRL-based preloading algorithm, which enables the agent to learn from expert demonstrations for fast convergence. Finally, we implement our proposed short video streaming system prototype and evaluate the performance of Gamora on various real-world network datasets. Our results demonstrate that Gamora significantly achieves QoE improvement by 28.7%–51.4% compared to state-of-the-art algorithms, while mitigating bandwidth wastage by 40.7%–83.2% without sacrificing video quality.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2132-2146"},"PeriodicalIF":5.6,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Trusted Model Aggregation With Zero-Knowledge Proofs in Federated Learning
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-09-06 | DOI: 10.1109/TPDS.2024.3455762
Renwen Ma;Kai Hwang;Mo Li;Yiming Miao
This paper proposes a new global model aggregation method based on zero-knowledge federated learning (ZKFL). The purpose is to secure horizontal or P2P federated machine learning systems with shorter aggregation times, higher model accuracy, and lower system costs. We use a model parameter-sharing Chord overlay network among all client hosts. The overlay guarantees trusted sharing of zero-knowledge proofs for aggregation integrity, even under malicious Byzantine attacks. We tested on the popular Fashion-MNIST and CIFAR10 datasets to prove the new system protection concept. Our benchmark experiments validate the claimed advantages of the ZKFL scheme on all objective functions. Our aggregation method can be applied to secure both rank-based and similarity-based aggregation schemes. For a large system with over 200 clients, our system takes only 3 seconds to yield high-precision global models under ALIE attacks with the Fashion-MNIST dataset. We achieve up to 85% model accuracy, compared to only 3%-45% accuracy observed with unprotected federated schemes. Moreover, our method demands a low memory overhead for handling zero-knowledge proofs as the system scales to a larger number of client nodes.
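The verify-then-aggregate flow can be sketched with hash commitments standing in for the proofs (a loud simplification: real ZKFL shares actual zero-knowledge proofs over the Chord overlay, which plain hashes are not): peers only fold in updates whose shared commitment verifies, so a tampered contribution is excluded from the aggregate.

```python
# Hash commitments as stand-ins for zero-knowledge proofs (a deliberate
# simplification; ZKFL's real proofs reveal nothing about the updates):
# only updates whose shared commitment verifies enter the aggregate.
import hashlib

def commit(update):
    return hashlib.sha256(repr(update).encode()).hexdigest()

def aggregate(updates, commitments):
    """Average only the updates whose commitments check out."""
    verified = [u for u, c in zip(updates, commitments) if commit(u) == c]
    if not verified:
        raise ValueError("no verifiable updates this round")
    return sum(verified) / len(verified)

updates = [1.0, 2.0, 3.0]
shared = [commit(u) for u in updates]
shared[1] = "forged"               # a Byzantine peer's bogus commitment
print(aggregate(updates, shared))  # -> 2.0, averaging the verified two
```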
{"title":"Trusted Model Aggregation With Zero-Knowledge Proofs in Federated Learning","authors":"Renwen Ma;Kai Hwang;Mo Li;Yiming Miao","doi":"10.1109/TPDS.2024.3455762","DOIUrl":"10.1109/TPDS.2024.3455762","url":null,"abstract":"This paper proposes a new global model aggregation method based on using zero-knowledge federated learning (ZKFL). The purpose is to secure horizontal or P2P federated machine learning systems with shorter aggregation times, higher model accuracy, and lower system costs. We use a model parameter-sharing Chord overlay network among all client hosts. The overlay guarantees a trusted sharing of zero-knowledge proofs for aggregation integrity, even under malicious Byzantine attacks. We tested over popular datasets, Fashion-MNIST and CIFAR10, to prove the new system protection concept. Our benchmark experiments validate the claimed advantages of the ZKFL scheme in all objective functions. Our aggregation method can be applied to secure both rank-based and similarity-based aggregation schemes. For a large system with over 200 clients, our system takes only 3 seconds to yield high-precision global machine models under the ALIE attacks with the Fashion-MNIST dataset. We have achieved up to 85% model accuracy, compared to only 3%\u0000<inline-formula><tex-math>$sim$</tex-math></inline-formula>\u000045% accuracy observed with federated schemes without protection. Moreover, our method demands a low memory overhead for handling zero-knowledge proofs as the system scales greatly to a larger number of client nodes.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2284-2296"},"PeriodicalIF":5.6,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0