IEEE Transactions on Parallel and Distributed Systems最新文献_第5页

Distributed Task Processing Platform for Infrastructure-Less IoT Networks: A Multi-Dimensional Optimization Approach 无基础设施物联网网络的分布式任务处理平台：多维优化方法

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-27 DOI: 10.1109/TPDS.2024.3469545

Qiushi Zheng;Jiong Jin;Zhishu Shen;Libing Wu;Iftekhar Ahmad;Yong Xiang

With the rapid development of artificial intelligence (AI) and the Internet of Things (IoT), intelligent information services have showcased unprecedented capabilities in acquiring and analysing information. The conventional task processing platforms rely on centralised Cloud processing, which encounters challenges in infrastructure-less environments with unstable or disrupted electrical grids and cellular networks. These challenges hinder the deployment of intelligent information services in such environments. To address these challenges, we propose a distributed task processing platform (

${DTPP}$

) designed to provide satisfactory performance for executing computationally intensive applications in infrastructure-less environments. This platform leverages numerous distributed homogeneous nodes to process the arriving task locally or collaboratively. Based on this platform, a distributed task allocation algorithm is developed to achieve high task processing performance with limited energy and bandwidth resources. To validate our approach,

${DTPP}$

has been tested in an experimental environment utilising real-world experimental data to simulate IoT network services in infrastructure-less environments. Extensive experiments demonstrate that our proposed solution surpasses comparative algorithms in key performance metrics, including task processing ratio, task processing accuracy, algorithm processing time, and energy consumption.

随着人工智能（AI）和物联网（IoT）的快速发展，智能信息服务在获取和分析信息方面展现出前所未有的能力。传统的任务处理平台依赖于集中式云处理，这在电网和蜂窝网络不稳定或中断的无基础设施环境中遇到了挑战。这些挑战阻碍了智能信息服务在此类环境中的部署。为应对这些挑战，我们提出了分布式任务处理平台（{DTPP}$），旨在为在无基础设施环境中执行计算密集型应用提供令人满意的性能。该平台利用众多分布式同构节点，对到达的任务进行本地或协作处理。基于该平台，我们开发了一种分布式任务分配算法，以便在能源和带宽资源有限的情况下实现较高的任务处理性能。为了验证我们的方法，我们在实验环境中测试了 ${DTPP}$，利用真实世界的实验数据来模拟无基础设施环境中的物联网网络服务。广泛的实验证明，我们提出的解决方案在关键性能指标（包括任务处理率、任务处理准确性、算法处理时间和能耗）上超越了同类算法。

{"title":"Distributed Task Processing Platform for Infrastructure-Less IoT Networks: A Multi-Dimensional Optimization Approach","authors":"Qiushi Zheng;Jiong Jin;Zhishu Shen;Libing Wu;Iftekhar Ahmad;Yong Xiang","doi":"10.1109/TPDS.2024.3469545","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3469545","url":null,"abstract":"With the rapid development of artificial intelligence (AI) and the Internet of Things (IoT), intelligent information services have showcased unprecedented capabilities in acquiring and analysing information. The conventional task processing platforms rely on centralised Cloud processing, which encounters challenges in infrastructure-less environments with unstable or disrupted electrical grids and cellular networks. These challenges hinder the deployment of intelligent information services in such environments. To address these challenges, we propose a distributed task processing platform (\u0000<inline-formula><tex-math>${DTPP}$</tex-math></inline-formula>\u0000) designed to provide satisfactory performance for executing computationally intensive applications in infrastructure-less environments. This platform leverages numerous distributed homogeneous nodes to process the arriving task locally or collaboratively. Based on this platform, a distributed task allocation algorithm is developed to achieve high task processing performance with limited energy and bandwidth resources. To validate our approach, \u0000<inline-formula><tex-math>${DTPP}$</tex-math></inline-formula>\u0000 has been tested in an experimental environment utilising real-world experimental data to simulate IoT network services in infrastructure-less environments. Extensive experiments demonstrate that our proposed solution surpasses comparative algorithms in key performance metrics, including task processing ratio, task processing accuracy, algorithm processing time, and energy consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2392-2404"},"PeriodicalIF":5.6,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

GeoDeploy: Geo-Distributed Application Deployment Using Benchmarking GeoDeploy：利用基准测试进行地理分布式应用部署

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-27 DOI: 10.1109/TPDS.2024.3470532

Devki Nandan Jha;Yinhao Li;Zhenyu Wen;Graham Morgan;Prem Prakash Jayaraman;Maciej Koutny;Omer F. Rana;Rajiv Ranjan

Geo-distributed web-applications (GWA) can be deployed across multiple geographically separated datacenters to reduce the latency of access for users. Finding a suitable deployment for a GWA is challenging due to the requirement to consider a number of different parameters, such as host configurations across a federated infrastructure. The ability to evaluate multiple deployment configurations enables an efficient outcome to be determined, balancing resource usage while satisfying user requirements. We propose GeoDeploy, a framework designed for finding a deployment solution for GWA. We evaluate GeoDeploy using both a formal algorithmic model and a practical cloud-based deployment. We also compare our approach with other existing techniques.

地理分布式网络应用程序（GWA）可以部署在多个地理位置分离的数据中心，以减少用户访问的延迟。由于需要考虑许多不同的参数，例如联合基础设施中的主机配置，因此为 GWA 找到合适的部署方式具有挑战性。评估多种部署配置的能力可以确定有效的结果，在满足用户需求的同时平衡资源使用。我们提出的 GeoDeploy 是一个旨在为 GWA 寻找部署解决方案的框架。我们使用正式算法模型和基于云的实际部署对 GeoDeploy 进行了评估。我们还将我们的方法与其他现有技术进行了比较。

引用次数: 0

Efficient Distributed Edge Computing for Dependent Delay-Sensitive Tasks in Multi-Operator Multi-Access Networks 在多运营商多接入网络中针对依赖性延迟敏感任务的高效分布式边缘计算

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-26 DOI: 10.1109/TPDS.2024.3468892

Alia Asheralieva;Dusit Niyato;Xuetao Wei

We study the problem of distributed computing in the multi-operator multi-access edge computing (MEC) network for dependent tasks. Every task comprises several sub-tasks which are executed based on logical precedence modelled as a directed acyclic graph. In the graph, each vertex is a sub-task, each edge – precedence constraint, such that a sub-task can only be started after all its preceding sub-tasks are completed. Tasks are executed by MEC servers with the assistance of nearby edge devices, so that the MEC network can be viewed as a distributed “primary-secondary node” system where each MEC server acts as a primary node (PN) deciding on sub-tasks assigned to its secondary nodes (SNs), i.e., nearby edge devices. The PN's decision problem is complex, as its SNs can be associated with other neighboring PNs. In this case, the available processing resources of SNs depend on the sub-task assignment decisions of all neighboring PNs. Since PNs are controlled by different operators, they do not coordinate their decisions, and each PN is uncertain about the sub-task assignments of its neighbors (and, thus, the available resources of its SNs). To address this problem, we propose a novel framework based on a graphical Bayesian game, where PNs play under uncertainty about their neighbors’ decisions. We prove that the game has a perfect Bayesian equilibrium (PBE) yielding unique optimal values, and formulate new Bayesian reinforcement learning and Bayesian deep reinforcement learning algorithms enabling each PN to reach the PBE autonomously (without communicating with other PNs).

我们研究了多操作员多访问边缘计算（MEC）网络中依赖任务的分布式计算问题。每个任务都由若干个子任务组成，这些子任务根据逻辑优先级执行，被模拟为有向无环图。在该图中，每个顶点都是一个子任务，每条边都是优先级约束，只有在前面所有子任务都完成后，才能启动一个子任务。任务由 MEC 服务器在附近边缘设备的协助下执行，因此 MEC 网络可视为一个分布式 "主-次节点 "系统，其中每个 MEC 服务器作为主节点 (PN)，决定分配给其次节点 (SN)（即附近的边缘设备）的子任务。PN 的决策问题很复杂，因为其 SN 可能与其他相邻的 PN 相关联。在这种情况下，SN 的可用处理资源取决于所有相邻 PN 的子任务分配决策。由于 PN 由不同的操作员控制，它们不会协调其决策，因此每个 PN 都不确定其邻居的子任务分配（因此也不确定其 SN 的可用资源）。为了解决这个问题，我们提出了一个基于图形贝叶斯博弈的新框架，其中 PN 在不确定其邻居决策的情况下进行博弈。我们证明该博弈有一个完美贝叶斯均衡（PBE），它能产生唯一的最优值，并提出了新的贝叶斯强化学习和贝叶斯深度强化学习算法，使每个 PN 都能自主达到 PBE（无需与其他 PN 通信）。

{"title":"Efficient Distributed Edge Computing for Dependent Delay-Sensitive Tasks in Multi-Operator Multi-Access Networks","authors":"Alia Asheralieva;Dusit Niyato;Xuetao Wei","doi":"10.1109/TPDS.2024.3468892","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3468892","url":null,"abstract":"We study the problem of distributed computing in the \u0000<italic>multi-operator multi-access edge computing\u0000 (MEC) network for \u0000<italic>dependent tasks\u0000. Every task comprises several \u0000<italic>sub-tasks\u0000 which are executed based on logical precedence modelled as a \u0000<italic>directed acyclic graph\u0000. In the graph, each vertex is a sub-task, each edge – precedence constraint, such that a sub-task can only be started after all its preceding sub-tasks are completed. Tasks are executed by MEC servers with the assistance of nearby edge devices, so that the MEC network can be viewed as a \u0000<italic>distributed\u0000 “\u0000<italic>primary-secondary node\u0000” system where each MEC server acts as a \u0000<italic>primary node\u0000 (PN) deciding on sub-tasks assigned to its \u0000<italic>secondary nodes\u0000 (SNs), i.e., nearby edge devices. The PN's decision problem is complex, as its SNs can be associated with other \u0000<italic>neighboring\u0000 PNs. In this case, the available processing resources of SNs depend on the sub-task assignment decisions of all neighboring PNs. Since PNs are controlled by different operators, they do not coordinate their decisions, and each PN is uncertain about the sub-task assignments of its neighbors (and, thus, the available resources of its SNs). To address this problem, we propose a novel framework based on a \u0000<italic>graphical Bayesian game\u0000, where PNs play under uncertainty about their neighbors’ decisions. We prove that the game has a \u0000<italic>perfect Bayesian equilibrium\u0000 (PBE) yielding \u0000<italic>unique optimal values\u0000, and formulate new \u0000<italic>Bayesian reinforcement learning\u0000 and \u0000<italic>Bayesian deep reinforcement learning\u0000 algorithms enabling each PN to reach the PBE autonomously (without communicating with other PNs).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2559-2577"},"PeriodicalIF":5.6,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Efficient Schedule Construction for Distributed Execution of Large DNN Models 高效构建大型 DNN 模型的分布式执行时间表

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-24 DOI: 10.1109/TPDS.2024.3466913

Zhiqi Lin;Youshan Miao;Guanbin Xu;Cheng Li;Olli Saarikivi;Saeed Maleki;Fan Yang

Increasingly complex and diverse deep neural network (DNN) models necessitate distributing the execution across multiple devices for training and inference tasks, and also require carefully planned schedules for performance. However, existing practices often rely on predefined schedules that may not fully exploit the benefits of emerging diverse model-aware operator placement strategies. Handcrafting high-efficiency schedules can be challenging due to the large and varying schedule space. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference for diverse operator placement strategies. To reduce search costs, Tessel leverages the insight that the most efficient schedules often exhibit repetitive pattern (repetend) across different data inputs. This leads to a two-phase approach: repetend construction and schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. Experiments with representative DNN models demonstrate that Tessel achieves up to 5.5× training performance speedup and up to 38% inference latency reduction.

由于深度神经网络（DNN）模型日益复杂多样，因此有必要在多个设备上执行训练和推理任务，同时还需要精心规划的性能时间表。然而，现有的做法往往依赖于预定义的时间表，这些时间表可能无法充分利用新兴的多样化模型感知运算符放置策略的优势。由于计划空间巨大且变化多端，手工制作高效率的计划具有挑战性。本文介绍的 Tessel 是一种自动系统，可为分布式 DNN 训练和推理搜索高效时间表，并采用多种算子放置策略。为了降低搜索成本，Tessel 利用了最高效的计划通常在不同的数据输入中表现出重复模式（repetend）这一洞察力。这导致了一种两阶段的方法：repetend 构建和计划完成。通过探索各种运算器放置策略的时间表，Tessel 显著提高了训练和推理性能。对具有代表性的 DNN 模型进行的实验表明，Tessel 的训练速度提高了 5.5 倍，推理延迟降低了 38%。

{"title":"Efficient Schedule Construction for Distributed Execution of Large DNN Models","authors":"Zhiqi Lin;Youshan Miao;Guanbin Xu;Cheng Li;Olli Saarikivi;Saeed Maleki;Fan Yang","doi":"10.1109/TPDS.2024.3466913","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3466913","url":null,"abstract":"Increasingly complex and diverse deep neural network (DNN) models necessitate distributing the execution across multiple devices for training and inference tasks, and also require carefully planned schedules for performance. However, existing practices often rely on predefined schedules that may not fully exploit the benefits of emerging diverse model-aware operator placement strategies. Handcrafting high-efficiency schedules can be challenging due to the large and varying schedule space. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference for diverse operator placement strategies. To reduce search costs, Tessel leverages the insight that the most efficient schedules often exhibit repetitive pattern (\u0000<italic>repetend\u0000) across different data inputs. This leads to a two-phase approach: repetend construction and schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. Experiments with representative DNN models demonstrate that Tessel achieves up to 5.5× training performance speedup and up to 38% inference latency reduction.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2375-2391"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Two-Timescale Joint Optimization of Task Scheduling and Resource Scaling in Multi-Data Center System Based on Multi-Agent Deep Reinforcement Learning 基于多代理深度强化学习的多数据中心系统中任务调度和资源规模的双时标联合优化

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-24 DOI: 10.1109/TPDS.2024.3467212

Shuangwu Chen;Jiangming Li;Qifeng Yuan;Huasen He;Sen Li;Jian Yang

As a new computing paradigm, multi-data center computing enables service providers to deploy their applications close to the users. However, due to the spatio-temporal changes in workloads, it is challenging to coordinate multiple distributed data centers to provide high-quality services while reducing service operation costs. To address this challenge, this article studies the joint optimization problem of task scheduling and resource scaling in multi-data center systems. Since the task scheduling and the resource scaling are usually performed in different timescales, we decompose the joint optimization problem into two sub-problems and propose a two-timescale optimization framework. The short-timescale task scheduling can promptly relieve the bursty arrivals of computing tasks, and the long-timescale resource scaling can adapt well to the long-term changes in workloads. To address the distributed optimization problem, we propose a two-timescale multi-agent deep reinforcement learning algorithm. In order to characterize the graph-structured states of connected data centers, we develop a directed graph convolutional network based global state representation model. The evaluation indicates that the proposed algorithm is able to reduce both the task makespan and the task timeout while maintaining a reasonable cost.

作为一种新的计算模式，多数据中心计算使服务提供商能够在用户附近部署应用。然而，由于工作负载的时空变化，如何协调多个分布式数据中心，在提供高质量服务的同时降低服务运营成本是一个挑战。为了应对这一挑战，本文研究了多数据中心系统中任务调度和资源扩展的联合优化问题。由于任务调度和资源扩展通常在不同的时间尺度上进行，我们将联合优化问题分解为两个子问题，并提出了一个双时间尺度优化框架。短时标任务调度可以及时缓解计算任务的突发性到达，而长时标资源扩展可以很好地适应工作负载的长期变化。为了解决分布式优化问题，我们提出了一种双时标多代理深度强化学习算法。为了描述互联数据中心的图结构状态，我们开发了基于有向图卷积网络的全局状态表示模型。评估结果表明，所提出的算法能够在保持合理成本的同时，减少任务持续时间和任务超时。

{"title":"Two-Timescale Joint Optimization of Task Scheduling and Resource Scaling in Multi-Data Center System Based on Multi-Agent Deep Reinforcement Learning","authors":"Shuangwu Chen;Jiangming Li;Qifeng Yuan;Huasen He;Sen Li;Jian Yang","doi":"10.1109/TPDS.2024.3467212","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3467212","url":null,"abstract":"As a new computing paradigm, multi-data center computing enables service providers to deploy their applications close to the users. However, due to the spatio-temporal changes in workloads, it is challenging to coordinate multiple distributed data centers to provide high-quality services while reducing service operation costs. To address this challenge, this article studies the joint optimization problem of task scheduling and resource scaling in multi-data center systems. Since the task scheduling and the resource scaling are usually performed in different timescales, we decompose the joint optimization problem into two sub-problems and propose a two-timescale optimization framework. The short-timescale task scheduling can promptly relieve the bursty arrivals of computing tasks, and the long-timescale resource scaling can adapt well to the long-term changes in workloads. To address the distributed optimization problem, we propose a two-timescale multi-agent deep reinforcement learning algorithm. In order to characterize the graph-structured states of connected data centers, we develop a directed graph convolutional network based global state representation model. The evaluation indicates that the proposed algorithm is able to reduce both the task makespan and the task timeout while maintaining a reasonable cost.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2331-2346"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks VisionAGILE：用于计算机视觉任务的多功能特定领域加速器

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-24 DOI: 10.1109/TPDS.2024.3466891

Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna

The emergence of diverse machine learning (ML) models has led to groundbreaking revolutions in computer vision (CV). These ML models include convolutional neural networks (CNNs), graph neural networks (GNNs), and vision transformers (ViTs). However, existing hardware accelerators designed for CV lack the versatility to support various ML models, potentially limiting their applicability to real-world scenarios. To address this limitation, we introduce VisionAGILE, a domain-specific accelerator designed to be versatile and capable of accommodating a range of ML models, including CNNs, GNNs, and ViTs. VisionAGILE comprises a compiler, a runtime system, and a hardware accelerator. For the hardware accelerator, we develop a novel unified architecture with a flexible data path and memory organization to support the computation primitives in various ML models. Regarding the compiler design, we develop a unified compilation workflow that maps various ML models to the proposed hardware accelerator. The runtime system executes dynamic sparsity exploitation to reduce inference latency and dynamic task scheduling for workload balance. The compiler, the runtime system, and the hardware accelerator work synergistically to support a variety of ML models in CV, enabling low-latency inference. We deploy the hardware accelerator on a state-of-the-art data center FPGA (Xilinx Alveo U250). We evaluate VisionAGILE on diverse ML models for CV, including CNNs, GNNs, hybrid models (comprising both CNN and GNN), and ViTs. The experimental results indicate that, compared with state-of-the-art CPU (GPU) implementations, VisionAGILE achieves a speedup of

$81.7times$

(

$4.8times$

) in terms of latency. Evaluated on standalone CNNs, GNNs, and ViTs, VisionAGILE demonstrates comparable or higher performance with state-of-the-art CNN accelerators, GNN accelerators, and ViT accelerators, respectively.

各种机器学习（ML）模型的出现为计算机视觉（CV）领域带来了突破性的变革。这些 ML 模型包括卷积神经网络 (CNN)、图神经网络 (GNN) 和视觉转换器 (ViT)。然而，为 CV 设计的现有硬件加速器缺乏支持各种 ML 模型的多功能性，这可能会限制它们在现实世界场景中的适用性。为了解决这一局限性，我们推出了 VisionAGILE，它是一种针对特定领域设计的加速器，具有多功能性，能够支持一系列 ML 模型，包括 CNN、GNN 和 ViT。VisionAGILE 由编译器、运行系统和硬件加速器组成。在硬件加速器方面，我们开发了一种新颖的统一架构，具有灵活的数据路径和内存组织，可支持各种 ML 模型中的计算基元。在编译器设计方面，我们开发了一个统一的编译工作流程，可将各种 ML 模型映射到拟议的硬件加速器。运行时系统执行动态稀疏性利用以减少推理延迟，并执行动态任务调度以平衡工作量。编译器、运行时系统和硬件加速器协同工作，支持 CV 中的各种 ML 模型，从而实现低延迟推理。我们在最先进的数据中心 FPGA（赛灵思 Alveo U250）上部署了硬件加速器。我们评估了 VisionAGILE 在 CV 中的各种 ML 模型，包括 CNN、GNN、混合模型（包括 CNN 和 GNN）和 ViT。实验结果表明，与最先进的 CPU（GPU）实现相比，VisionAGILE 在延迟方面提高了 81.7 美元（4.8 美元）。在独立的 CNN、GNN 和 ViT 上进行评估后，VisionAGILE 的性能分别与最先进的 CNN 加速器、GNN 加速器和 ViT 加速器相当或更高。

{"title":"VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks","authors":"Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna","doi":"10.1109/TPDS.2024.3466891","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3466891","url":null,"abstract":"The emergence of diverse machine learning (ML) models has led to groundbreaking revolutions in computer vision (CV). These ML models include convolutional neural networks (CNNs), graph neural networks (GNNs), and vision transformers (ViTs). However, existing hardware accelerators designed for CV lack the versatility to support various ML models, potentially limiting their applicability to real-world scenarios. To address this limitation, we introduce VisionAGILE, a domain-specific accelerator designed to be versatile and capable of accommodating a range of ML models, including CNNs, GNNs, and ViTs. VisionAGILE comprises a compiler, a runtime system, and a hardware accelerator. For the hardware accelerator, we develop a novel unified architecture with a flexible data path and memory organization to support the computation primitives in various ML models. Regarding the compiler design, we develop a unified compilation workflow that maps various ML models to the proposed hardware accelerator. The runtime system executes dynamic sparsity exploitation to reduce inference latency and dynamic task scheduling for workload balance. The compiler, the runtime system, and the hardware accelerator work synergistically to support a variety of ML models in CV, enabling low-latency inference. We deploy the hardware accelerator on a state-of-the-art data center FPGA (Xilinx Alveo U250). We evaluate VisionAGILE on diverse ML models for CV, including CNNs, GNNs, hybrid models (comprising both CNN and GNN), and ViTs. The experimental results indicate that, compared with state-of-the-art CPU (GPU) implementations, VisionAGILE achieves a speedup of \u0000<inline-formula><tex-math>$81.7times$</tex-math></inline-formula>\u0000 (\u0000<inline-formula><tex-math>$4.8times$</tex-math></inline-formula>\u0000) in terms of latency. Evaluated on standalone CNNs, GNNs, and ViTs, VisionAGILE demonstrates comparable or higher performance with state-of-the-art CNN accelerators, GNN accelerators, and ViT accelerators, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2405-2422"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142447219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment 利用整体稀疏性对齐在移动设备上高效推断剪枝 CNN 模型

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-17 DOI: 10.1109/TPDS.2024.3462092

Yuyang Jin;Runxin Zhong;Saiqin Long;Jidong Zhai

Many artificial intelligence applications based on convolutional neural networks are directly deployed on mobile devices to avoid network unavailability and user privacy leakage. However, the significant increase in model parameter volumes makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compress models by reducing model parameters and computational operations, which also introduces irregular sparsity of neural networks, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, namely MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning the sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations for the alignment of non-zero weight count and vector width, to improve computational efficiency while ensuring accuracy. During the code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. Besides, convolution computations are reordered for alignment, and then mapped to parallel threads on accelerated units to achieve high parallelism. Experimental results using several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency, with no accuracy degradation.

许多基于卷积神经网络的人工智能应用都直接部署在移动设备上，以避免网络不可用和用户隐私泄露。然而，由于模型参数量大幅增加，很难在计算能力有限的移动设备上实现高性能的卷积神经网络推理。权重剪枝是通过减少模型参数和计算操作来压缩模型的主要方法之一，但同时也会引入神经网络的不规则稀疏性，导致推理过程中的计算和内存访问效率低下。本研究提出了一种端到端框架，即 MCPruner，通过将稀疏模式与计算、内存访问和并行性方面的硬件执行特性相匹配，在移动设备上高效推断剪枝卷积神经网络。它首先共同设计了剪枝方法和代码生成优化方法，用于调整非零权重计数和向量宽度，以提高计算效率，同时确保准确性。在代码生成过程中，它采用了稀疏模式感知格式，以减少低效的内存访问。此外，卷积计算会重新排序以进行对齐，然后映射到加速单元上的并行线程，以实现高并行性。在基于 ARM 的 Hikey970 上使用几种常用模型和数据集的实验结果表明，我们的工作在推理效率方面优于最先进的方法，而且准确性没有降低。

{"title":"Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment","authors":"Yuyang Jin;Runxin Zhong;Saiqin Long;Jidong Zhai","doi":"10.1109/TPDS.2024.3462092","DOIUrl":"10.1109/TPDS.2024.3462092","url":null,"abstract":"Many artificial intelligence applications based on convolutional neural networks are directly deployed on mobile devices to avoid network unavailability and user privacy leakage. However, the significant increase in model parameter volumes makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compress models by reducing model parameters and computational operations, which also introduces irregular sparsity of neural networks, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, namely MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning the sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations for the alignment of non-zero weight count and vector width, to improve computational efficiency while ensuring accuracy. During the code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. Besides, convolution computations are reordered for alignment, and then mapped to parallel threads on accelerated units to achieve high parallelism. Experimental results using several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency, with no accuracy degradation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2208-2223"},"PeriodicalIF":5.6,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Freyr $^+$+: Harvesting Idle Resources in Serverless Computing via Deep Reinforcement Learning Freyr$^+$：通过深度强化学习挖掘无服务器计算中的闲置资源

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-17 DOI: 10.1109/TPDS.2024.3462294

Hanfei Yu;Hao Wang;Jian Li;Xu Yuan;Seung-Jong Park

Serverless computing has revolutionized online service development and deployment with ease-to-use operations, auto-scaling, fine-grained resource allocation, and pay-as-you-go pricing. However, a gap remains in configuring serverless functions—the actual resource consumption may vary due to function types, dependencies, and input data sizes, thus mismatching the static resource configuration by users. Dynamic resource consumption against static configuration may lead to either poor function execution performance or low utilization. This paper proposes Freyr

$^+$

, a novel resource manager (RM) that dynamically harvests idle resources from over-provisioned functions to accelerate under-provisioned functions for serverless platforms. Freyr

$^+$

monitors each function's resource utilization in real-time and detects the mismatches between user configuration and actual resource consumption. We design deep reinforcement learning (DRL) algorithms with attention-enhanced embedding, incremental learning, and safeguard mechanism for Freyr

$^+$

to harvest idle resources safely and accelerate functions efficiently. We have implemented and deployed a Freyr

$^+$

prototype in a 13-node Apache OpenWhisk cluster using AWS EC2. Freyr

$^+$

is evaluated on both large-scale simulation and real-world testbed. Experimental results show that Freyr

$^+$

harvests 38% of function invocations’ idle resources and accelerates 39% of invocations using harvested resources. Freyr

$^+$

reduces the 99th-percentile function response latency by 26% compared to the baseline RMs.

无服务器计算凭借易于使用的操作、自动缩放、细粒度资源分配和即用即付的定价方式，彻底改变了在线服务的开发和部署。然而，在配置无服务器功能方面仍存在差距--实际资源消耗可能会因功能类型、依赖关系和输入数据大小而变化，从而与用户的静态资源配置不匹配。与静态配置不匹配的动态资源消耗可能导致函数执行性能低下或利用率低。本文提出了一种新型资源管理器（RM）--Freyr$^+$，它可以动态地从超额配置的函数中获取闲置资源，从而加速无服务器平台中配置不足的函数。Freyr$^+$ 实时监控每个函数的资源利用率，并检测用户配置与实际资源消耗之间的不匹配。我们为Freyr$^+$设计了具有注意力增强嵌入、增量学习和保障机制的深度强化学习（DRL）算法，以安全地获取闲置资源并高效地加速函数。我们在使用 AWS EC2 的 13 节点 Apache OpenWhisk 集群中实施并部署了 Freyr$^+$ 原型。Freyr$^+$ 在大规模模拟和真实世界测试平台上进行了评估。实验结果表明，Freyr$^+$ 可收集 38% 的函数调用闲置资源，并利用收集的资源加速 39% 的调用。与基准 RM 相比，Freyr$^+$ 将第 99 百分位函数响应延迟降低了 26%。

{"title":"Freyr $^+$+: Harvesting Idle Resources in Serverless Computing via Deep Reinforcement Learning","authors":"Hanfei Yu;Hao Wang;Jian Li;Xu Yuan;Seung-Jong Park","doi":"10.1109/TPDS.2024.3462294","DOIUrl":"10.1109/TPDS.2024.3462294","url":null,"abstract":"Serverless computing has revolutionized online service development and deployment with ease-to-use operations, auto-scaling, fine-grained resource allocation, and pay-as-you-go pricing. However, a gap remains in configuring serverless functions—the actual resource consumption may vary due to function types, dependencies, and input data sizes, thus mismatching the static resource configuration by users. Dynamic resource consumption against static configuration may lead to either poor function execution performance or low utilization. This paper proposes \u0000Freyr\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000, a novel resource manager (RM) that dynamically harvests idle resources from over-provisioned functions to accelerate under-provisioned functions for serverless platforms. \u0000Freyr\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 monitors each function's resource utilization in real-time and detects the mismatches between user configuration and actual resource consumption. We design deep reinforcement learning (DRL) algorithms with attention-enhanced embedding, incremental learning, and safeguard mechanism for \u0000Freyr\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 to harvest idle resources safely and accelerate functions efficiently. We have implemented and deployed a \u0000Freyr\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 prototype in a 13-node Apache OpenWhisk cluster using AWS EC2. \u0000Freyr\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 is evaluated on both large-scale simulation and real-world testbed. Experimental results show that \u0000Freyr\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 harvests 38% of function invocations’ idle resources and accelerates 39% of invocations using harvested resources. \u0000Freyr\u0000<inline-formula><tex-math>$^+$</tex-math></inline-formula>\u0000 reduces the 99th-percentile function response latency by 26% compared to the baseline RMs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2254-2269"},"PeriodicalIF":5.6,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Efficient Cross-Cloud Partial Reduce With CREW 利用 CREW 实现高效的跨云部分还原

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-13 DOI: 10.1109/TPDS.2024.3460185

Shouxi Luo;Renyi Wang;Ke Li;Huanlai Xing

By allowing

$p$

out of

$n$

workers to conduct all reduce operations among them for a round of synchronization, partial reduce, a promising partially-asynchronous variant of all reduce, has shown its power in alleviating the impacts of stragglers for iterative distributed machine learning (DML). Current partial reduce solutions are mainly designed for intra-cluster DML, in which workers are networked with high-bandwidth LAN links. Yet no prior work has looked into the problem of how to achieve efficient partial reduce for cross-cloud DML, where inter-worker connections are with scarcely-available capacities. To fill the gap, in this paper, we propose CREW, a flexible and efficient implementation of partial reduce for cross-cloud DML. At the high level, CREW is built upon the novel design of employing all active workers along with their internal connection capacities to execute the involved communication and computation tasks; and at the low level, CREW employs a suite of algorithms to distribute the tasks among workers in a load-balanced way, and deal with possible outages of workers/connections, and bandwidth contention. Detailed performance studies confirm that, CREW not only shortens the execution of each partial reduce operation, outperforming existing communication schemes such as PS, Ring, TopoAdopt, and BLINK greatly, but also significantly accelerates the training of large models, up to

$15times$

and

$9times$

, respectively, when compared with the all-to-all direct communication scheme and original partial reduce design.

部分还原是全部还原的一种很有前途的部分异步变体，它允许 n 个工作者中的 $p$ 在一轮同步中在他们之间进行所有还原操作，在减轻迭代式分布式机器学习（DML）中滞后者的影响方面显示出了它的威力。目前的部分还原解决方案主要是针对集群内的 DML 而设计的，在集群内，工作人员通过高带宽局域网链接联网。然而，对于如何在跨云 DML 中实现高效的部分还原这一问题，此前还没有任何研究成果，因为在跨云 DML 中，工人之间的连接几乎没有可用的容量。为了填补这一空白，我们在本文中提出了 CREW，一种灵活高效的跨云 DML 部分还原实现方法。在高层，CREW 采用了新颖的设计，即利用所有活跃的工作者及其内部连接能力来执行相关的通信和计算任务；在底层，CREW 采用了一套算法，以负载均衡的方式在工作者之间分配任务，并处理工作者/连接可能出现的中断和带宽争用问题。详细的性能研究证实，与全对全直接通信方案和原始的部分还原设计相比，CREW 不仅缩短了每个部分还原操作的执行时间，大大优于现有的通信方案，如 PS、Ring、TopoAdopt 和 BLINK，而且还显著加快了大型模型的训练速度，分别达到 $15/times$ 和 $9/times$。

{"title":"Efficient Cross-Cloud Partial Reduce With CREW","authors":"Shouxi Luo;Renyi Wang;Ke Li;Huanlai Xing","doi":"10.1109/TPDS.2024.3460185","DOIUrl":"10.1109/TPDS.2024.3460185","url":null,"abstract":"By allowing \u0000<inline-formula><tex-math>$p$</tex-math></inline-formula>\u0000 out of \u0000<inline-formula><tex-math>$n$</tex-math></inline-formula>\u0000 workers to conduct \u0000all reduce\u0000 operations among them for a round of synchronization, \u0000partial reduce\u0000, a promising partially-asynchronous variant of \u0000all reduce\u0000, has shown its power in alleviating the impacts of stragglers for iterative distributed machine learning (DML). Current \u0000partial reduce\u0000 solutions are mainly designed for intra-cluster DML, in which workers are networked with high-bandwidth LAN links. Yet no prior work has looked into the problem of how to achieve efficient \u0000partial reduce\u0000 for cross-cloud DML, where inter-worker connections are with scarcely-available capacities. To fill the gap, in this paper, we propose \u0000CREW\u0000, a flexible and efficient implementation of \u0000partial reduce\u0000 for cross-cloud DML. At the high level, \u0000CREW\u0000 is built upon the novel design of employing all active workers along with their internal connection capacities to execute the involved communication and computation tasks; and at the low level, \u0000CREW\u0000 employs a suite of algorithms to distribute the tasks among workers in a load-balanced way, and deal with possible outages of workers/connections, and bandwidth contention. Detailed performance studies confirm that, \u0000CREW\u0000 not only shortens the execution of each \u0000partial reduce\u0000 operation, outperforming existing communication schemes such as PS, Ring, \u0000TopoAdopt\u0000, and BLINK greatly, but also significantly accelerates the training of large models, up to \u0000<inline-formula><tex-math>$15times$</tex-math></inline-formula>\u0000 and \u0000<inline-formula><tex-math>$9times$</tex-math></inline-formula>\u0000, respectively, when compared with the all-to-all direct communication scheme and \u0000original partial reduce\u0000 design.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2224-2238"},"PeriodicalIF":5.6,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

An Evaluation Framework for Dynamic Thermal Management Strategies in 3D MultiProcessor System-on-Chip Co-Design 三维多处理器片上系统协同设计中动态热管理策略的评估框架

IF 5.6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-09-12 DOI: 10.1109/TPDS.2024.3459414

Darong Huang;Luis Costero;David Atienza

Dynamic thermal management (DTM) has been widely adopted to improve the energy efficiency, reliability, and performance of modern Multi-Processor SoCs (MPSoCs). However, the evolving industry trends and heterogeneous architecture designs have introduced significant challenges in state-of-the-art DTM methods. Specifically, the emergence of heterogeneous design has led to increased localized and non-uniform hotspots, necessitating accurate and responsive DTM strategies. Additionally, the increased number of cores to be managed requires the DTM to optimize and coordinate the whole system. However, existing methodologies fail in both precise thermal modeling in localized hotspots and fast architecture simulation. To tackle these existing challenges, we first introduce the latest version of 3D-ICE 3.1, with a novel non-uniform thermal modeling technique to support customized discretization levels of thermal grids. 3D-ICE 3.1 improves the accuracy of thermal analysis and reduces simulation overhead. Then, in conjunction with an efficient and fast offline application profiling strategy utilizing the architecture simulator gem5-X, we propose a novel DTM evaluation framework. This framework enables us to explore novel DTM methods to optimize the energy efficiency, reliability, and performance of contemporary 3D MPSoCs. The experimental results demonstrate that 3D-ICE 3.1 achieves high accuracy, with only 0.3K mean temperature error. Subsequently, we evaluate various DTM methods and propose a Multi-Agent Reinforcement Learning (MARL) control to address the demanding thermal challenges of 3D MPSoCs. Our experimental results show that the proposed DTM method based on MARL can reduce power consumption by 13% while maintaining a similar performance level to the comparison methods.

动态热管理（DTM）已被广泛采用，以提高现代多处理器系统级芯片（MPSoC）的能效、可靠性和性能。然而，不断发展的行业趋势和异构架构设计给最先进的 DTM 方法带来了重大挑战。具体来说，异构设计的出现导致局部和非均匀热点的增加，这就要求采用准确、灵敏的 DTM 策略。此外，需要管理的内核数量增加，要求 DTM 对整个系统进行优化和协调。然而，现有方法在局部热点的精确热建模和快速架构仿真方面都存在不足。为了应对这些现有挑战，我们首先介绍了最新版本的 3D-ICE 3.1，该版本采用了新颖的非均匀热建模技术，支持自定义热网格离散等级。3D-ICE 3.1 提高了热分析的精度，降低了仿真开销。然后，结合利用架构模拟器 gem5-X 的高效、快速离线应用剖析策略，我们提出了一个新颖的 DTM 评估框架。该框架使我们能够探索新型 DTM 方法，以优化当代 3D MPSoC 的能效、可靠性和性能。实验结果表明，3D-ICE 3.1 实现了高精度，平均温度误差仅为 0.3K。随后，我们对各种 DTM 方法进行了评估，并提出了多代理强化学习 (MARL) 控制方法，以应对 3D MPSoC 在散热方面的严峻挑战。实验结果表明，基于 MARL 提出的 DTM 方法可以降低 13% 的功耗，同时保持与比较方法类似的性能水平。

{"title":"An Evaluation Framework for Dynamic Thermal Management Strategies in 3D MultiProcessor System-on-Chip Co-Design","authors":"Darong Huang;Luis Costero;David Atienza","doi":"10.1109/TPDS.2024.3459414","DOIUrl":"10.1109/TPDS.2024.3459414","url":null,"abstract":"Dynamic thermal management (DTM) has been widely adopted to improve the energy efficiency, reliability, and performance of modern Multi-Processor SoCs (MPSoCs). However, the evolving industry trends and heterogeneous architecture designs have introduced significant challenges in state-of-the-art DTM methods. Specifically, the emergence of heterogeneous design has led to increased localized and non-uniform hotspots, necessitating accurate and responsive DTM strategies. Additionally, the increased number of cores to be managed requires the DTM to optimize and coordinate the whole system. However, existing methodologies fail in both precise thermal modeling in localized hotspots and fast architecture simulation. To tackle these existing challenges, we first introduce the latest version of 3D-ICE 3.1, with a novel non-uniform thermal modeling technique to support customized discretization levels of thermal grids. 3D-ICE 3.1 improves the accuracy of thermal analysis and reduces simulation overhead. Then, in conjunction with an efficient and fast offline application profiling strategy utilizing the architecture simulator gem5-X, we propose a novel DTM evaluation framework. This framework enables us to explore novel DTM methods to optimize the energy efficiency, reliability, and performance of contemporary 3D MPSoCs. The experimental results demonstrate that 3D-ICE 3.1 achieves high accuracy, with only 0.3K mean temperature error. Subsequently, we evaluate various DTM methods and propose a Multi-Agent Reinforcement Learning (MARL) control to address the demanding thermal challenges of 3D MPSoCs. Our experimental results show that the proposed DTM method based on MARL can reduce power consumption by 13% while maintaining a similar performance level to the comparison methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2161-2176"},"PeriodicalIF":5.6,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0