
Latest Publications in IEEE Transactions on Computers

DIVIDE: Efficient RowHammer Defense via In-DRAM Cache-Based Hot Data Isolation
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 12, pp. 3980-3994 | Pub Date: 2025-09-05 | DOI: 10.1109/TC.2025.3603729
Haitao Du;Yuxuan Yang;Song Chen;Yi Kang
RowHammer poses a serious reliability challenge to modern DRAM systems. As technology scales down, DRAM resistance to RowHammer has decreased by $30\times$ over the past decade, causing an increasing number of benign applications to suffer from this issue. However, existing defense mechanisms have three limitations: 1) they rely on inefficient mitigation techniques, such as time-consuming victim row refresh; 2) they do not reduce the number of effective RowHammer attacks, leading to frequent mitigations; and 3) they fail to recognize that frequently accessed data is not only a root cause of RowHammer but also presents an opportunity for performance optimization. In this paper, we observe that frequently accessed hot data plays a distinct role in security and efficiency: it can induce RowHammer by interfering with adjacent cold data, while also being performance-critical due to its frequent accesses. To this end, we propose Data Isolation via In-DRAM Cache (DIVIDE), a novel defense mechanism that leverages in-DRAM cache to isolate and exploit hot data. DIVIDE offers three key benefits: 1) It reduces the number of effective RowHammer attacks, as hot data in the cache cannot interfere with each other. 2) It provides a simple yet effective mitigation measure by isolating hot data from cold data. 3) It caches frequently accessed hot data, improving average access latency. DIVIDE employs a two-level protection structure: the first level mitigates RowHammer in cache arrays with high efficiency, while the second level addresses the remaining threats in normal arrays to ensure complete protection. Owing to the high in-DRAM cache hit rate, DIVIDE efficiently mitigates RowHammer while preserving both the performance and energy efficiency of the in-DRAM cache. At a RowHammer threshold of 128, DIVIDE with probabilistic mitigation achieves an average performance improvement of 19.6% and energy savings of 20.4% over DDR4 DRAM for four-core workloads. Compared to an unprotected in-DRAM cache DRAM, DIVIDE incurs only a 2.1% performance overhead while requiring just a modest 1KB per-channel CAM in the memory controller, with no modification to the DRAM chip.
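The abstract mentions probabilistic mitigation at a RowHammer threshold of 128 but does not spell out the mechanism. As a rough, generic illustration of probabilistic mitigation (a PARA-style policy, not necessarily the one DIVIDE uses), the Python sketch below refreshes a row's neighbors with a small, assumed probability on each activation and estimates how often an aggressor could complete 128 activations without ever triggering a neighbor refresh.

```python
import random

# Hypothetical sketch of PARA-style probabilistic RowHammer mitigation:
# on every row activation, the neighboring (victim) rows are refreshed
# with a small probability p, so an aggressor row is very unlikely to
# reach the RowHammer threshold (e.g., 128) without a neighbor refresh.

ROWHAMMER_THRESHOLD = 128   # threshold taken from the abstract
P_MITIGATE = 0.05           # illustrative probability, not from the paper

def on_activate(row: int, refresh_neighbors) -> None:
    """Called by the (hypothetical) controller on each ACT command."""
    if random.random() < P_MITIGATE:
        refresh_neighbors(row - 1, row + 1)

def simulate(trials: int = 10_000) -> float:
    """Fraction of 128-activation bursts that never see a neighbor refresh."""
    escapes = 0
    for _ in range(trials):
        mitigated = False
        def refresh(*_rows):
            nonlocal mitigated
            mitigated = True
        for _ in range(ROWHAMMER_THRESHOLD):
            on_activate(42, refresh)
        if not mitigated:
            escapes += 1
    return escapes / trials

if __name__ == "__main__":
    print(f"fraction of unmitigated 128-activation bursts: {simulate():.4f}")
```

With p = 0.05, the escape probability is roughly (1 - 0.05)^128, about 0.0014, which is why even a small per-activation probability suffices at such a low threshold.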
Citations: 0
OOLU: An Operation-Based Optimized Sparse LU Decomposition Accelerator for Circuit Simulation
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 12, pp. 4065-4079 | Pub Date: 2025-09-04 | DOI: 10.1109/TC.2025.3605751
Ke Hu;Fan Yang
As scientific and engineering challenges grow in complexity and scale, the demand for effective solutions for sparse matrix computations becomes increasingly critical. LU decomposition, known for its ability to reduce computational load and enhance numerical stability, serves as a promising approach. This study focuses on accelerating sparse LU decomposition for circuit simulations, addressing the prolonged simulation times caused by large circuit matrices. We present a novel Operation-based Optimized LU (OOLU) decomposition architecture that significantly improves circuit analysis efficiency. OOLU employs a VLIW-like processing element array and incorporates a scheduler that decomposes computations into a fine-grained operational task flow graph, maximizing inter-operation parallelism. Specialized scheduling and data mapping strategies are applied to align with the adaptable pipelined framework and the characteristics of circuit matrices. The OOLU architecture is prototyped on an FPGA and validated through extensive tests on the University of Florida sparse matrix collection, benchmarked against multiple platforms. The accelerator achieves speedups ranging from $3.48\times$ to $32.25\times$ (average $12.51\times$) over the KLU software package. It also delivers average speedups of $2.64\times$ over a prior FPGA accelerator and $25.18\times$ and $32.27\times$ over the GPU accelerators STRUMPACK and SFLU, respectively, highlighting the substantial efficiency gains our approach delivers.
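The key data structure here is the fine-grained operational task flow graph. The sketch below is only a generic illustration (not OOLU's scheduler): it enumerates the div and update operations of a right-looking LU factorization on a small matrix and records, for each operation, which earlier operations produce its inputs. That dependency information is what a hardware scheduler would use to extract inter-operation parallelism; a real sparse-LU flow would build the graph over the nonzero pattern produced by symbolic analysis rather than over every entry.

```python
# Illustrative sketch only: build a fine-grained operation graph for LU
# factorization of a small matrix. Every entry is treated as nonzero to keep
# the example short; a sparse scheduler would skip structurally zero updates.

def build_lu_task_graph(n: int):
    ops = []          # list of (op_id, kind, target, reads)
    deps = {}         # op_id -> set of op_ids whose results it consumes
    writer = {}       # (i, j) -> op_id of the last op writing that entry

    def add(kind, target, reads):
        op_id = len(ops)
        ops.append((op_id, kind, target, reads))
        deps[op_id] = {writer[r] for r in reads if r in writer}
        writer[target] = op_id
        return op_id

    for k in range(n):
        for i in range(k + 1, n):
            add("div", (i, k), [(i, k), (k, k)])                 # L[i,k] = A[i,k] / A[k,k]
            for j in range(k + 1, n):
                add("update", (i, j), [(i, j), (i, k), (k, j)])  # A[i,j] -= L[i,k] * A[k,j]
    return ops, deps

ops, deps = build_lu_task_graph(3)
for op_id, kind, target, reads in ops:
    print(op_id, kind, "writes", target, "after", sorted(deps[op_id]))
```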
Citations: 0
Spatial-Temporal Embodied Carbon Models With Dual Carbon Attribution for Embodied Carbon Accounting of Computer Systems
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 12, pp. 4037-4049 | Pub Date: 2025-09-04 | DOI: 10.1109/TC.2025.3605743
Xiaoyang Zhang;Yijie Yang;Dan Wang
Embodied carbon is the carbon emissions in the manufacturing process of products, which dominates the overall carbon footprint in many industries. Existing studies derive the embodied carbon through life cycle analysis (LCA) reports. Current LCA reports only provide the carbon emission of a product class, e.g. 28nm CPU, whereas a product instance can be made in various regions and time periods. Carbon emissions depend on the electricity generation process, which has spatial-temporal dynamics. Therefore, the embodied carbon of a product instance can differ from its product class. Additionally, different carbon attribution methods (e.g., location-based and market-based) can affect the carbon emissions of electricity, thus further affecting the embodied carbon of products. In this paper, we present new Spatial-Temporal Embodied Carbon (STEC) accounting models with dual attribution methods. We observe significant differences between STEC and current models, e.g., for 7nm CPU the difference is 13.69%. We further examine the impact of STEC models on existing embodied carbon accounting schemes on computer applications, such as Large Language Model (LLM) training and LLM inference. We observe that using STEC results in much greater differences in the embodied carbon of certain applications as compared to others (e.g., 32.26% vs. 6.35%).
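The core accounting idea (per-region, per-time carbon intensity combined with either location-based or market-based attribution) can be illustrated in a few lines of Python. All regions, intensities, and energy figures below are invented placeholders, not values from the paper.

```python
# Minimal sketch of spatial-temporal embodied-carbon accounting with two
# attribution methods. The structure mirrors the idea in the abstract; the
# numbers are made up purely for illustration.

# kgCO2e per kWh of the local grid, per (region, time-of-day bucket)
LOCATION_INTENSITY = {
    ("taiwan", "day"): 0.50, ("taiwan", "night"): 0.60,
    ("arizona", "day"): 0.35, ("arizona", "night"): 0.45,
}
# Market-based intensity reflects purchased-energy contracts (e.g., RECs/PPAs)
MARKET_INTENSITY = {"taiwan": 0.40, "arizona": 0.10}

# Manufacturing steps of one product instance: (region, time bucket, kWh)
steps = [("taiwan", "day", 120.0), ("taiwan", "night", 80.0),
         ("arizona", "day", 40.0)]

def embodied_carbon(steps, attribution="location"):
    total = 0.0
    for region, period, kwh in steps:
        if attribution == "location":
            intensity = LOCATION_INTENSITY[(region, period)]
        else:  # "market"
            intensity = MARKET_INTENSITY[region]
        total += kwh * intensity
    return total

print("location-based:", embodied_carbon(steps, "location"), "kgCO2e")
print("market-based:  ", embodied_carbon(steps, "market"), "kgCO2e")
```

Because the two attribution methods weight the same energy consumption differently, the same product instance ends up with two different embodied-carbon figures, which is exactly the kind of divergence the abstract reports between STEC and class-level LCA numbers.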
Citations: 0
JCSRC: Joint Client Selection and Resource Configuration for Energy-Efficient Multi-Task Federated Learning
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 12, pp. 4094-4108 | Pub Date: 2025-09-04 | DOI: 10.1109/TC.2025.3605765
Junpeng Ke;Junlong Zhou;Dan Meng;Yue Zeng;Yizhou Shi;Xiangmou Qu;Song Guo
Federated learning (FL) enables privacy-preserving distributed machine learning by training models on edge client devices using their local data without revealing their raw data. In edge environments, various applications require different neural network models, making it crucial to perform joint training of multiple models on edge devices, known as multi-task FL. While existing multi-task FL approaches enhance resource utilization on edge devices through adaptive resource configuration or client selection, optimizing either of these aspects alone may lead to suboptimality. Therefore, in this paper, we explore a joint client selection and resource configuration method called JCSRC for multi-task FL, aiming to maximize energy efficiency in environments with limited computation and communication resources and heterogeneous client devices. Firstly, we formalize this problem as a mixed-integer nonlinear programming problem considering all these characteristics and prove its NP-hardness. To address this problem, we first design a multi-agent reinforcement learning (MARL)-based client selection method that selects appropriate clients for each task to train their models. The MARL method makes client selection decisions based on the clients’ data quality, energy efficiency, communication, and computation capacity to ensure fast convergence and energy efficiency. Then, we design a particle swarm optimization (PSO)-based resource configuration scheme that configures appropriate computation and bandwidth resources for each task on each client. The PSO scheme makes resource configuration decisions based on theoretically derived optimal CPU frequency and bandwidth to achieve high energy efficiency. Finally, we carry out extensive simulations and testbed-based experiments to validate our proposed JCSRC. The results demonstrate that, in comparison to state-of-the-art solutions, JCSRC can save energy consumption by up to 59% to achieve the target accuracy.
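As a loose illustration of the PSO-based resource configuration stage, the sketch below runs a textbook particle swarm over a toy per-client energy model (computation energy proportional to frequency squared plus transmission energy) under a latency deadline. The objective, constants, and bounds are assumptions for illustration; the paper derives its configuration from analytically optimal CPU frequency and bandwidth rather than from this toy search.

```python
import random

# Generic particle-swarm sketch for picking a CPU frequency and a bandwidth
# share that minimize an illustrative energy-per-round model under a latency
# budget. All constants are invented, not taken from the paper.

CYCLES = 2e9          # CPU cycles per local training round (assumed)
DATA_BITS = 8e6       # bits uploaded per round (assumed)
KAPPA = 1e-28         # effective switched capacitance (assumed)
TX_POWER = 0.5        # transmit power in watts (assumed)
DEADLINE = 5.0        # per-round latency budget in seconds

def energy(freq_hz, bw_hz):
    t_comp = CYCLES / freq_hz
    rate = bw_hz * 2.0                 # toy spectral efficiency of 2 bit/s/Hz
    t_comm = DATA_BITS / rate
    if t_comp + t_comm > DEADLINE:     # infeasible configuration: big penalty
        return 1e9
    return KAPPA * CYCLES * freq_hz ** 2 + TX_POWER * t_comm

def pso(n=30, iters=200, lo=(0.5e9, 1e6), hi=(3e9, 20e6)):
    dims = 2
    xs = [[random.uniform(lo[d], hi[d]) for d in range(dims)] for _ in range(n)]
    vs = [[0.0] * dims for _ in range(n)]
    pbest = [x[:] for x in xs]
    gbest = min(pbest, key=lambda x: energy(*x))
    for _ in range(iters):
        for i in range(n):
            for d in range(dims):
                r1, r2 = random.random(), random.random()
                vs[i][d] = (0.7 * vs[i][d]
                            + 1.5 * r1 * (pbest[i][d] - xs[i][d])
                            + 1.5 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] = min(max(xs[i][d] + vs[i][d], lo[d]), hi[d])
            if energy(*xs[i]) < energy(*pbest[i]):
                pbest[i] = xs[i][:]
        gbest = min(pbest + [gbest], key=lambda x: energy(*x))
    return gbest, energy(*gbest)

best, e = pso()
print(f"freq={best[0]/1e9:.2f} GHz, bw={best[1]/1e6:.2f} MHz, energy={e:.4f} J")
```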
Citations: 0
Optimizing Multi-DNN Parallel Inference Performance in MEC Networks: A Resource-Aware and Dynamic DNN Deployment Scheme
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 11, pp. 3938-3952 | Pub Date: 2025-09-03 | DOI: 10.1109/TC.2025.3605749
Tong Zheng;Yuanguo Bi;Guangjie Han;Xingwei Wang;Yuheng Liu;Yufei Liu;Xiangyi Chen
The advent of Multi-access Edge Computing (MEC) has empowered Internet of Things (IoT) devices and edge servers to deploy sophisticated Deep Neural Network (DNN) applications, enabling real-time inference. Many concurrent inference requests and intricate DNN models demand efficient multi-DNN inference in MEC networks. However, the resource-limited IoT device/edge server and expanding model size force models to be dynamically deployed, resulting in significant undesired energy consumption. In addition, parallel multi-DNN inference on the same device complicates the inference process due to the resource competition among models, increasing the inference latency. In this paper, we propose a Resource-aware and Dynamic DNN Deployment (R3D) scheme with the collaboration of end-edge-cloud. To mitigate resource competition and waste during multi-DNN parallel inference, we develop a Resource Adaptive Management (RAM) algorithm based on the Roofline model, which dynamically allocates resources by accounting for the impact of device-specific performance bottlenecks on inference latency. Additionally, we design a Deep Reinforcement Learning (DRL)-based online optimization algorithm that dynamically adjusts DNN deployment strategies to achieve fast and energy-efficient inference across heterogeneous devices. Experiment results demonstrate that R3D is applicable in MEC environments and performs well in terms of inference latency, resource utilization, and energy consumption.
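The Roofline model that the RAM algorithm builds on bounds a kernel's attainable throughput by the smaller of the device's compute peak and its memory bandwidth multiplied by the kernel's operational intensity. The snippet below shows that bound with illustrative device numbers (not taken from the paper), which is enough to see why some DNN layers are memory-bound while others are compute-bound.

```python
# Roofline bound: attainable throughput is limited either by the compute peak
# or by memory bandwidth times operational intensity (FLOPs per byte moved).
# The device figures and intensities below are illustrative assumptions.

def roofline(peak_gflops: float, bw_gbs: float, intensity_flop_per_byte: float) -> float:
    """Attainable GFLOP/s for a kernel with the given operational intensity."""
    return min(peak_gflops, bw_gbs * intensity_flop_per_byte)

# Example: an edge accelerator with 4 TFLOP/s peak and 25 GB/s DRAM bandwidth.
for name, intensity in [("depthwise conv", 2.0), ("1x1 conv", 20.0), ("GEMM", 200.0)]:
    print(f"{name:14s}: {roofline(4000.0, 25.0, intensity):8.1f} GFLOP/s attainable")
```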
Citations: 0
Competition-Style Sorting Networks (CSN): A Framework for Hardware-Based Sorting Operations
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 12, pp. 4109-4122 | Pub Date: 2025-09-03 | DOI: 10.1109/TC.2025.3605766
Abbas A. Fairouz;Jassim M. Aljuraidan;Ameer Mohammed
Sorting operations are considered to be a significant part of any computer system and are widely used in many applications. In applications where sorting has to be efficiently accomplished (i.e., in $O(1)$ time) on small-sized entries, hardware accelerators, such as ASICs, FPGAs, or GPUs, are used to speed up the sorting operations. In the literature, the bitonic sort algorithm (or variants thereof) has been considered the most commonly used approach in hardware sort implementations for decades. However, the time complexity of the bitonic sort is $O((\log(n))^{2})$ for sorting $n$ elements, which does not satisfy the constant-time constraint we demand for our setting. In this paper, we propose competition-style sorting networks (CSNs), a framework for designing a hardware-based, competition-style class of sorting networks that captures all forms of two-stage sorting networks where the first stage (competition) consists of pairwise comparisons and the second stage (evaluation) ranks the entries and sorts them. To illustrate the utility of this framework, we develop and test one instance of this design, called the Competition Sort Algorithm (CSA), which has a time complexity of $O(1)$, specifically one clock cycle. We implemented and tested CSA on both an Intel Cyclone V FPGA and the NVIDIA Quadro T1000 GPU and then measured its gain, which combines the trade-offs between the relative speedup and the relative area increase, against the bitonic sort. Our results show that the CSA achieves a significant gain of up to $11.01\times$ on the FPGA and a relative speedup of up to $3.32\times$ on the GPU. We also compare the area and latency of CSA with the bitonic sort algorithm on the FPGA.
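The two-stage structure is easy to state in code: stage one plays every pairwise "match", and stage two places each element at the rank given by its number of wins, so for distinct keys the final position is simply the count of smaller elements. The Python sketch below is a sequential O(n^2) rendering of that idea; in hardware all comparisons and the placement can happen in parallel, which is how a single-cycle sort becomes possible. The tie-breaking rule by index is an assumption of this sketch, not necessarily CSA's.

```python
# Software sketch of a two-stage competition-style sort: stage 1 performs all
# pairwise comparisons, stage 2 writes each element to the slot given by its
# win count. This sequential version only illustrates the structure.

def competition_sort(values):
    n = len(values)
    out = [None] * n
    for i, v in enumerate(values):
        # Wins: strictly smaller opponents, plus equal ones with a lower index
        # so that duplicate keys still receive distinct ranks.
        wins = sum(1 for j, w in enumerate(values)
                   if w < v or (w == v and j < i))
        out[wins] = v
    return out

print(competition_sort([7, 3, 9, 3, 1]))   # -> [1, 3, 3, 7, 9]
```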
Citations: 0
StageWise: Accelerating Persistent Key-Value Stores by Thread Model Redesigning
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 12, pp. 4080-4093 | Pub Date: 2025-09-03 | DOI: 10.1109/TC.2025.3605763
Zeqi Li;Youmin Chen;Qing Wang;Youyou Lu;Jiwu Shu
With the emergence of fast NVMe SSDs, key-value stores are becoming more CPU-efficient in order to reap their bandwidth. However, current CPU-optimized key-value stores adopt suboptimal intra- and inter-thread models, hence incurring memory-level stalling and load imbalance that hinder cores from realizing their full potential. We present StageWise, a CPU-efficient key-value store on fast NVMe SSDs with high throughput. To achieve this, we introduce a new thread model for StageWise to process KV requests. Specifically, StageWise converts the processing of each KV request into multiple asynchronous stages, and thus enables pipelining across all stages. StageWise further introduces a client-driven share-index architecture to ease inter-thread load imbalance and maximize the pipelining opportunity. Guided by Little’s Law, StageWise improves concurrency, and therefore efficiently uses CPU to reach higher throughput. Extensive experimental results show that StageWise outperforms CPU-optimized key-value stores (e.g., KVell) by up to 3.5$\times$ with write-intensive workloads, and storage-optimized ones (e.g., RocksDB) by over an order of magnitude. StageWise also shows higher read performance and excellent scalability under various workloads.
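Little's Law, which the abstract cites as the guide for improving concurrency, states that the average number of requests in flight equals throughput times per-request latency (L = λW). The short example below applies it with illustrative NVMe numbers (not from the paper) to show how much outstanding work a design must keep in flight to hit a target IOPS.

```python
# Little's Law: in-flight requests L = throughput (lambda) * latency (W).
# The figures below are illustrative, not measurements from the paper.

def required_concurrency(target_iops: float, latency_s: float) -> float:
    return target_iops * latency_s

# Reaching 500K IOPS with ~100 us per-request latency needs roughly 50
# outstanding requests, which a staged, asynchronous design can sustain
# without dedicating one thread per request.
print(required_concurrency(500_000, 100e-6))   # -> 50.0
```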
Citations: 0
CoFormer: Collaborating With Heterogeneous Edge Devices for Scalable Transformer Inference
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 12, pp. 4010-4024 | Pub Date: 2025-09-02 | DOI: 10.1109/TC.2025.3604473
Guanyu Xu;Zhiwei Hao;Li Shen;Yong Luo;Fuhui Sun;Xiaoyan Wang;Han Hu;Yonggang Wen
The impressive performance of transformer models has sparked the deployment of intelligent applications on resource-constrained edge devices. However, ensuring high-quality service for real-time edge systems is a significant challenge due to the considerable computational demands and resource requirements of these models. Existing strategies typically either offload transformer computations to other devices or directly deploy compressed models on individual edge devices. These strategies, however, result in either considerable communication overhead or suboptimal trade-offs between accuracy and efficiency. To tackle these challenges, we propose a collaborative inference system for general transformer models, termed CoFormer. The central idea behind CoFormer is to exploit the divisibility and integrability of transformers. An off-the-shelf large transformer can be decomposed into multiple smaller models for distributed inference, and their intermediate results are aggregated to generate the final output. We formulate an optimization problem to minimize both inference latency and accuracy degradation under heterogeneous hardware constraints. The DeBo algorithm is proposed to first solve the optimization problem to derive the decomposition policy, and then progressively calibrate decomposed models to restore performance. We demonstrate the capability to support a wide range of transformer models on heterogeneous edge devices, achieving up to 3.1$\times$ inference speedup with large transformer models. Notably, CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%. CoFormer can also reduce energy consumption by approximately 40% while maintaining satisfactory inference performance.
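One standard way to realize "decompose a large transformer into smaller models and aggregate their intermediate results" is Megatron-style tensor parallelism over a feed-forward block: split the first weight matrix by columns and the second by rows, compute partials independently, and sum them. The NumPy sketch below checks that the aggregated partials match the monolithic computation; whether CoFormer uses exactly this split is not stated in the abstract, so treat it as a generic illustration.

```python
import numpy as np

# Split a transformer feed-forward block across "devices": W1 by columns,
# W2 by rows; each shard computes a partial output and the partials are
# summed. This reproduces the monolithic result exactly (up to float error).

rng = np.random.default_rng(0)
d_model, d_ff, shards = 8, 32, 2
x = rng.standard_normal((4, d_model))            # 4 tokens
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))

full = gelu(x @ W1) @ W2                         # reference, single device

W1_shards = np.split(W1, shards, axis=1)         # column split
W2_shards = np.split(W2, shards, axis=0)         # row split
partials = [gelu(x @ a) @ b for a, b in zip(W1_shards, W2_shards)]
aggregated = sum(partials)                       # aggregate intermediate results

print("max abs error:", np.abs(full - aggregated).max())   # ~1e-15
```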
Citations: 0
A Reputation-Based Energy-Efficient Transaction Propagation Mechanism for Blockchain-Enabled Multi-Access Edge Computing
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 11, pp. 3897-3910 | Pub Date: 2025-09-02 | DOI: 10.1109/TC.2025.3604480
Xijia Lu;Qiang He;Xingwei Wang;Jaime Lloret;Peichen Li;Ying Qian;Min Huang
Blockchain strengthens reliable collaboration among entities through its transparency, immutability, and traceability, leading to its integration into Multi-access Edge Computing (MEC) and promoting the development of a trusted JointCloud. However, existing transaction propagation mechanisms require MEC devices to consume significant computing resources for complex transaction verification, increasing their vulnerability to malicious attacks. Adversaries can exploit this by flooding the blockchain network with spam transactions, aiming to deplete device energy and disrupt system performance. To cope with these issues, this paper proposes a reputation-based energy-efficient transaction propagation mechanism that alleviates spam transaction attacks while reducing computing resources and energy consumption. Firstly, we design a subjective logic-based reputation scheme that assesses node trust by integrating local and recommended opinions and incorporates opinion acceptance to counteract false evidence. Then, we optimize the transaction verification method by adjusting transaction discard and verification probabilities based on the proposed reputation scheme to curb the propagation of spam transactions and reduce verification consumption. Finally, we enhance the transaction transmission strategy by prioritizing nodes with higher reputations, enhancing both resilience to spam transactions and transmission reliability. A series of simulations demonstrates the effectiveness of the proposed mechanism.
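The subjective-logic machinery referred to above can be summarized with the textbook operators: an opinion is a tuple (belief, disbelief, uncertainty, base rate) with b + d + u = 1, expected trust is E = b + a·u, and independent opinions (for example a local one and a recommended one) are merged with cumulative fusion. The sketch below implements only these standard operators; the paper's opinion-acceptance rule against false evidence is not reproduced here.

```python
from dataclasses import dataclass

# Textbook subjective-logic building blocks: an opinion (b, d, u, a) with
# b + d + u = 1, expected trust E = b + a*u, and cumulative fusion of two
# independent opinions. Numbers in the usage example are illustrative.

@dataclass
class Opinion:
    b: float          # belief
    d: float          # disbelief
    u: float          # uncertainty
    a: float = 0.5    # base rate

    def expected(self) -> float:
        return self.b + self.a * self.u

def cumulative_fusion(x: Opinion, y: Opinion) -> Opinion:
    k = x.u + y.u - x.u * y.u
    if k == 0:        # both fully certain: fall back to averaging beliefs
        return Opinion((x.b + y.b) / 2, (x.d + y.d) / 2, 0.0, x.a)
    return Opinion((x.b * y.u + y.b * x.u) / k,
                   (x.d * y.u + y.d * x.u) / k,
                   (x.u * y.u) / k,
                   x.a)

local = Opinion(b=0.6, d=0.1, u=0.3)        # from direct interactions
recommended = Opinion(b=0.2, d=0.5, u=0.3)  # from neighbours' reports
fused = cumulative_fusion(local, recommended)
print(fused, "expected trust:", round(fused.expected(), 3))
```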
Citations: 0
Reliability-Aware Optimization of Task Offloading for UAV-Assisted Edge Computing
IF 3.8 | CAS Tier 2, Computer Science | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Vol. 74, No. 11, pp. 3832-3844 | Pub Date: 2025-09-02 | DOI: 10.1109/TC.2025.3604463
Hao Hao;Changqiao Xu;Wei Zhang;Xingyan Chen;Shujie Yang;Gabriel-Miro Muntean
Uncrewed aerial vehicles (UAVs) are widely used for edge computing in poor infrastructure scenarios due to their deployment flexibility and mobility. In UAV-assisted edge computing systems, multiple UAVs can cooperate with the cloud to provide superior computing capability for diverse innovative services. However, many service-related computational tasks may fail due to the unreliability of UAVs and wireless transmission channels. Diverse solutions have been proposed, but most of them employ time-driven strategies which introduce unwanted decision waiting delays. To address this problem, this paper focuses on a task-driven reliability-aware cooperative offloading problem in UAV-assisted edge-enhanced networks. The issue is formulated as an optimization problem which jointly optimizes UAV trajectories, offloading decisions, and transmission power, aiming to maximize the long-term average task success rate. Considering the discrete-continuous hybrid action space of the problem, a dependence-aware latent-space representation algorithm is proposed to represent discrete-continuous hybrid actions. Furthermore, we design a novel deep reinforcement learning scheme by combining the representation algorithm and a twin delayed deep deterministic policy gradient algorithm. We compared our proposed algorithm with four alternative solutions via simulations and a realistic Kubernetes testbed-based setup. The test results show how our scheme outperforms the other methods, ensuring significant improvements in terms of task success rate.
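A common way to let a continuous-action learner such as TD3 act in a discrete-continuous hybrid space is to decode part of its output through nearest-neighbor lookup in an embedding table (for the discrete offloading choice) and to squash the rest into a bounded range (for transmit power). The sketch below shows only that generic decoding step with made-up dimensions; the paper's dependence-aware latent-space representation is learned and considerably more elaborate than this.

```python
import numpy as np

# Generic sketch: decode a continuous policy output into a hybrid action
# (discrete offloading target, continuous transmit power). Embeddings would
# be learned in practice; here they are random placeholders.

rng = np.random.default_rng(1)
N_TARGETS, EMB_DIM = 5, 3
P_MIN, P_MAX = 0.1, 2.0                                   # watts (illustrative)
embeddings = rng.standard_normal((N_TARGETS, EMB_DIM))    # learned in practice

def decode(latent: np.ndarray):
    """latent = [EMB_DIM discrete-latent dims | 1 continuous dim]."""
    z_disc, z_cont = latent[:EMB_DIM], latent[EMB_DIM]
    target = int(np.argmin(np.linalg.norm(embeddings - z_disc, axis=1)))
    power = P_MIN + (P_MAX - P_MIN) * (np.tanh(z_cont) + 1) / 2
    return target, power

raw_policy_output = rng.standard_normal(EMB_DIM + 1)      # stand-in for the actor
print(decode(raw_policy_output))
```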
Citations: 0