RowHammer poses a serious reliability challenge to modern DRAM systems. As technology scales down, DRAM resistance to RowHammer has decreased by $30\times$ over the past decade, causing an increasing number of benign applications to suffer from this issue. However, existing defense mechanisms have three limitations: 1) they rely on inefficient mitigation techniques, such as time-consuming victim row refresh; 2) they do not reduce the number of effective RowHammer attacks, leading to frequent mitigations; and 3) they fail to recognize that frequently accessed data is not only a root cause of RowHammer but also presents an opportunity for performance optimization. In this paper, we observe that frequently accessed hot data plays a distinct role in security and efficiency: it can induce RowHammer by interfering with adjacent cold data, while also being performance-critical due to its frequent accesses. To this end, we propose Data Isolation via In-DRAM Cache (DIVIDE), a novel defense mechanism that leverages in-DRAM cache to isolate and exploit hot data. DIVIDE offers three key benefits: 1) It reduces the number of effective RowHammer attacks, as hot data in the cache cannot interfere with each other. 2) It provides a simple yet effective mitigation measure by isolating hot data from cold data. 3) It caches frequently accessed hot data, improving average access latency. DIVIDE employs a two-level protection structure: the first level mitigates RowHammer in cache arrays with high efficiency, while the second level addresses the remaining threats in normal arrays to ensure complete protection. Owing to the high in-DRAM cache hit rate, DIVIDE efficiently mitigates RowHammer while preserving both the performance and energy efficiency of the in-DRAM cache. At a RowHammer threshold of 128, DIVIDE with probabilistic mitigation achieves an average performance improvement of 19.6% and energy savings of 20.4% over DDR4 DRAM for four-core workloads. Compared to an unprotected in-DRAM cache DRAM, DIVIDE incurs only a 2.1% performance overhead while requiring just a modest 1KB per-channel CAM in the memory controller, with no modification to the DRAM chip.
{"title":"DIVIDE: Efficient RowHammer Defense via In-DRAM Cache-Based Hot Data Isolation","authors":"Haitao Du;Yuxuan Yang;Song Chen;Yi Kang","doi":"10.1109/TC.2025.3603729","DOIUrl":"https://doi.org/10.1109/TC.2025.3603729","url":null,"abstract":"RowHammer poses a serious reliability challenge to modern DRAM systems. As technology scales down, DRAM resistance to RowHammer has decreased by <inline-formula><tex-math>$30times$</tex-math></inline-formula> over the past decade, causing an increasing number of benign applications to suffer from this issue. However, existing defense mechanisms have three limitations: 1) they rely on inefficient mitigation techniques, such as time-consuming victim row refresh; 2) they do not reduce the number of effective RowHammer attacks, leading to frequent mitigations; and 3) they fail to recognize that frequently accessed data is not only a root cause of RowHammer but also presents an opportunity for performance optimization. In this paper, we observe that frequently accessed hot data plays a distinct role in security and efficiency: it can induce RowHammer by interfering with adjacent cold data, while also being performance-critical due to its frequent accesses. To this end, we propose Data Isolation via In-DRAM Cache (DIVIDE), a novel defense mechanism that leverages in-DRAM cache to isolate and exploit hot data. DIVIDE offers three key benefits: 1) It reduces the number of effective RowHammer attacks, as hot data in the cache cannot interfere with each other. 2) It provides a simple yet effective mitigation measure by isolating hot data from cold data. 3) It caches frequently accessed hot data, improving average access latency. DIVIDE employs a two-level protection structure: the first level mitigates RowHammer in cache arrays with high efficiency, while the second level addresses the remaining threats in normal arrays to ensure complete protection. Owing to the high in-DRAM cache hit rate, DIVIDE efficiently mitigates RowHammer while preserving both the performance and energy efficiency of the in-DRAM cache. At a RowHammer threshold of 128, DIVIDE with probabilistic mitigation achieves an average performance improvement of 19.6% and energy savings of 20.4% over DDR4 DRAM for four-core workloads. Compared to an unprotected in-DRAM cache DRAM, DIVIDE incurs only a 2.1% performance overhead while requiring just a modest 1KB per-channel CAM in the memory controller, with no modification to the DRAM chip.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"3980-3994"},"PeriodicalIF":3.8,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As scientific and engineering challenges grow in complexity and scale, the demand for effective solutions for sparse matrix computations becomes increasingly critical. LU decomposition, known for its ability to reduce computational load and enhance numerical stability, serves as a promising approach. This study focuses on accelerating sparse LU decomposition for circuit simulations, addressing the prolonged simulation times caused by large circuit matrices. We present a novel Operation-based Optimized LU (OOLU) decomposition architecture that significantly improves circuit analysis efficiency. OOLU employs a VLIW-like processing element array and incorporates a scheduler that decomposes computations into a fine-grained operational task flow graph, maximizing inter-operation parallelism. Specialized scheduling and data mapping strategies are applied to align with the adaptable pipelined framework and the characteristics of circuit matrices. The OOLU architecture is prototyped on an FPGA and validated through extensive tests on the University of Florida sparse matrix collection, benchmarked against multiple platforms. The accelerator achieves speedups ranging from $3.48\times$ to $32.25\times$ (average $12.51\times$) over the KLU software package. It also delivers average speedups of $2.64\times$ over a prior FPGA accelerator and $25.18\times$ and $32.27\times$ over the GPU accelerators STRUMPACK and SFLU, respectively, highlighting the substantial efficiency gains our approach delivers.
{"title":"OOLU: An Operation-Based Optimized Sparse LU Decomposition Accelerator for Circuit Simulation","authors":"Ke Hu;Fan Yang","doi":"10.1109/TC.2025.3605751","DOIUrl":"https://doi.org/10.1109/TC.2025.3605751","url":null,"abstract":"As scientific and engineering challenges grow in complexity and scale, the demand for effective solutions for sparse matrix computations becomes increasingly critical. LU decomposition, known for its ability to reduce computational load and enhance numerical stability, serves as a promising approach. This study focuses on accelerating sparse LU decomposition for circuit simulations, addressing the prolonged simulation times caused by large circuit matrices. We present a novel Operation-based Optimized LU (OOLU) decomposition architecture that significantly improves circuit analysis efficiency. OOLU employs a VLIW-like processing element array and incorporates a scheduler that decomposes computations into a fine-grained operational task flow graph, maximizing inter-operation parallelism. Specialized scheduling and data mapping strategies are applied to align with the adaptable pipelined framework and the characteristics of circuit matrices. The OOLU architecture is prototyped on an FPGA and validated through extensive tests on the University of Florida sparse matrix collection, benchmarked against multiple platforms. The accelerator achieves speedups ranging from <inline-formula><tex-math>$3.48times$</tex-math></inline-formula> to <inline-formula><tex-math>$32.25times$</tex-math></inline-formula> (average <inline-formula><tex-math>$12.51times$</tex-math></inline-formula>) over the KLU software package. It also delivers average speedups of <inline-formula><tex-math>$2.64times$</tex-math></inline-formula> over a prior FPGA accelerator and <inline-formula><tex-math>$25.18times$</tex-math></inline-formula> and <inline-formula><tex-math>$32.27times$</tex-math></inline-formula> over the GPU accelerators STRUMPACK and SFLU, respectively, highlighting the substantial efficiency gains our approach delivers.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4065-4079"},"PeriodicalIF":3.8,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Embodied carbon is the carbon emitted during the manufacturing of a product, and it dominates the overall carbon footprint in many industries. Existing studies derive the embodied carbon through life cycle analysis (LCA) reports. Current LCA reports only provide the carbon emissions of a product class, e.g., a 28nm CPU, whereas a product instance can be made in various regions and time periods. Carbon emissions depend on the electricity generation process, which has spatial-temporal dynamics. Therefore, the embodied carbon of a product instance can differ from that of its product class. Additionally, different carbon attribution methods (e.g., location-based and market-based) can affect the carbon emissions attributed to electricity, which further affects the embodied carbon of products. In this paper, we present new Spatial-Temporal Embodied Carbon (STEC) accounting models with dual attribution methods. We observe significant differences between STEC and current models, e.g., a 13.69% difference for a 7nm CPU. We further examine the impact of STEC models on existing embodied carbon accounting schemes for computer applications, such as Large Language Model (LLM) training and LLM inference. We observe that using STEC results in much greater differences in the embodied carbon of certain applications as compared to others (e.g., 32.26% vs. 6.35%).
{"title":"Spatial-Temporal Embodied Carbon Models With Dual Carbon Attribution for Embodied Carbon Accounting of Computer Systems","authors":"Xiaoyang Zhang;Yijie Yang;Dan Wang","doi":"10.1109/TC.2025.3605743","DOIUrl":"https://doi.org/10.1109/TC.2025.3605743","url":null,"abstract":"<italic>Embodied carbon</i> is the carbon emissions in the manufacturing process of products, which dominates the overall carbon footprint in many industries. Existing studies derive the embodied carbon through life cycle analysis (LCA) reports. Current LCA reports only provide the carbon emission of a <italic>product class</i>, e.g. 28nm CPU, whereas a <italic>product instance</i> can be made in various regions and time periods. Carbon emissions depend on the electricity generation process, which has spatial-temporal dynamics. Therefore, the embodied carbon of a product instance can differ from its product class. Additionally, different carbon attribution methods (e.g., location-based and market-based) can affect the carbon emissions of electricity, thus further affecting the embodied carbon of products. In this paper, we present new Spatial-Temporal Embodied Carbon (STEC) accounting models with dual attribution methods. We observe significant differences between STEC and current models, e.g., for 7nm CPU the difference is 13.69%. We further examine the impact of STEC models on existing embodied carbon accounting schemes on computer applications, such as Large Language Model (LLM) training and LLM inference. We observe that using STEC results in much greater differences in the embodied carbon of certain applications as compared to others (e.g., 32.26% vs. 6.35%).","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4037-4049"},"PeriodicalIF":3.8,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Federated learning (FL) enables privacy-preserving distributed machine learning by training models on edge client devices using their local data without revealing their raw data. In edge environments, various applications require different neural network models, making it crucial to perform joint training of multiple models on edge devices, known as multi-task FL. While existing multi-task FL approaches enhance resource utilization on edge devices through adaptive resource configuration or client selection, optimizing either of these aspects alone may lead to suboptimal results. Therefore, in this paper, we explore a joint client selection and resource configuration method called JCSRC for multi-task FL, aiming to maximize energy efficiency in environments with limited computation and communication resources and heterogeneous client devices. We first formalize this problem as a mixed-integer nonlinear programming problem that considers all of these characteristics and prove its NP-hardness. To address it, we design a multi-agent reinforcement learning (MARL)-based client selection method that selects appropriate clients for each task to train its model. The MARL method makes client selection decisions based on the clients’ data quality, energy efficiency, and communication and computation capacity to ensure fast convergence and energy efficiency. Then, we design a particle swarm optimization (PSO)-based resource configuration scheme that configures appropriate computation and bandwidth resources for each task on each client. The PSO scheme makes resource configuration decisions based on theoretically derived optimal CPU frequencies and bandwidths to achieve high energy efficiency. Finally, we carry out extensive simulations and testbed-based experiments to validate the proposed JCSRC. The results demonstrate that, compared with state-of-the-art solutions, JCSRC can reduce energy consumption by up to 59% while achieving the target accuracy.
{"title":"JCSRC: Joint Client Selection and Resource Configuration for Energy-Efficient Multi-Task Federated Learning","authors":"Junpeng Ke;Junlong Zhou;Dan Meng;Yue Zeng;Yizhou Shi;Xiangmou Qu;Song Guo","doi":"10.1109/TC.2025.3605765","DOIUrl":"https://doi.org/10.1109/TC.2025.3605765","url":null,"abstract":"Federated learning (FL) enables privacy-preserving distributed machine learning by training models on edge client devices using their local data without revealing their raw data. In edge environments, various applications require different neural network models, making it crucial to perform joint training of multiple models on edge devices, known as multi-task FL. While existing multi-task FL approaches enhance resource utilization on edge devices through adaptive resource configuration or client selection, optimizing either of these aspects alone may lead to suboptimality. Therefore, in this paper, we explore a joint client selection and resource configuration method called JCSRC for multi-task FL, aiming to maximize energy efficiency in environments with limited computation and communication resources and heterogeneous client devices. Firstly, we formalize this problem as a mixed-integer nonlinear programming problem considering all these characteristics and prove its NP-hardness. To address this problem, we first design a multi-agent reinforcement learning (MARL)-based client selection method that selects appropriate clients for each task to train their models. The MARL method makes client selection decisions based on the clients’ data quality, energy efficiency, communication, and computation capacity to ensure fast convergence and energy efficiency. Then, we design a particle swarm optimization (PSO)-based resource configuration scheme that configures appropriate computation and bandwidth resources for each task on each client. The PSO scheme makes resource configuration decisions based on theoretically derived optimal CPU frequency and bandwidth to achieve high energy efficiency. Finally, we carry out extensive simulations and testbed-based experiments to validate our proposed JCSRC. The results demonstrate that, in comparison to state-of-the-art solutions, JCSRC can save energy consumption by up to 59% to achieve the target accuracy.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4094-4108"},"PeriodicalIF":3.8,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The advent of Multi-access Edge Computing (MEC) has empowered Internet of Things (IoT) devices and edge servers to deploy sophisticated Deep Neural Network (DNN) applications, enabling real-time inference. The many concurrent inference requests and intricate DNN models demand efficient multi-DNN inference in MEC networks. However, resource-limited IoT devices and edge servers, together with expanding model sizes, force models to be deployed dynamically, resulting in significant unwanted energy consumption. In addition, parallel multi-DNN inference on the same device complicates the inference process due to resource competition among models, increasing inference latency. In this paper, we propose a Resource-aware and Dynamic DNN Deployment (R3D) scheme based on end-edge-cloud collaboration. To mitigate resource competition and waste during parallel multi-DNN inference, we develop a Resource Adaptive Management (RAM) algorithm based on the Roofline model, which dynamically allocates resources by accounting for the impact of device-specific performance bottlenecks on inference latency. Additionally, we design a Deep Reinforcement Learning (DRL)-based online optimization algorithm that dynamically adjusts DNN deployment strategies to achieve fast and energy-efficient inference across heterogeneous devices. Experimental results demonstrate that R3D is applicable in MEC environments and performs well in terms of inference latency, resource utilization, and energy consumption.
{"title":"Optimizing Multi-DNN Parallel Inference Performance in MEC Networks: A Resource-Aware and Dynamic DNN Deployment Scheme","authors":"Tong Zheng;Yuanguo Bi;Guangjie Han;Xingwei Wang;Yuheng Liu;Yufei Liu;Xiangyi Chen","doi":"10.1109/TC.2025.3605749","DOIUrl":"https://doi.org/10.1109/TC.2025.3605749","url":null,"abstract":"The advent of Multi-access Edge Computing (MEC) has empowered Internet of Things (IoT) devices and edge servers to deploy sophisticated Deep Neural Network (DNN) applications, enabling real-time inference. Many concurrent inference requests and intricate DNN models demand efficient multi-DNN inference in MEC networks. However, the resource-limited IoT device/edge server and expanding model size force models to be dynamically deployed, resulting in significant undesired energy consumption. In addition, parallel multi-DNN inference on the same device complicates the inference process due to the resource competition among models, increasing the inference latency. In this paper, we propose a Resource-aware and Dynamic DNN Deployment (R3D) scheme with the collaboration of end-edge-cloud. To mitigate resource competition and waste during multi-DNN parallel inference, we develop a Resource Adaptive Management (RAM) algorithm based on the Roofline model, which dynamically allocates resources by accounting for the impact of device-specific performance bottlenecks on inference latency. Additionally, we design a Deep Reinforcement Learning (DRL)-based online optimization algorithm that dynamically adjusts DNN deployment strategies to achieve fast and energy-efficient inference across heterogeneous devices. Experiment results demonstrate that R3D is applicable in MEC environments and performs well in terms of inference latency, resource utilization, and energy consumption.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3938-3952"},"PeriodicalIF":3.8,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sorting operations are a significant part of any computer system and are widely used in many applications. In applications where sorting has to be accomplished efficiently (i.e., in $O(1)$ time) on small-sized entries, hardware accelerators such as ASICs, FPGAs, or GPUs are used to speed up the sorting operations. In the literature, the bitonic sort algorithm (or variants thereof) has for decades been the most commonly used approach in hardware sort implementations. However, the time complexity of the bitonic sort is $O((\log n)^{2})$ for sorting $n$ elements, which does not satisfy the constant-time constraint we demand for our setting. In this paper, we propose competition-style sorting networks (CSNs), a framework for designing a hardware-based, competition-style class of sorting networks that captures all forms of two-stage sorting networks in which the first stage (competition) consists of pairwise comparisons and the second stage (evaluation) ranks the entries and sorts them. To illustrate the utility of this framework, we develop and test one instance of this design, called the Competition Sort Algorithm (CSA), which has a time complexity of $O(1)$, and specifically, one clock cycle. We implemented and tested CSA on both an Intel Cyclone V FPGA and an NVIDIA Quadro T1000 GPU and then measured its gain, which combines the trade-offs between the relative speedup and the relative area increase, against the bitonic sort. Our results show that the CSA achieves a significant gain of up to $11.01\times$ on the FPGA and a relative speedup of up to $3.32\times$ on the GPU. We also compare the area and latency of CSA with the bitonic sort algorithm on the FPGA.
{"title":"Competition-Style Sorting Networks (CSN): A Framework for Hardware-Based Sorting Operations","authors":"Abbas A. Fairouz;Jassim M. Aljuraidan;Ameer Mohammed","doi":"10.1109/TC.2025.3605766","DOIUrl":"https://doi.org/10.1109/TC.2025.3605766","url":null,"abstract":"Sorting operations are considered to be a significant part of any computer system and are widely used in many applications. In applications where sorting has to be efficiently accomplished (i.e., in <inline-formula><tex-math>$O(1)$</tex-math></inline-formula> time) on small-sized entries, hardware accelerators, such as ASICs, FPGAs, or GPUs, are used to speed up the sorting operations. In the literature, the bitonic sort algorithm (or variants thereof) is still considered to be the most commonly used approach in many hardware sort implementations for decades. However, the time complexity of the bitonic sort is <inline-formula><tex-math>$O((log(n))^{2})$</tex-math></inline-formula> for sorting <inline-formula><tex-math>$n$</tex-math></inline-formula> elements, which does not satisfy the constant-time constraint we demand for our setting. In this paper, we propose <i>competition-style sorting networks</i> (CSNs), a framework for designing hardware-based competition-style class of sorting networks that captures all forms of two-stage sorting networks where the first stage (competition) consists of pairwise comparisons and the second stage (evaluation) ranks the entries and sorts them. To illustrate the utility of this framework, we develop and test one instance of this design, called the Competition Sort Algorithm (CSA), which has a time complexity of <inline-formula><tex-math>$O(1)$</tex-math></inline-formula>, and specifically, one clock cycle. We implemented and tested CSA on both an Intel Cyclone V FPGA and the NVIDIA Quadro T1000 GPU then measured its <i>gain</i>, which combines the trade-offs between the relative speedup and the relative area increase, against the bitonic sort. Our results show that the CSA achieves a significant gain of up to <inline-formula><tex-math>$11.01times$</tex-math></inline-formula> on the FPGA and a relative speedup of up to <inline-formula><tex-math>$3.32times$</tex-math></inline-formula> on the GPU. We also compare the area and latency of CSA with the bitonic sort algorithm on the FPGA.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4109-4122"},"PeriodicalIF":3.8,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the emergence of fast NVMe SSDs, key-value stores are becoming more CPU-efficient in order to reap their bandwidth. However, current CPU-optimized key-value stores adopt suboptimal intra- and inter-thread models, hence incurring memory-level stalling and load imbalance that hinder cores from realizing their full potential. We present StageWise, a CPU-efficient key-value store for fast NVMe SSDs with high throughput. To achieve this, we introduce a new thread model for StageWise to process KV requests. Specifically, StageWise converts the processing of each KV request into multiple asynchronous stages, and thus enables pipelining across all stages. StageWise further introduces a client-driven share-index architecture to ease inter-thread load imbalance and maximize the pipelining opportunity. Guided by Little’s Law, StageWise improves concurrency, and therefore uses the CPU efficiently to reach higher throughput. Extensive experimental results show that StageWise outperforms CPU-optimized key-value stores (e.g., KVell) by up to $3.5\times$ with write-intensive workloads, and storage-optimized ones (e.g., RocksDB) by over an order of magnitude. StageWise also shows higher read performance and excellent scalability under various workloads.
{"title":"StageWise: Accelerating Persistent Key-Value Stores by Thread Model Redesigning","authors":"Zeqi Li;Youmin Chen;Qing Wang;Youyou Lu;Jiwu Shu","doi":"10.1109/TC.2025.3605763","DOIUrl":"https://doi.org/10.1109/TC.2025.3605763","url":null,"abstract":"With the emergence of fast NVMe SSDs, key-value stores are becoming more CPU-efficient in order to reap their bandwidth. However, current CPU-optimized key-value stores adopt suboptimal intra- and inter-thread models, hence incurring memory-level stalling and load imbalance that hinder cores from realizing their full potential. We present <small>StageWise</small>, an CPU-efficient key-value store on fast NVMe SSDs with high throughput. To achieve this, we introduce a new thread model for <small>StageWise</small> to process KV requests. Specifically, <small>StageWise</small> converts the processing of each KV request into multiple asynchronous stages, and thus enables pipelining across all stages. <small>StageWise</small> further introduces a client-driven share-index architecture to ease inter-thread load imbalance and maximize the pipelining opportunity. Guided by Little’s Law, <small>StageWise</small> improves concurrency, and therefore efficiently uses CPU to reach higher throughput. Extensive experimental results show that <small>StageWise</small> outperforms CPU-optimized key-value stores (e.g., KVell) by up to 3.5<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula> with write-intensive workloads, and storage-optimized ones (e.g., RocksDB) by over an order of magnitude. <small>StageWise</small> also shows higher read performance and excellent scalability under various workloads.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4080-4093"},"PeriodicalIF":3.8,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The impressive performance of transformer models has sparked the deployment of intelligent applications on resource-constrained edge devices. However, ensuring high-quality service for real-time edge systems is a significant challenge due to the considerable computational demands and resource requirements of these models. Existing strategies typically either offload transformer computations to other devices or directly deploy compressed models on individual edge devices. These strategies, however, result in either considerable communication overhead or suboptimal trade-offs between accuracy and efficiency. To tackle these challenges, we propose a collaborative inference system for general transformer models, termed CoFormer. The central idea behind CoFormer is to exploit the divisibility and integrability of transformers. An off-the-shelf large transformer can be decomposed into multiple smaller models for distributed inference, and their intermediate results are aggregated to generate the final output. We formulate an optimization problem to minimize both inference latency and accuracy degradation under heterogeneous hardware constraints. The DeBo algorithm is proposed to first solve the optimization problem to derive the decomposition policy, and then progressively calibrate the decomposed models to restore performance. We demonstrate the capability to support a wide range of transformer models on heterogeneous edge devices, achieving up to $3.1\times$ inference speedup with large transformer models. Notably, CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%. CoFormer can also reduce energy consumption by approximately 40% while maintaining satisfactory inference performance.
{"title":"CoFormer: Collaborating With Heterogeneous Edge Devices for Scalable Transformer Inference","authors":"Guanyu Xu;Zhiwei Hao;Li Shen;Yong Luo;Fuhui Sun;Xiaoyan Wang;Han Hu;Yonggang Wen","doi":"10.1109/TC.2025.3604473","DOIUrl":"https://doi.org/10.1109/TC.2025.3604473","url":null,"abstract":"The impressive performance of transformer models has sparked the deployment of intelligent applications on resource-constrained edge devices. However, ensuring high-quality service for real-time edge systems is a significant challenge due to the considerable computational demands and resource requirements of these models. Existing strategies typically either offload transformer computations to other devices or directly deploy compressed models on individual edge devices. These strategies, however, result in either considerable communication overhead or suboptimal trade-offs between accuracy and efficiency. To tackle these challenges, we propose a collaborative inference system for general transformer models, termed <b>CoFormer</b>. The central idea behind CoFormer is to exploit the divisibility and integrability of transformer. An off-the-shelf large transformer can be decomposed into multiple smaller models for distributed inference, and their intermediate results are aggregated to generate the final output. We formulate an optimization problem to minimize both inference latency and accuracy degradation under heterogeneous hardware constraints. DeBo algorithm is proposed to first solve the optimization problem to derive the decomposition policy, and then progressively calibrate decomposed models to restore performance. We demonstrate the capability to support a wide range of transformer models on heterogeneous edge devices, achieving up to 3.1<inline-formula><tex-math>$times$</tex-math></inline-formula> inference speedup with large transformer models. Notably, CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%. CoFormer can also reduce energy consumption by approximately 40% while maintaining satisfactory inference performance.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4010-4024"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Blockchain strengthens reliable collaboration among entities through its transparency, immutability, and traceability, leading to its integration into Multi-access Edge Computing (MEC) and promoting the development of a trusted JointCloud. However, existing transaction propagation mechanisms require MEC devices to consume significant computing resources for complex transaction verification, increasing their vulnerability to malicious attacks. Adversaries can exploit this by flooding the blockchain network with spam transactions, aiming to deplete device energy and disrupt system performance. To cope with these issues, this paper proposes a reputation-based energy-efficient transaction propagation mechanism that mitigates spam transaction attacks while reducing computing resource usage and energy consumption. First, we design a subjective logic-based reputation scheme that assesses node trust by integrating local and recommended opinions and incorporates opinion acceptance to counteract false evidence. Then, we optimize the transaction verification method by adjusting transaction discard and verification probabilities based on the proposed reputation scheme to curb the propagation of spam transactions and reduce verification costs. Finally, we enhance the transaction transmission strategy by prioritizing nodes with higher reputations, improving both resilience to spam transactions and transmission reliability. A series of simulations demonstrates the effectiveness of the proposed mechanism.
{"title":"A Reputation-Based Energy-Efficient Transaction Propagation Mechanism for Blockchain-Enabled Multi-Access Edge Computing","authors":"Xijia Lu;Qiang He;Xingwei Wang;Jaime Lloret;Peichen Li;Ying Qian;Min Huang","doi":"10.1109/TC.2025.3604480","DOIUrl":"https://doi.org/10.1109/TC.2025.3604480","url":null,"abstract":"Blockchain strengthens reliable collaboration among entities through its transparency, immutability, and traceability, leading to its integration into Multi-access Edge Computing (MEC) and promoting the development of a trusted JointCloud. However, existing transaction propagation mechanisms require MEC devices to consume significant computing resources for complex transaction verification, increasing their vulnerability to malicious attacks. Adversaries can exploit this by flooding the blockchain network with spam transactions, aiming to deplete device energy and disrupt system performance. To cope with these issues, this paper proposes a reputation-based energy-efficient transaction propagation mechanism that alleviates spam transaction attacks while reducing computing resources and energy consumption. Firstly, we design a subjective logic-based reputation scheme that assesses node trust by integrating local and recommended opinions and incorporates opinion acceptance to counteract false evidence. Then, we optimize the transaction verification method by adjusting transaction discard and verification probabilities based on the proposed reputation scheme to curb the propagation of spam transactions and reduce verification consumption. Finally, we enhance the transaction transmission strategy by prioritizing nodes with higher reputations, enhancing both resilience to spam transactions and transmission reliability. A series of simulations demonstrates the effectiveness of the proposed mechanism.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3897-3910"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uncrewed aerial vehicles (UAVs) are widely used for edge computing in scenarios with poor infrastructure due to their deployment flexibility and mobility. In UAV-assisted edge computing systems, multiple UAVs can cooperate with the cloud to provide superior computing capability for diverse innovative services. However, many service-related computational tasks may fail due to the unreliability of UAVs and wireless transmission channels. Diverse solutions have been proposed, but most of them employ time-driven strategies that introduce unwanted decision-waiting delays. To address this problem, this paper focuses on a task-driven, reliability-aware cooperative offloading problem in UAV-assisted edge-enhanced networks. The issue is formulated as an optimization problem that jointly optimizes UAV trajectories, offloading decisions, and transmission power, aiming to maximize the long-term average task success rate. Considering the discrete-continuous hybrid action space of the problem, a dependence-aware latent-space representation algorithm is proposed to represent discrete-continuous hybrid actions. Furthermore, we design a novel deep reinforcement learning scheme by combining the representation algorithm with a twin delayed deep deterministic policy gradient algorithm. We compared our proposed algorithm with four alternative solutions via simulations and a realistic Kubernetes testbed-based setup. The test results show that our scheme outperforms the other methods, ensuring significant improvements in terms of task success rate.
{"title":"Reliability-Aware Optimization of Task Offloading for UAV-Assisted Edge Computing","authors":"Hao Hao;Changqiao Xu;Wei Zhang;Xingyan Chen;Shujie Yang;Gabriel-Miro Muntean","doi":"10.1109/TC.2025.3604463","DOIUrl":"https://doi.org/10.1109/TC.2025.3604463","url":null,"abstract":"Uncrewed aerial vehicles (UAV) are widely used for edge computing in poor infrastructure scenarios due to their deployment flexibility and mobility. In UAV-assisted edge computing systems, multiple UAVs can cooperate with the cloud to provide superior computing capability for diverse innovative services. However, many service-related computational tasks may fail due to the unreliability of UAVs and wireless transmission channels. Diverse solutions were proposed, but most of them employ time-driven strategies which introduce unwanted decision waiting delays. To address this problem, this paper focuses on a task-driven reliability-aware cooperative offloading problem in UAV-assisted edge-enhanced networks. The issue is formulated as an optimization problem which jointly optimizes UAV trajectories, offloading decisions, and transmission power, aiming to maximize the long-term average task success rate. Considering the discrete-continuous hybrid action space of the problem, a dependence-aware latent-space representation algorithm is proposed to represent discrete-continuous hybrid actions. Furthermore, we design a novel deep reinforcement learning scheme by combining the representation algorithm and a twin delayed deep deterministic policy gradient algorithm. We compared our proposed algorithm with four alternative solutions via simulations and a realistic Kubernetes testbed-based setup. The test results show how our scheme outperforms the other methods, ensuring significant improvements in terms of task success rate.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 11","pages":"3832-3844"},"PeriodicalIF":3.8,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11146794","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145248090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}