Yuxuan Qin;Chuxiong Lin;Guoming Rao;Ling Yang;Weiguang Sheng;Weifeng He
DRAM latency remains a critical bottleneck in the performance of modern computing systems. However, the latency is excessively conservative due to the timing margins imposed by DRAM vendors to accommodate rare worst-case scenarios, such as weak cells and high temperatures. In this study, we introduce a temperature- and process-variation-aware timing detection and adaptation DRAM (TPDA-DRAM) architecture that dynamically mitigates timing margins at runtime. TPDA-DRAM leverages innovative in-situ cross-coupled detectors to monitor voltage differences between bitline pairs inside DRAM arrays, ensuring precise detection of timing margins. Additionally, the proposed detector inherently accelerates the precharge operation of DRAM, thereby reducing the precharge latency by up to 62.5%. Building upon this architecture, we propose two variation-aware timing adaptation schemes: 1) a process-variation-aware adaptation (PVA) scheme that accelerates access to weak cells, mitigating process-induced timing margins, and 2) a temperature-variation-aware adaptation (TVA) scheme that leverages temperature information and the restoration truncation technique to reduce DRAM latency, mitigating temperature-induced timing margins. Evaluations on an eight-core computing system show that TPDA-DRAM improves average performance by 21.8% and energy efficiency by 18.2%.
{"title":"TPDA-DRAM: A Variation-Aware DRAM Improving System Performance via In-Situ Timing Margin Detection and Adaptive Mitigation","authors":"Yuxuan Qin;Chuxiong Lin;Guoming Rao;Ling Yang;Weiguang Sheng;Weifeng He","doi":"10.1109/TC.2025.3627945","DOIUrl":"https://doi.org/10.1109/TC.2025.3627945","url":null,"abstract":"DRAM latency remains a critical bottleneck in the performance of modern computing systems. However, the latency is excessively conservative due to the timing margins imposed by DRAM vendors to accommodate rare worst-case scenarios, such as weak cells and high temperatures. In this study, we introduce a temperature- and process-variation-aware timing detection and adaptation DRAM (TPDA-DRAM) architecture that dynamically mitigates timing margins at runtime. TPDA-DRAM leverages innovative <italic>in-situ</i> cross-coupled detectors to monitor voltage differences between bitline pairs inside DRAM arrays, ensuring precise detection of timing margins. Additionally, the proposed detector inherently accelerates the precharge operation of DRAM, thereby reducing the precharge latency by up to 62.5%. Building upon this architecture, we propose two variation-aware timing adaptation schemes: 1) a process-variation-aware adaptation (PVA) scheme that accelerates access to weak cells, mitigating process-induced timing margins, and 2) a temperature-variation-aware adaptation (TVA) scheme that leverages temperature information and the restoration truncation technique to reduce DRAM latency, mitigating temperature-induced timing margins. Evaluations on an eight-core computing system show that TPDA-DRAM improves average performance by 21.8% and energy efficiency by 18.2%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 1","pages":"350-364"},"PeriodicalIF":3.8,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compute-in-memory (CIM) has emerged as a pivotal direction for accelerating workloads in the field of machine learning, such as Deep Neural Networks (DNNs). However, the effective exploitation of sparsity in CIM systems presents numerous challenges due to the inherent limitations of their rigid array structures. Designing sparse DNN dataflows and developing efficient mapping strategies also become more complex when accounting for diverse sparsity patterns and the flexibility of a multi-macro CIM structure. Despite these complexities, there is still no unified, systematic view or modeling approach for diverse sparse DNN workloads in CIM systems. In this paper, we propose CIMinus, a framework dedicated to cost modeling for sparse DNN workloads on CIM architectures. It provides an in-depth energy consumption analysis at the level of individual components and an assessment of the overall workload latency. We validate CIMinus against contemporary CIM architectures and demonstrate its applicability in two use cases. These cases provide valuable insights into both the impact of sparsity patterns and the effectiveness of mapping strategies, bridging the gap between theoretical design and practical implementation.
{"title":"CIMinus: Empowering Sparse DNN Workloads Modeling and Exploration on SRAM-Based CIM Architectures","authors":"Yingjie Qi;Jianlei Yang;Rubing Yang;Cenlin Duan;Xiaolin He;Ziyan He;Weitao Pan;Weisheng Zhao","doi":"10.1109/TC.2025.3628114","DOIUrl":"https://doi.org/10.1109/TC.2025.3628114","url":null,"abstract":"Compute-in-memory (CIM) has emerged as a pivotal direction for accelerating workloads in the field of machine learning, such as Deep Neural Networks (DNNs). However, the effective exploitation of sparsity in CIM systems presents numerous challenges, due to the inherent limitations in their rigid array structures. Designing sparse DNN dataflows and developing efficient mapping strategies also become more complex when accounting for diverse sparsity patterns and the flexibility of a multi-macro CIM structure. Despite these complexities, there is still an absence of a unified systematic view and modeling approach for diverse sparse DNN workloads in CIM systems. In this paper, we propose CIMinus, a framework dedicated to cost modeling for sparse DNN workloads on CIM architectures. It provides an in-depth energy consumption analysis at the level of individual components and an assessment of the overall workload latency. We validate CIMinus against contemporary CIM architectures and demonstrate its applicability in two use-cases. These cases provide valuable insights into both the impact of sparsity patterns and the effectiveness of mapping strategies, bridging the gap between theoretical design and practical implementation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 1","pages":"380-394"},"PeriodicalIF":3.8,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145712545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtualization environments (e.g., containers and hypervisors) achieve isolation of multiple runtime entities but result in two mutually isolated layers: the guest and the host. Such cross-layer isolation can cause high latency and low system throughput. Previous aware scheduling and double scheduling approaches fail to achieve bidirectional coordination between the guest and host layers. To address this challenge, we develop DCS3, a Dual-layer Co-aware Scheduler that combines stealing balance and synchronized priority. Stealing balance migrates tasks between virtual CPU (vCPU) queues for load balance based on the workloads of physical CPUs (pCPUs). Synchronized priority dynamically adjusts the priorities of threads running on the pCPUs according to the current vCPU workloads. The vCPUs and pCPUs belong to the guest and host layers, respectively. Compared with aware scheduling, double scheduling, and DCS2 (i.e., DCS3 without synchronized priority), DCS3 has the following clear advantages: 1) Requests Per Second (RPS) increases by up to 52%, 55%, and 2%, respectively; 2) request latency decreases by up to 72%, 71%, and 20%, respectively.
{"title":"DCS3: A Dual-Layer Co-Aware Scheduler With Stealing Balance and Synchronized Priority in Virtualization Environments","authors":"Chenglai Xiong;Junjie Wen;Guoqi Xie;Zhongjia Wang;Zhenli He;Shaowen Yao;Jianfeng Tan;Tianyu Zhou;Tiwei Bie;Yan Yan;Shoumeng Yan","doi":"10.1109/TC.2025.3628012","DOIUrl":"https://doi.org/10.1109/TC.2025.3628012","url":null,"abstract":"Virtualization environments (e.g., containers and hypervisors) achieve isolation of multiple runtime entities but result in two mutually isolated guest and host layers. Such cross-layer isolation could cause high latency and low throughput of the system. Previous aware scheduling and double scheduling fail to achieve bidirectional coordination between the guest and host layers. To address this challenge, we develop DCS3, a Dual-layer Co-aware Scheduler that combines stealing balance and synchronized priority. Stealing balancing migrates tasks between virtual CPU (vCPU) queues for load balance based on the workloads of physical CPUs (pCPUs). Synchronized priority dynamically adjusts the thread priorities running on the pCPUs according to the current vCPU workloads. The vCPUs and pCPUs belong to the guest and host layers, respectively. Compared with aware scheduling, double scheduling, and DCS2 (i.e., DCS3 without synchronized priority), DCS3 has the following obvious advantages: 1) Requests Per Second (RPS) increases by up to 52%, 55%, and 2%, respectively; 2) request latency decreases by up to 72%, 71%, and 20%, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 1","pages":"365-379"},"PeriodicalIF":3.8,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145712544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuhui Cai;Guowen Gong;Zhirong Shen;Jiahui Yang;Jiwu Shu
Erasure coding has been extensively deployed in today's commodity HPC systems to guard against unexpected failures. To adapt to varying access characteristics and reliability demands, storage clusters have to perform redundancy transitioning by tuning the coding parameters, which unfortunately gives rise to substantial transitioning traffic. We present ElasticEC, a fast and elastic redundancy transitioning approach for erasure-coded clusters. ElasticEC first minimizes the transitioning traffic through a relocation-aware stripe reorganization mechanism and a collecting-and-encoding algorithm. It further heuristically balances the transitioning traffic across nodes. We implement ElasticEC in Hadoop HDFS and conduct extensive experiments on a real-world cloud storage cluster, showing that ElasticEC can reduce the transitioning traffic by 71.1%-92.6% and shorten the transitioning time by 65.9%-90.7%.
{"title":"ElasticEC: Achieving Fast and Elastic Redundancy Transitioning in Erasure-Coded Clusters","authors":"Yuhui Cai;Guowen Gong;Zhirong Shen;Jiahui Yang;Jiwu Shu","doi":"10.1109/TC.2025.3614839","DOIUrl":"https://doi.org/10.1109/TC.2025.3614839","url":null,"abstract":"Erasure coding has been extensively deployed in today’s commodity HPC systems against unexpected failures. To adapt to the varying access characteristics and reliability demands, storage clusters have to perform redundancy transitioning via tuning the coding parameters, which unfortunately gives rise to substantial transitioning traffic. We present <inline-formula><tex-math>$textsf{ElasticEC}$</tex-math></inline-formula>, a fast and elastic redundancy transitioning approach for erasure-coded clusters. <inline-formula><tex-math>$textsf{ElasticEC}$</tex-math></inline-formula> first minimizes the transitioning traffic via proposing a relocation-aware stripe reorganization mechanism and a collecting-and-encoding algorithm. It further heuristically balances the transitioning traffic across nodes. We implement <inline-formula><tex-math>$textsf{ElasticEC}$</tex-math></inline-formula> in Hadoop HDFS and conduct extensive experiments on a real-world cloud storage cluster, showing that <inline-formula><tex-math>$textsf{ElasticEC}$</tex-math></inline-formula> can reduce 71.1-92.6% of the transitioning traffic and shorten 65.9-90.7% of the transitioning time.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4168-4181"},"PeriodicalIF":3.8,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph partitioning, a classic NP-complete problem, is the most fundamental procedure that must be performed before parallel graph computations. Partitioners can be divided into vertex-based and edge-based approaches. Recently, both approaches have adopted streaming heuristics to find approximate solutions. Streaming partitioning is lightweight in space and time complexity, but suffers from suboptimal partitioning quality, especially for directed graphs, where the explicit knowledge available to the heuristic is limited. This paper therefore proposes new heuristics for both vertex-based and edge-based partitioning. They improve quality by additionally utilizing implicit knowledge embedded in the local streaming view and the global graph view. Memory reduction techniques are presented to extract this knowledge at negligible space cost, preserving the lightweight advantages of streaming partitioning. Besides, we study parallel acceleration and restreaming to further boost partitioning efficiency and quality. Extensive experiments validate that our proposals outperform the state-of-the-art competitors.
{"title":"Lightweight Graph Partitioning Enhanced by Implicit Knowledge","authors":"Zhigang Wang;Gongtai Sun;Ning Wang;Lixin Gao;Chuanfei Xu;Yu Gu;Ge Yu;Zhihong Tian","doi":"10.1109/TC.2025.3612730","DOIUrl":"https://doi.org/10.1109/TC.2025.3612730","url":null,"abstract":"Graph partitioning as a classic NP-complete problem, is the most fundamental procedure that needs to be performed before parallel computations. Partitioners can be divided into vertex- and edge-based approaches. Recently, both approaches are employing a streaming heuristic to find approximate solutions. It is lightweight in space and time complexities, but suffers from suboptimal partitioning quality, especially for directed graphs where the explicit knowledge provided for heuristic is limited. This paper thereby proposes new heuristics for not only vertex-based but also edge-based partitioning. They improve quality by additionally utilizing implicit knowledge, which is embedded in the local streaming view and the global graph view. Memory reduction techniques are presented to extract this knowledge with negligible space costs. That preserves the lightweight advantages of streaming partitioning. Besides, we study parallel acceleration and restreaming, to further boost the partitioning efficiency and quality. Extensive experiments validate that our proposals outperform the state-of-the-art competitors.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4153-4167"},"PeriodicalIF":3.8,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To achieve seamless portability across the embedded computing continuum, we introduce a new kernel architecture: fluid kernels. Fluid kernels can be thought of as the intersection between embedded unikernels and general-purpose monolithic kernels, allowing applications to be developed seamlessly in both kernel space and user space in a unified way. This scalable kernel architecture can manage the trade-off between performance, code size, isolation, and security. We compare our fluid kernel implementation, Miosix, to Linux and FreeRTOS on the same hardware with standard benchmarks. Compared to Linux, we achieve an average speedup of 3.5× and a maximum of up to 15.4×. We also achieve an average code size reduction of 84% and a maximum of up to 90%. By moving application code from user space to kernel space, an additional code size reduction of up to 56% and a speedup of up to 1.3× can be achieved. Compared to FreeRTOS, using Miosix costs only a moderate amount of code size (at most 47 KB) in exchange for significant gains in application performance, with speedups averaging 1.5× and reaching up to 5×.
{"title":"Fluid Kernels: Seamlessly Conquering the Embedded Computing Continuum","authors":"Federico Terraneo;Daniele Cattaneo","doi":"10.1109/TC.2025.3605745","DOIUrl":"https://doi.org/10.1109/TC.2025.3605745","url":null,"abstract":"To achieve seamless portability across the embedded computing continuum, we introduce a new kernel architecture: fluid kernels. Fluid kernels can be thought of as the intersection between embedded unikernels and general purpose monolithic kernels, allowing to seamlessly develop applications both in kernel space and user space in a unified way. This scalable kernel architecture can manage the trade-off between performance, code size, isolation and security. We compare our fluid kernel implementation, Miosix, to Linux and FreeRTOS on the same hardware with standard benchmarks. Compared to Linux, we achieve an average speedup of 3.5<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula> and a maximum of up to 15.4<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula>. We also achieve an average code size reduction of 84% and a maximum of up to 90%. By moving application code from user space to kernel space, an additional code size reduction up to 56% and a speedup up to 1.3<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula> can be achieved. Compared to FreeRTOS, the use of Miosix only costs a moderate amount of code size (at most 47KB) for significant advantages in application performance with speedups averaging at 1.5<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula> and up to 5<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula>.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4050-4064"},"PeriodicalIF":3.8,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11173649","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asynchronous common subset (ACS) is a powerful paradigm enabling applications such as Byzantine fault-tolerance (BFT) and multi-party computation (MPC). The most efficient ACS framework in the information-theoretic setting is due to Ben-Or, Kelmer, and Rabin (BKR, 1994). The BKR ACS protocol has been both theoretically and practically impactful. BKR ACS has an O(log n) running time (where n is the number of replicas) due to the usage of n parallel asynchronous binary agreement (ABA) instances, impacting both performance and scalability. Indeed, for a network of 16-64 replicas, the parallel ABA phase occupies about 95%-97% of the total runtime. A long-standing open problem is whether we can build an ACS framework with O(1) time while not increasing the message or communication complexity of the BKR protocol. We resolve the open problem, presenting the first constant-time ACS protocol with O(n^3) messages in the information-theoretic and signature-free settings. Our key ingredient is the first information-theoretic and constant-time multivalued validated Byzantine agreement (MVBA) protocol. Our results can improve, asymptotically and concretely, various applications using ACS and MVBA. As an example, we implement FIN, a BFT protocol instantiated using our framework. Via a 121-server deployment on Amazon EC2, we show FIN reduces the overhead of the ABA phase to as low as 1.23% of the total runtime.
{"title":"Practical Signature-Free Multivalued Validated Byzantine Agreement and Asynchronous Common Subset in Constant Time","authors":"Xin Wang;Xiao Sui;Sisi Duan;Haibin Zhang","doi":"10.1109/TC.2025.3607476","DOIUrl":"https://doi.org/10.1109/TC.2025.3607476","url":null,"abstract":"Asynchronous common subset (ACS) is a powerful paradigm enabling applications such as Byzantine fault-tolerance (BFT) and multi-party computation (MPC). The most efficient ACS framework in the information-theoretic setting is due to Ben-Or, Kelmer, and Rabin (BKR, 1994). The BKR ACS protocol has been both theoretically and practically impactful. BKR ACS has an <inline-formula><tex-math>$O(log n)$</tex-math></inline-formula> running time (where <inline-formula><tex-math>$n$</tex-math></inline-formula> is the number of replicas) due to the usage of <inline-formula><tex-math>$n$</tex-math></inline-formula> parallel asynchronous binary agreement (ABA) instances, impacting both performance and scalability. Indeed, for a network of 16<inline-formula><tex-math>$sim$</tex-math></inline-formula> 64 replicas, the parallel ABA phase occupies about 95%<inline-formula><tex-math>$sim$</tex-math></inline-formula> 97% of the total runtime. A long-standing open problem is whether we can build an ACS framework with <inline-formula><tex-math>$O(1)$</tex-math></inline-formula> time while not increasing the message or communication complexity of the BKR protocol. We resolve the open problem, presenting the first constant-time ACS protocol with <inline-formula><tex-math>$O(n^{3})$</tex-math></inline-formula> messages in the information-theoretic and signature-free settings. Our key ingredient is the first information-theoretic and constant-time multivalued validated Byzantine agreement (MVBA) protocol. Our results can improve—asymptotically and concretely—various applications using ACS and MVBA. As an example, we implement FIN, a BFT protocol instantiated using our framework. Via a 121-server deployment on Amazon EC2, we show FIN reduces the overhead of the ABA phase to as low as 1.23% of the total runtime.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4138-4152"},"PeriodicalIF":3.8,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Silvano Chiaradonna;Felicita Di Giandomenico;Giulio Masetti
To cope with unforeseen attacks on software systems in critical application domains, redundancy-based intrusion-tolerant system (ITS) schemes are among the popular countermeasures to deploy. Designing an adequate ITS for the stated security requirements calls for stochastic analysis support able to assess the impact of a variety of attack patterns on different ITS configurations. As a contribution to this purpose, a stochastic model for ITSs is proposed; its novel aspects are the ability to account both for camouflaging components and for correlation between the security failures affecting the diverse implementations of the software cyber protections adopted in the ITS. Extensive analyses are conducted to show the applicability of the model; the obtained results help in understanding the limits and strengths of selected ITS configurations when subjected to attacks occurring under conditions unfavorable to the defender.
{"title":"Stochastic Modeling of Intrusion Tolerant Systems Based on Redundancy and Diversity","authors":"Silvano Chiaradonna;Felicita Di Giandomenico;Giulio Masetti","doi":"10.1109/TC.2025.3606189","DOIUrl":"https://doi.org/10.1109/TC.2025.3606189","url":null,"abstract":"To cope with unforeseen attacks to software systems in critical application domains, redundancy-based ITSs schemes are among popular countermeasures to deploy. Designing the adequate ITS for the stated security requirements calls for stochastic analysis supports, able to assess the impact of variety of attack patterns on different ITS configurations. As contribution to this purpose, a stochastic model for ITS is proposed, whose novel aspects are the ability to account for both camouflaging components and for correlation aspects between the security failures affecting the diverse implementations of the software cyber protections adopted in the ITS. Extensive analyses are conducted to show the applicability of the model; the obtained results allow to understand the limits and strengths of selected ITS configurations when subject to attacks occurring in unfavorable conditions for the defender.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4123-4137"},"PeriodicalIF":3.8,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RowHammer poses a serious reliability challenge to modern DRAM systems. As technology scales down, DRAM resistance to RowHammer has decreased by 30× over the past decade, causing an increasing number of benign applications to suffer from this issue. However, existing defense mechanisms have three limitations: 1) they rely on inefficient mitigation techniques, such as time-consuming victim row refresh; 2) they do not reduce the number of effective RowHammer attacks, leading to frequent mitigations; and 3) they fail to recognize that frequently accessed data is not only a root cause of RowHammer but also presents an opportunity for performance optimization. In this paper, we observe that frequently accessed hot data plays a distinct role in security and efficiency: it can induce RowHammer by interfering with adjacent cold data, while also being performance-critical due to its frequent accesses. To this end, we propose Data Isolation via In-DRAM Cache (DIVIDE), a novel defense mechanism that leverages in-DRAM cache to isolate and exploit hot data. DIVIDE offers three key benefits: 1) It reduces the number of effective RowHammer attacks, as hot data in the cache cannot interfere with each other. 2) It provides a simple yet effective mitigation measure by isolating hot data from cold data. 3) It caches frequently accessed hot data, improving average access latency. DIVIDE employs a two-level protection structure: the first level mitigates RowHammer in cache arrays with high efficiency, while the second level addresses the remaining threats in normal arrays to ensure complete protection. Owing to the high in-DRAM cache hit rate, DIVIDE efficiently mitigates RowHammer while preserving both the performance and energy efficiency of the in-DRAM cache. At a RowHammer threshold of 128, DIVIDE with probabilistic mitigation achieves an average performance improvement of 19.6% and energy savings of 20.4% over DDR4 DRAM for four-core workloads. Compared to an unprotected in-DRAM cache DRAM, DIVIDE incurs only a 2.1% performance overhead while requiring just a modest 1 KB per-channel CAM in the memory controller, with no modification to the DRAM chip.
{"title":"DIVIDE: Efficient RowHammer Defense via In-DRAM Cache-Based Hot Data Isolation","authors":"Haitao Du;Yuxuan Yang;Song Chen;Yi Kang","doi":"10.1109/TC.2025.3603729","DOIUrl":"https://doi.org/10.1109/TC.2025.3603729","url":null,"abstract":"RowHammer poses a serious reliability challenge to modern DRAM systems. As technology scales down, DRAM resistance to RowHammer has decreased by <inline-formula><tex-math>$30times$</tex-math></inline-formula> over the past decade, causing an increasing number of benign applications to suffer from this issue. However, existing defense mechanisms have three limitations: 1) they rely on inefficient mitigation techniques, such as time-consuming victim row refresh; 2) they do not reduce the number of effective RowHammer attacks, leading to frequent mitigations; and 3) they fail to recognize that frequently accessed data is not only a root cause of RowHammer but also presents an opportunity for performance optimization. In this paper, we observe that frequently accessed hot data plays a distinct role in security and efficiency: it can induce RowHammer by interfering with adjacent cold data, while also being performance-critical due to its frequent accesses. To this end, we propose Data Isolation via In-DRAM Cache (DIVIDE), a novel defense mechanism that leverages in-DRAM cache to isolate and exploit hot data. DIVIDE offers three key benefits: 1) It reduces the number of effective RowHammer attacks, as hot data in the cache cannot interfere with each other. 2) It provides a simple yet effective mitigation measure by isolating hot data from cold data. 3) It caches frequently accessed hot data, improving average access latency. DIVIDE employs a two-level protection structure: the first level mitigates RowHammer in cache arrays with high efficiency, while the second level addresses the remaining threats in normal arrays to ensure complete protection. Owing to the high in-DRAM cache hit rate, DIVIDE efficiently mitigates RowHammer while preserving both the performance and energy efficiency of the in-DRAM cache. At a RowHammer threshold of 128, DIVIDE with probabilistic mitigation achieves an average performance improvement of 19.6% and energy savings of 20.4% over DDR4 DRAM for four-core workloads. Compared to an unprotected in-DRAM cache DRAM, DIVIDE incurs only a 2.1% performance overhead while requiring just a modest 1KB per-channel CAM in the memory controller, with no modification to the DRAM chip.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"3980-3994"},"PeriodicalIF":3.8,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As scientific and engineering challenges grow in complexity and scale, the demand for effective solutions for sparse matrix computations becomes increasingly critical. LU decomposition, known for its ability to reduce computational load and enhance numerical stability, serves as a promising approach. This study focuses on accelerating sparse LU decomposition for circuit simulations, addressing the prolonged simulation times caused by large circuit matrices. We present a novel Operation-based Optimized LU (OOLU) decomposition architecture that significantly improves circuit analysis efficiency. OOLU employs a VLIW-like processing element array and incorporates a scheduler that decomposes computations into a fine-grained operational task flow graph, maximizing inter-operation parallelism. Specialized scheduling and data mapping strategies are applied to align with the adaptable pipelined framework and the characteristics of circuit matrices. The OOLU architecture is prototyped on an FPGA and validated through extensive tests on the University of Florida sparse matrix collection, benchmarked against multiple platforms. The accelerator achieves speedups ranging from 3.48× to 32.25× (average 12.51×) over the KLU software package. It also delivers average speedups of 2.64× over a prior FPGA accelerator and 25.18× and 32.27× over the GPU accelerators STRUMPACK and SFLU, respectively, highlighting the substantial efficiency gains our approach delivers.
{"title":"OOLU: An Operation-Based Optimized Sparse LU Decomposition Accelerator for Circuit Simulation","authors":"Ke Hu;Fan Yang","doi":"10.1109/TC.2025.3605751","DOIUrl":"https://doi.org/10.1109/TC.2025.3605751","url":null,"abstract":"As scientific and engineering challenges grow in complexity and scale, the demand for effective solutions for sparse matrix computations becomes increasingly critical. LU decomposition, known for its ability to reduce computational load and enhance numerical stability, serves as a promising approach. This study focuses on accelerating sparse LU decomposition for circuit simulations, addressing the prolonged simulation times caused by large circuit matrices. We present a novel Operation-based Optimized LU (OOLU) decomposition architecture that significantly improves circuit analysis efficiency. OOLU employs a VLIW-like processing element array and incorporates a scheduler that decomposes computations into a fine-grained operational task flow graph, maximizing inter-operation parallelism. Specialized scheduling and data mapping strategies are applied to align with the adaptable pipelined framework and the characteristics of circuit matrices. The OOLU architecture is prototyped on an FPGA and validated through extensive tests on the University of Florida sparse matrix collection, benchmarked against multiple platforms. The accelerator achieves speedups ranging from <inline-formula><tex-math>$3.48times$</tex-math></inline-formula> to <inline-formula><tex-math>$32.25times$</tex-math></inline-formula> (average <inline-formula><tex-math>$12.51times$</tex-math></inline-formula>) over the KLU software package. It also delivers average speedups of <inline-formula><tex-math>$2.64times$</tex-math></inline-formula> over a prior FPGA accelerator and <inline-formula><tex-math>$25.18times$</tex-math></inline-formula> and <inline-formula><tex-math>$32.27times$</tex-math></inline-formula> over the GPU accelerators STRUMPACK and SFLU, respectively, highlighting the substantial efficiency gains our approach delivers.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4065-4079"},"PeriodicalIF":3.8,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}