The practical applications of quantum computing are currently limited by the small number of available qubits. Recent advances in quantum hardware have introduced mid-circuit measurements and resets, enabling the reuse of measured qubits and thus reducing the qubit requirements for executing quantum algorithms. In this work, we present a systematic study of dynamic quantum circuit compilation, a process that transforms static quantum circuits into their dynamic equivalents with fewer qubits through qubit reuse. We establish the first graph-based framework for optimizing qubit-reuse compilation. In particular, we characterize the task of finding the optimal compilation strategy for maximizing qubit reuse using binary integer programming and provide efficient heuristic algorithms for devising general compilation strategies. We conduct a thorough analysis of quantum circuits with practical relevance and offer their optimal qubit-reuse compilation strategies. We also perform a comparative analysis against state-of-the-art approaches, demonstrating the superior performance of our methods in both structured and random quantum circuits. Our framework lays a rigorous foundation for understanding dynamic quantum circuit compilation via qubit reuse, holding significant promise for the practical implementation of large-scale quantum algorithms on quantum computers with limited resources.
{"title":"Dynamic Quantum Circuit Compilation","authors":"Kun Fang;Munan Zhang;Ruqi Shi;Yinan Li","doi":"10.1109/TC.2025.3643826","DOIUrl":"https://doi.org/10.1109/TC.2025.3643826","url":null,"abstract":"The practical applications of quantum computing are currently limited by the small number of available qubits. Recent advances in quantum hardware have introduced mid-circuit measurements and resets, enabling the reuse of measured qubits and thus reducing the qubit requirements for executing quantum algorithms. In this work, we present a systematic study of dynamic quantum circuit compilation, a process that transforms static quantum circuits into their dynamic equivalents with fewer qubits through qubit reuse. We establish the first graph-based framework for optimizing qubit-reuse compilation. In particular, we characterize the task of finding the optimal compilation strategy for maximizing qubit reuse using binary integer programming and provide efficient heuristic algorithms for devising general compilation strategies. We conduct a thorough analysis of quantum circuits with practical relevance and offer their optimal qubit-reuse compilation strategies. We also perform a comparative analysis against state-of-the-art approaches, demonstrating the superior performance of our methods in both structured and random quantum circuits. 
Our framework lays a rigorous foundation for understanding dynamic quantum circuit compilation via qubit reuse, holding significant promise for the practical implementation of large-scale quantum algorithms on quantum computers with limited resources.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 2","pages":"748-759"},"PeriodicalIF":3.8,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145963443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
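The core reuse condition behind such compilation can be illustrated with a minimal sketch. This is not the paper's graph-based or binary-integer-programming formulation; it is a simplified, order-dependent greedy, assuming gates are given in one fixed topological order and each gate is a tuple of the qubit labels it acts on.

```python
# Hypothetical greedy qubit-reuse sketch (illustrative only, not the
# paper's optimal compilation strategy).
# A qubit q can be measured, reset, and reused as qubit r if every gate
# on q precedes every gate on r in the given gate order -- a sufficient,
# order-dependent condition.
def reuse_pairs(gates):
    first, last = {}, {}
    for t, gate in enumerate(gates):
        for q in gate:
            first.setdefault(q, t)  # index of q's first gate
            last[q] = t             # index of q's last gate
    pairs = []
    used_src, used_dst = set(), set()
    for q in sorted(last, key=last.get):        # qubits finishing earliest first
        for r in sorted(first, key=first.get):  # qubits starting earliest first
            if (q != r and last[q] < first[r]
                    and q not in used_src and r not in used_dst):
                pairs.append((q, r))
                used_src.add(q)
                used_dst.add(r)
    return pairs

# Example circuit: ("q0","q1") is a two-qubit gate on q0 and q1, etc.
circuit = [("q0", "q1"), ("q1",), ("q2", "q3"), ("q3",)]
```

Since each reuse pair hands one measured qubit over as another qubit's fresh input, the pairs form chains, and the compiled circuit needs roughly (number of logical qubits) minus (number of pairs) physical qubits; here the 4-qubit circuit runs on 2.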
Quanfeng Deng;Jing Wu;Qiangyu Pei;Chuangxun Lin;Chen Yu;Hai Jin
The serverless computing paradigm has emerged as a promising solution to address resource underutilization and inflexible service scaling in edge environments by decomposing a monolithic application into a serverless workflow. However, existing serverless platforms are primarily designed for cloud data centers, relying on heavyweight isolation mechanisms that are ill-suited for resource-constrained edge computing. These limitations lead to high latency, low deployment density, and restricted parallelism. In this paper, we propose Sonnet, a serverless platform tailored for edge computing, capable of rapidly responding to user requests and supporting efficient and elastic service scaling. Sonnet offers these features by (i) employing lightweight WebAssembly as the execution environment for functions, (ii) leveraging serverless workflow information to optimize function deployment in resource-constrained edge environments, and (iii) providing a function deployment algorithm that achieves dynamic load balancing within the cluster. An extensive evaluation of Sonnet with real-world serverless workflows demonstrates its effectiveness and practical applicability. Compared with state-of-the-art and commonly used edge-computing serverless solutions, Sonnet reduces end-to-end latency by 27% and improves throughput by 2.83×.
{"title":"Sonnet: A Workflow-Aware Serverless Platform for Time-Sensitive Edge Computing With WebAssembly","authors":"Quanfeng Deng;Jing Wu;Qiangyu Pei;Chuangxun Lin;Chen Yu;Hai Jin","doi":"10.1109/TC.2025.3628246","DOIUrl":"https://doi.org/10.1109/TC.2025.3628246","url":null,"abstract":"The serverless computing paradigm has emerged as a promising solution to address the resource underutilization and inflexible service scaling in edge environments by decomposing the monolithic application into a serverless workflow. However, existing serverless platforms are primarily designed for cloud data centers, relying on heavyweight isolation mechanisms that are ill-suited for resource-constrained edge computing. These limitations lead to high latency, low deployment density, and restricted parallelism. In this paper, we propose <i>Sonnet</i>, a serverless platform tailored for edge computing, capable of rapidly responding to user requests and supporting efficient and elastic service scaling. Sonnet offers these features by (i) employing lightweight WebAssembly as the execution environment for functions, (ii) leveraging serverless workflow information to optimize function deployment on resource-constrained edge environments, and (iii) providing a function deployment algorithm that achieves dynamic load balancing within the cluster. An extensive evaluation of <i>Sonnet</i> with real-world serverless workflows demonstrates its effectiveness and practical applicability. 
Compared with SOTA and commonly used edge computing serverless solutions, Sonnet can reduce end-to-end latency by 27% and improve throughput by 2.83<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 1","pages":"395-408"},"PeriodicalIF":3.8,"publicationDate":"2025-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11230066","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145712559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuxuan Qin;Chuxiong Lin;Guoming Rao;Ling Yang;Weiguang Sheng;Weifeng He
DRAM latency remains a critical bottleneck in the performance of modern computing systems. However, the latency is excessively conservative due to the timing margins imposed by DRAM vendors to accommodate rare worst-case scenarios, such as weak cells and high temperatures. In this study, we introduce a temperature- and process-variation-aware timing detection and adaptation DRAM (TPDA-DRAM) architecture that dynamically mitigates timing margins at runtime. TPDA-DRAM leverages innovative in-situ cross-coupled detectors to monitor voltage differences between bitline pairs inside DRAM arrays, ensuring precise detection of timing margins. Additionally, the proposed detector inherently accelerates the precharge operation of DRAM, thereby reducing the precharge latency by up to 62.5%. Building upon this architecture, we propose two variation-aware timing adaptation schemes: 1) a process-variation-aware adaptation (PVA) scheme that accelerates access to weak cells, mitigating process-induced timing margins, and 2) a temperature-variation-aware adaptation (TVA) scheme that leverages temperature information and the restoration truncation technique to reduce DRAM latency, mitigating temperature-induced timing margins. Evaluations on an eight-core computing system show that TPDA-DRAM improves average performance by 21.8% and energy efficiency by 18.2%.
{"title":"TPDA-DRAM: A Variation-Aware DRAM Improving System Performance via In-Situ Timing Margin Detection and Adaptive Mitigation","authors":"Yuxuan Qin;Chuxiong Lin;Guoming Rao;Ling Yang;Weiguang Sheng;Weifeng He","doi":"10.1109/TC.2025.3627945","DOIUrl":"https://doi.org/10.1109/TC.2025.3627945","url":null,"abstract":"DRAM latency remains a critical bottleneck in the performance of modern computing systems. However, the latency is excessively conservative due to the timing margins imposed by DRAM vendors to accommodate rare worst-case scenarios, such as weak cells and high temperatures. In this study, we introduce a temperature- and process-variation-aware timing detection and adaptation DRAM (TPDA-DRAM) architecture that dynamically mitigates timing margins at runtime. TPDA-DRAM leverages innovative <italic>in-situ</i> cross-coupled detectors to monitor voltage differences between bitline pairs inside DRAM arrays, ensuring precise detection of timing margins. Additionally, the proposed detector inherently accelerates the precharge operation of DRAM, thereby reducing the precharge latency by up to 62.5%. Building upon this architecture, we propose two variation-aware timing adaptation schemes: 1) a process-variation-aware adaptation (PVA) scheme that accelerates access to weak cells, mitigating process-induced timing margins, and 2) a temperature-variation-aware adaptation (TVA) scheme that leverages temperature information and the restoration truncation technique to reduce DRAM latency, mitigating temperature-induced timing margins. 
Evaluations on an eight-core computing system show that TPDA-DRAM improves average performance by 21.8% and energy efficiency by 18.2%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 1","pages":"350-364"},"PeriodicalIF":3.8,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compute-in-memory (CIM) has emerged as a pivotal direction for accelerating workloads in the field of machine learning, such as Deep Neural Networks (DNNs). However, the effective exploitation of sparsity in CIM systems presents numerous challenges due to the inherent limitations of their rigid array structures. Designing sparse DNN dataflows and developing efficient mapping strategies also become more complex when accounting for diverse sparsity patterns and the flexibility of a multi-macro CIM structure. Despite these complexities, there is still no unified systematic view and modeling approach for diverse sparse DNN workloads in CIM systems. In this paper, we propose CIMinus, a framework dedicated to cost modeling for sparse DNN workloads on CIM architectures. It provides an in-depth energy consumption analysis at the level of individual components and an assessment of overall workload latency. We validate CIMinus against contemporary CIM architectures and demonstrate its applicability in two use cases. These cases provide valuable insights into both the impact of sparsity patterns and the effectiveness of mapping strategies, bridging the gap between theoretical design and practical implementation.
{"title":"CIMinus: Empowering Sparse DNN Workloads Modeling and Exploration on SRAM-Based CIM Architectures","authors":"Yingjie Qi;Jianlei Yang;Rubing Yang;Cenlin Duan;Xiaolin He;Ziyan He;Weitao Pan;Weisheng Zhao","doi":"10.1109/TC.2025.3628114","DOIUrl":"https://doi.org/10.1109/TC.2025.3628114","url":null,"abstract":"Compute-in-memory (CIM) has emerged as a pivotal direction for accelerating workloads in the field of machine learning, such as Deep Neural Networks (DNNs). However, the effective exploitation of sparsity in CIM systems presents numerous challenges, due to the inherent limitations in their rigid array structures. Designing sparse DNN dataflows and developing efficient mapping strategies also become more complex when accounting for diverse sparsity patterns and the flexibility of a multi-macro CIM structure. Despite these complexities, there is still an absence of a unified systematic view and modeling approach for diverse sparse DNN workloads in CIM systems. In this paper, we propose CIMinus, a framework dedicated to cost modeling for sparse DNN workloads on CIM architectures. It provides an in-depth energy consumption analysis at the level of individual components and an assessment of the overall workload latency. We validate CIMinus against contemporary CIM architectures and demonstrate its applicability in two use-cases. 
These cases provide valuable insights into both the impact of sparsity patterns and the effectiveness of mapping strategies, bridging the gap between theoretical design and practical implementation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 1","pages":"380-394"},"PeriodicalIF":3.8,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145712545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
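The flavor of such component-level cost modeling can be conveyed with a toy first-order model. The constants, the 64×64 macro shape, and the tile-skipping rule below are illustrative assumptions, not CIMinus's actual parameters or analysis.

```python
import math

# Toy component-level CIM cost model (hypothetical numbers throughout).
# Assumes a weight matrix is tiled onto macro-sized crossbars and that a
# fraction `sparsity` of tiles is entirely zero and can be skipped.
def cim_cost(rows, cols, sparsity, macro=(64, 64),
             e_array=1.0, e_adc=4.0, e_dac=0.5):
    tiles = math.ceil(rows / macro[0]) * math.ceil(cols / macro[1])
    active = math.ceil(tiles * (1 - sparsity))   # tiles actually computed
    # Per-activation energy: DACs drive rows, the array computes,
    # ADCs read out columns.
    energy = active * (macro[0] * e_dac
                       + macro[0] * macro[1] * e_array
                       + macro[1] * e_adc)
    latency = active  # one macro activation per cycle, fully serialized
    return energy, latency
```

For a 128×128 layer with 50% structured tile sparsity, only 2 of 4 tiles are activated, halving both the toy energy and latency figures.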
Virtualization environments (e.g., containers and hypervisors) achieve isolation of multiple runtime entities but result in two mutually isolated guest and host layers. Such cross-layer isolation can cause high latency and low throughput. Previous aware scheduling and double scheduling fail to achieve bidirectional coordination between the guest and host layers. To address this challenge, we develop DCS3, a Dual-layer Co-aware Scheduler that combines stealing balance and synchronized priority. Stealing balance migrates tasks between virtual CPU (vCPU) queues for load balance based on the workloads of physical CPUs (pCPUs). Synchronized priority dynamically adjusts the priorities of threads running on the pCPUs according to the current vCPU workloads. The vCPUs and pCPUs belong to the guest and host layers, respectively. Compared with aware scheduling, double scheduling, and DCS2 (i.e., DCS3 without synchronized priority), DCS3 has the following clear advantages: 1) Requests Per Second (RPS) increases by up to 52%, 55%, and 2%, respectively; 2) request latency decreases by up to 72%, 71%, and 20%, respectively.
{"title":"DCS3: A Dual-Layer Co-Aware Scheduler With Stealing Balance and Synchronized Priority in Virtualization Environments","authors":"Chenglai Xiong;Junjie Wen;Guoqi Xie;Zhongjia Wang;Zhenli He;Shaowen Yao;Jianfeng Tan;Tianyu Zhou;Tiwei Bie;Yan Yan;Shoumeng Yan","doi":"10.1109/TC.2025.3628012","DOIUrl":"https://doi.org/10.1109/TC.2025.3628012","url":null,"abstract":"Virtualization environments (e.g., containers and hypervisors) achieve isolation of multiple runtime entities but result in two mutually isolated guest and host layers. Such cross-layer isolation could cause high latency and low throughput of the system. Previous aware scheduling and double scheduling fail to achieve bidirectional coordination between the guest and host layers. To address this challenge, we develop DCS3, a Dual-layer Co-aware Scheduler that combines stealing balance and synchronized priority. Stealing balancing migrates tasks between virtual CPU (vCPU) queues for load balance based on the workloads of physical CPUs (pCPUs). Synchronized priority dynamically adjusts the thread priorities running on the pCPUs according to the current vCPU workloads. The vCPUs and pCPUs belong to the guest and host layers, respectively. 
Compared with aware scheduling, double scheduling, and DCS2 (i.e., DCS3 without synchronized priority), DCS3 has the following obvious advantages: 1) Requests Per Second (RPS) increases by up to 52%, 55%, and 2%, respectively; 2) request latency decreases by up to 72%, 71%, and 20%, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 1","pages":"365-379"},"PeriodicalIF":3.8,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145712544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
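The stealing-balance idea can be rendered as a minimal sketch: migrate tasks from the vCPU queue backed by the busiest pCPU to the one backed by the idlest pCPU until their loads converge. The threshold, the one-task migration granularity, and the single busiest/idlest pairing are assumptions for illustration, not DCS3's actual policy.

```python
# Hypothetical stealing-balance sketch (illustrative, not DCS3's policy).
# vcpu_queues: {vcpu_name: [task, ...]}
# pcpu_load:   {vcpu_name: load of the pCPU backing that vCPU}
def steal_balance(vcpu_queues, pcpu_load, threshold=2):
    busiest = max(vcpu_queues, key=lambda v: pcpu_load[v])
    idlest = min(vcpu_queues, key=lambda v: pcpu_load[v])
    # Migrate one task at a time until the load gap drops below the threshold.
    while (pcpu_load[busiest] - pcpu_load[idlest] >= threshold
           and vcpu_queues[busiest]):
        task = vcpu_queues[busiest].pop()
        vcpu_queues[idlest].append(task)
        pcpu_load[busiest] -= 1
        pcpu_load[idlest] += 1
    return vcpu_queues
```

Synchronized priority would then act in the opposite direction, feeding the post-migration vCPU workloads back into the host layer's thread priorities.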
Yuhui Cai;Guowen Gong;Zhirong Shen;Jiahui Yang;Jiwu Shu
Erasure coding has been extensively deployed in today’s commodity HPC systems against unexpected failures. To adapt to varying access characteristics and reliability demands, storage clusters have to perform redundancy transitioning by tuning the coding parameters, which unfortunately gives rise to substantial transitioning traffic. We present ElasticEC, a fast and elastic redundancy transitioning approach for erasure-coded clusters. ElasticEC first minimizes the transitioning traffic by proposing a relocation-aware stripe reorganization mechanism and a collecting-and-encoding algorithm. It further heuristically balances the transitioning traffic across nodes. We implement ElasticEC in Hadoop HDFS and conduct extensive experiments on a real-world cloud storage cluster, showing that ElasticEC reduces the transitioning traffic by 71.1%–92.6% and shortens the transitioning time by 65.9%–90.7%.
{"title":"ElasticEC: Achieving Fast and Elastic Redundancy Transitioning in Erasure-Coded Clusters","authors":"Yuhui Cai;Guowen Gong;Zhirong Shen;Jiahui Yang;Jiwu Shu","doi":"10.1109/TC.2025.3614839","DOIUrl":"https://doi.org/10.1109/TC.2025.3614839","url":null,"abstract":"Erasure coding has been extensively deployed in today’s commodity HPC systems against unexpected failures. To adapt to the varying access characteristics and reliability demands, storage clusters have to perform redundancy transitioning via tuning the coding parameters, which unfortunately gives rise to substantial transitioning traffic. We present <inline-formula><tex-math>$textsf{ElasticEC}$</tex-math></inline-formula>, a fast and elastic redundancy transitioning approach for erasure-coded clusters. <inline-formula><tex-math>$textsf{ElasticEC}$</tex-math></inline-formula> first minimizes the transitioning traffic via proposing a relocation-aware stripe reorganization mechanism and a collecting-and-encoding algorithm. It further heuristically balances the transitioning traffic across nodes. 
We implement <inline-formula><tex-math>$textsf{ElasticEC}$</tex-math></inline-formula> in Hadoop HDFS and conduct extensive experiments on a real-world cloud storage cluster, showing that <inline-formula><tex-math>$textsf{ElasticEC}$</tex-math></inline-formula> can reduce 71.1-92.6% of the transitioning traffic and shorten 65.9-90.7% of the transitioning time.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4168-4181"},"PeriodicalIF":3.8,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
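To see why transitioning traffic is substantial in the first place, a back-of-the-envelope count for a naive transition is useful. The sketch below assumes stripe merging with k2 a multiple of k1 and that every data block of a new stripe must be fetched by an encoding node; it is purely illustrative, since avoiding most of this traffic is exactly what ElasticEC targets.

```python
# Hypothetical per-stripe traffic of a naive (k1, m1) -> (k2, m2)
# redundancy transition (not ElasticEC's scheme).
def naive_transition_traffic(k1, m1, k2, m2):
    assert k2 % k1 == 0, "sketch assumes whole old stripes are merged"
    fetch = k2        # data blocks pulled to the encoding node
    distribute = m2   # newly encoded parity blocks written out
    return fetch + distribute  # blocks moved per new stripe
```

For example, widening RS(4, 2) to RS(8, 2) naively moves 10 blocks per new stripe even though only 2 parity blocks actually change.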
Graph partitioning, a classic NP-complete problem, is the most fundamental procedure that must be performed before parallel computation. Partitioners can be divided into vertex- and edge-based approaches. Recently, both approaches have employed a streaming heuristic to find approximate solutions. This heuristic is lightweight in space and time complexity but suffers from suboptimal partitioning quality, especially for directed graphs, where the explicit knowledge available to the heuristic is limited. This paper therefore proposes new heuristics for both vertex-based and edge-based partitioning. They improve quality by additionally exploiting implicit knowledge embedded in the local streaming view and the global graph view. Memory reduction techniques are presented to extract this knowledge at negligible space cost, preserving the lightweight advantages of streaming partitioning. In addition, we study parallel acceleration and restreaming to further boost partitioning efficiency and quality. Extensive experiments validate that our proposals outperform state-of-the-art competitors.
{"title":"Lightweight Graph Partitioning Enhanced by Implicit Knowledge","authors":"Zhigang Wang;Gongtai Sun;Ning Wang;Lixin Gao;Chuanfei Xu;Yu Gu;Ge Yu;Zhihong Tian","doi":"10.1109/TC.2025.3612730","DOIUrl":"https://doi.org/10.1109/TC.2025.3612730","url":null,"abstract":"Graph partitioning as a classic NP-complete problem, is the most fundamental procedure that needs to be performed before parallel computations. Partitioners can be divided into vertex- and edge-based approaches. Recently, both approaches are employing a streaming heuristic to find approximate solutions. It is lightweight in space and time complexities, but suffers from suboptimal partitioning quality, especially for directed graphs where the explicit knowledge provided for heuristic is limited. This paper thereby proposes new heuristics for not only vertex-based but also edge-based partitioning. They improve quality by additionally utilizing implicit knowledge, which is embedded in the local streaming view and the global graph view. Memory reduction techniques are presented to extract this knowledge with negligible space costs. That preserves the lightweight advantages of streaming partitioning. Besides, we study parallel acceleration and restreaming, to further boost the partitioning efficiency and quality. Extensive experiments validate that our proposals outperform the state-of-the-art competitors.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4153-4167"},"PeriodicalIF":3.8,"publicationDate":"2025-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To achieve seamless portability across the embedded computing continuum, we introduce a new kernel architecture: fluid kernels. Fluid kernels can be thought of as the intersection between embedded unikernels and general-purpose monolithic kernels, allowing applications to be developed seamlessly in both kernel space and user space in a unified way. This scalable kernel architecture can manage the trade-off between performance, code size, isolation, and security. We compare our fluid kernel implementation, Miosix, to Linux and FreeRTOS on the same hardware with standard benchmarks. Compared to Linux, we achieve an average speedup of 3.5× and a maximum of up to 15.4×. We also achieve an average code size reduction of 84% and a maximum of up to 90%. By moving application code from user space to kernel space, an additional code size reduction of up to 56% and a speedup of up to 1.3× can be achieved. Compared to FreeRTOS, Miosix costs only a moderate amount of code size (at most 47 KB) in exchange for significant advantages in application performance, with speedups averaging 1.5× and reaching up to 5×.
{"title":"Fluid Kernels: Seamlessly Conquering the Embedded Computing Continuum","authors":"Federico Terraneo;Daniele Cattaneo","doi":"10.1109/TC.2025.3605745","DOIUrl":"https://doi.org/10.1109/TC.2025.3605745","url":null,"abstract":"To achieve seamless portability across the embedded computing continuum, we introduce a new kernel architecture: fluid kernels. Fluid kernels can be thought of as the intersection between embedded unikernels and general purpose monolithic kernels, allowing to seamlessly develop applications both in kernel space and user space in a unified way. This scalable kernel architecture can manage the trade-off between performance, code size, isolation and security. We compare our fluid kernel implementation, Miosix, to Linux and FreeRTOS on the same hardware with standard benchmarks. Compared to Linux, we achieve an average speedup of 3.5<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula> and a maximum of up to 15.4<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula>. We also achieve an average code size reduction of 84% and a maximum of up to 90%. By moving application code from user space to kernel space, an additional code size reduction up to 56% and a speedup up to 1.3<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula> can be achieved. 
Compared to FreeRTOS, the use of Miosix only costs a moderate amount of code size (at most 47KB) for significant advantages in application performance with speedups averaging at 1.5<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula> and up to 5<inline-formula><tex-math>${boldsymbol{times}}$</tex-math></inline-formula>.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4050-4064"},"PeriodicalIF":3.8,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11173649","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asynchronous common subset (ACS) is a powerful paradigm enabling applications such as Byzantine fault tolerance (BFT) and multi-party computation (MPC). The most efficient ACS framework in the information-theoretic setting is due to Ben-Or, Kelmer, and Rabin (BKR, 1994). The BKR ACS protocol has been both theoretically and practically impactful. BKR ACS has an O(log n) running time (where n is the number of replicas) due to its use of n parallel asynchronous binary agreement (ABA) instances, impacting both performance and scalability. Indeed, for a network of 16–64 replicas, the parallel ABA phase occupies about 95%–97% of the total runtime. A long-standing open problem is whether we can build an ACS framework with O(1) time while not increasing the message or communication complexity of the BKR protocol. We resolve this open problem, presenting the first constant-time ACS protocol with O(n^3) messages in the information-theoretic and signature-free settings. Our key ingredient is the first information-theoretic, constant-time multivalued validated Byzantine agreement (MVBA) protocol. Our results can improve, both asymptotically and concretely, various applications using ACS and MVBA. As an example, we implement FIN, a BFT protocol instantiated using our framework. Via a 121-server deployment on Amazon EC2, we show that FIN reduces the overhead of the ABA phase to as low as 1.23% of the total runtime.
{"title":"Practical Signature-Free Multivalued Validated Byzantine Agreement and Asynchronous Common Subset in Constant Time","authors":"Xin Wang;Xiao Sui;Sisi Duan;Haibin Zhang","doi":"10.1109/TC.2025.3607476","DOIUrl":"https://doi.org/10.1109/TC.2025.3607476","url":null,"abstract":"Asynchronous common subset (ACS) is a powerful paradigm enabling applications such as Byzantine fault-tolerance (BFT) and multi-party computation (MPC). The most efficient ACS framework in the information-theoretic setting is due to Ben-Or, Kelmer, and Rabin (BKR, 1994). The BKR ACS protocol has been both theoretically and practically impactful. BKR ACS has an <inline-formula><tex-math>$O(log n)$</tex-math></inline-formula> running time (where <inline-formula><tex-math>$n$</tex-math></inline-formula> is the number of replicas) due to the usage of <inline-formula><tex-math>$n$</tex-math></inline-formula> parallel asynchronous binary agreement (ABA) instances, impacting both performance and scalability. Indeed, for a network of 16<inline-formula><tex-math>$sim$</tex-math></inline-formula> 64 replicas, the parallel ABA phase occupies about 95%<inline-formula><tex-math>$sim$</tex-math></inline-formula> 97% of the total runtime. A long-standing open problem is whether we can build an ACS framework with <inline-formula><tex-math>$O(1)$</tex-math></inline-formula> time while not increasing the message or communication complexity of the BKR protocol. We resolve the open problem, presenting the first constant-time ACS protocol with <inline-formula><tex-math>$O(n^{3})$</tex-math></inline-formula> messages in the information-theoretic and signature-free settings. Our key ingredient is the first information-theoretic and constant-time multivalued validated Byzantine agreement (MVBA) protocol. Our results can improve—asymptotically and concretely—various applications using ACS and MVBA. As an example, we implement FIN, a BFT protocol instantiated using our framework. 
Via a 121-server deployment on Amazon EC2, we show FIN reduces the overhead of the ABA phase to as low as 1.23% of the total runtime.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4138-4152"},"PeriodicalIF":3.8,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Silvano Chiaradonna;Felicita Di Giandomenico;Giulio Masetti
To cope with unforeseen attacks on software systems in critical application domains, redundancy-based intrusion-tolerant system (ITS) schemes are among the popular countermeasures to deploy. Designing an adequate ITS for stated security requirements calls for stochastic analysis support able to assess the impact of a variety of attack patterns on different ITS configurations. As a contribution to this purpose, a stochastic model for ITSs is proposed; its novel aspects are the ability to account both for camouflaging components and for correlation between the security failures affecting the diverse implementations of the software cyber protections adopted in the ITS. Extensive analyses are conducted to show the applicability of the model; the obtained results help in understanding the limits and strengths of selected ITS configurations when subjected to attacks occurring in conditions unfavorable to the defender.
{"title":"Stochastic Modeling of Intrusion Tolerant Systems Based on Redundancy and Diversity","authors":"Silvano Chiaradonna;Felicita Di Giandomenico;Giulio Masetti","doi":"10.1109/TC.2025.3606189","DOIUrl":"https://doi.org/10.1109/TC.2025.3606189","url":null,"abstract":"To cope with unforeseen attacks to software systems in critical application domains, redundancy-based ITSs schemes are among popular countermeasures to deploy. Designing the adequate ITS for the stated security requirements calls for stochastic analysis supports, able to assess the impact of variety of attack patterns on different ITS configurations. As contribution to this purpose, a stochastic model for ITS is proposed, whose novel aspects are the ability to account for both camouflaging components and for correlation aspects between the security failures affecting the diverse implementations of the software cyber protections adopted in the ITS. Extensive analyses are conducted to show the applicability of the model; the obtained results allow to understand the limits and strengths of selected ITS configurations when subject to attacks occurring in unfavorable conditions for the defender.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 12","pages":"4123-4137"},"PeriodicalIF":3.8,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}