Pub Date: 2025-07-18 | DOI: 10.1109/TCAD.2025.3584438
{"title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems publication information","authors":"","doi":"10.1109/TCAD.2025.3584438","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3584438","url":null,"abstract":"","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"C3-C3"},"PeriodicalIF":2.7,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11085019","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144663720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-18 | DOI: 10.1109/TCAD.2025.3584436
{"title":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems society information","authors":"","doi":"10.1109/TCAD.2025.3584436","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3584436","url":null,"abstract":"","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 8","pages":"C2-C2"},"PeriodicalIF":2.7,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11085014","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144657470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the current NISQ era, the performance of quantum neural network (QNN) models is severely hindered by limited qubit counts and unavoidable noise. A natural way to improve the robustness of QNNs is to implement them as a distributed system. Nevertheless, due to the heterogeneity and instability of quantum chips (e.g., noise and frequent online/offline transitions), training and inference on distributed quantum devices can even degrade accuracy. In this article, we propose HeteroQNN, a comprehensive QNN framework designed for efficient, high-accuracy distributed training and inference. The main innovation of HeteroQNN is that it decouples the QNN circuit into two uniform representations: a model vector and a behavioral vector. The model vector specifies the gate parameters of the QNN model, while the behavioral vector captures the hardware features involved in implementing the QNN circuit. To handle architectural heterogeneity, we introduce personalized QNN models on each quantum processing unit (QPU) and share gradients among QPUs with homogeneous behavioral vectors. We propose shot-oriented distributed inference, a much finer-grained scheduling scheme that improves accuracy and balances the workload. Finally, by leveraging the hidden homogeneity in the model vector, we present a maintenance scheme for QPU variability. Experiments show that HeteroQNN accelerates training by $4.03\times$ with a 7.87% loss reduction compared with the previous distributed QNN framework.
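To make the decoupling concrete, here is a minimal sketch of how gradients might be shared only among QPUs whose behavioral vectors match, while each QPU keeps a personalized model vector. The function name, the distance-based notion of "homogeneous", and the fixed learning rate are illustrative assumptions, not HeteroQNN's actual implementation.

```python
import numpy as np

def share_gradients(model_vectors, behavioral_vectors, gradients, tol=1e-2, lr=0.1):
    """Average gradients only among QPUs whose behavioral vectors are close."""
    qpus = list(model_vectors)
    updated = {}
    for q in qpus:
        # "homogeneous" QPUs: behavioral vectors within a small distance of q's
        peers = [p for p in qpus
                 if np.linalg.norm(behavioral_vectors[p] - behavioral_vectors[q]) < tol]
        shared = np.mean([gradients[p] for p in peers], axis=0)
        updated[q] = model_vectors[q] - lr * shared   # personalized update per QPU
    return updated

# toy example: qpu0 and qpu1 have near-identical hardware features, qpu2 differs
mv = {q: np.zeros(4) for q in ("qpu0", "qpu1", "qpu2")}
bv = {"qpu0": np.array([0.010, 0.90]),
      "qpu1": np.array([0.011, 0.90]),
      "qpu2": np.array([0.200, 0.50])}
gr = {q: np.random.randn(4) for q in mv}
print(share_gradients(mv, bv, gr))
```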
{"title":"HeteroQNN: Enabling Distributed QNN Under Heterogeneous Quantum Devices","authors":"Liqiang Lu;Tianyao Chu;Siwei Tan;Jingwen Leng;Fangxin Liu;Congliang Lang;Yifan Guo;Jianwei Yin","doi":"10.1109/TCAD.2025.3588457","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3588457","url":null,"abstract":"In the current NISQ era, the performance of quantum neural network (QNN) models is strictly hindered by the limited qubit number and inevitable noise. A natural idea to improve the robustness of QNN is the implementation of a distributed system. Nevertheless, due to the heterogeneity and instability of quantum chips (e.g., noise, frequent online/offline), training and inference on distributed quantum devices may even destroy the accuracy. In this article, we propose HeteroQNN, a comprehensive QNN framework designed for efficient and high-accuracy distributed training and inference. The main innovation of HeteroQNN is it decouples the QNN circuit into two uniform representations: model vector and behavioral vector. The model vector specifies the gate parameters in the QNN model, while the behavioral vector captures the hardware features when implementing the QNN circuit. To handle the architectural heterogeneity, we introduce personalized QNN models in each quantum processing unit (QPU) and share the gradient among QPUs with homogeneous behavioral vectors. We propose shot-oriented distributed inference, which is much more fine-grained scheduling that can improve accuracy and balance the workload. Finally, by leveraging the hidden homogeneity in the model vector, we present the maintenance for QPU variability. The experiments show that HeteroQNN accelerates the training process by <inline-formula> <tex-math>$4.03 times $ </tex-math></inline-formula> with 7.87% loss reduction, compared with the previous distributed QNN framework.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"1007-1020"},"PeriodicalIF":2.9,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11078436","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-08 | DOI: 10.1109/TCAD.2025.3586890
Zehao Chen;Yang Zhang;Ying Zeng;Wenhua Wu;Guojun Han
3-D NAND flash memory has attracted widespread attention due to its high speed, high endurance, and strong reliability. However, its reliability degrades as the number of program and erase operations increases. To tackle this problem, current research mostly employs machine learning models to predict flash memory failure, but it rarely exploits the interlayer difference and page-type difference characteristics inside flash memory chips to aid failure prediction. Based on these internal characteristics, this article proposes two failure prediction algorithms, corresponding to Standard1 and Standard2. For Standard1, an attention-focused failure prediction (AFFP) algorithm is proposed. To predict the failure of an entire block, AFFP focuses only on the layer most prone to failure and further predicts the eight pages most likely to fail within that layer. For Standard2, a low predict-frequency failure prediction (LPFFP) algorithm is proposed, which significantly reduces the frequency of failure prediction and thus minimizes the prediction overhead. The experimental results show that, for Standard1, the AFFP algorithm accurately predicts failing blocks, reducing data extraction and prediction overheads by 99.8% compared to the original algorithm while achieving an F1-score above 0.96. For Standard2, the LPFFP algorithm accurately predicts failing pages within a flash block, with an F1-score above 0.91 and a significant reduction in prediction overhead.
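As an illustration of the layer-focused idea, the sketch below ranks the layers of a block by a per-layer error statistic, picks the layer most prone to failure, and returns the eight pages most likely to fail in it. The use of raw bit error rate as the statistic and the simple argmax/argsort ranking are assumptions for illustration; the actual AFFP algorithm is attention-based.

```python
import numpy as np

def affp_predict(block_rber, top_pages=8):
    """block_rber: 2-D array [layers x pages] of per-page raw bit error rates.
    Returns the most failure-prone layer and the pages most likely to fail in it."""
    layer_risk = block_rber.mean(axis=1)          # per-layer error statistic
    worst_layer = int(np.argmax(layer_risk))      # layer most prone to failure
    page_rank = np.argsort(block_rber[worst_layer])[::-1][:top_pages]
    return worst_layer, page_rank.tolist()

rber = np.random.rand(64, 16) * 1e-3              # 64 layers, 16 pages per layer (assumed)
rber[37] *= 5                                     # inject an unusually weak layer
print(affp_predict(rber))
```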
{"title":"Lightweight Failure Prediction Algorithms Based on Internal Characteristics of 3-D nand Flash Memory","authors":"Zehao Chen;Yang Zhang;Ying Zeng;Wenhua Wu;Guojun Han","doi":"10.1109/TCAD.2025.3586890","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3586890","url":null,"abstract":"3-D <sc>nand</small> flash memory has attacked widespread attention due to its fast speed, high endurance, and strong reliability. However, its reliability decreases as program and erase time increases. To tackle this problem, current researches mostly employ machine learning models to predict flash memory failure, but there lacks the consideration of using the interlayer difference and page type difference characteristics in flash memory chips to help failure prediction. Based on the internal characteristic of interlayer difference and page type difference, two failure prediction algorithms are proposed in this article, corresponding to the Standard1 and Standard2. For Standard1, an attention focused failure prediction (AFFP) algorithm is proposed. To predict the failure of the entire block, the proposed AFFP algorithm only focuses on the layer which is the most prone to failure and further predicts eight pages of the most likely failure pages within this layer. For Standard2, a low predict-frequency failure prediction (LPFFP) algorithm is proposed, which can reduce the frequency of failure prediction significantly and thus reduce the prediction overhead as much as possible. The experimental results show that, for Standard1, the AFFP algorithm can predict the block of failure accurately, and its data extraction and prediction overheads are reduced by 99.8% compared to the original algorithm, and meanwhile the F1-score exceeds 0.96. For Standard2, the LPFFP algorithm can predict the page of failure within a flash block accurately, and its F1-score exceeds 0.91 with a significant reduction in prediction overhead.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"832-844"},"PeriodicalIF":2.9,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The log-structured nature of NAND flash storage necessitates garbage collection (GC) in solid-state drives (SSDs). GC is a major source of runtime write amplification (WA), leading to faster device wear-out and interference with host I/Os. The key to mitigating this problem is separating data by lifetime so that data in the same flash block are invalidated within temporal proximity. For higher lifetime prediction accuracy and adaptability, prior works proposed using machine learning (ML) algorithms for data separation. However, existing learning-based solutions perform data lifetime prediction at the host side, leading to several drawbacks. First, host-side prediction has no knowledge of the internal data movement inside the SSD during GC and thus fails to leverage the opportunity to further separate GC writes, resulting in suboptimal WA reduction in the long term. Second, performing prediction at the host significantly prolongs the I/O critical path and consumes host resources that could otherwise be used for serving user applications. We present Shiro, a holistic flash translation layer (FTL) design that performs in-storage data separation for both user writes and GC writes for maximal long-term WA reduction. For user writes, Shiro uses a sequence model to accurately predict data lifetime by learning lifetime distributions from long historical access patterns. For GC writes, Shiro incorporates a reinforcement learning-assisted page migration strategy that takes direct feedback from long-term WA to further improve data separation efficacy. To address the challenges posed by performing fine-grained, real-time ML decisions inside the resource-constrained SSD, we propose a suite of enabling techniques to keep computation and storage overhead low. Extensive evaluation of Shiro on real-world traces shows that it delivers 29%–68% lower WA compared with a conventional FTL and state-of-the-art in-storage data separation schemes. Furthermore, thanks to lower data migration overhead during GC, Shiro achieves significantly higher steady-state I/O performance.
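The following toy sketch illustrates lifetime-based write separation: updates to the same logical address are tracked, a deliberately naive lifetime estimate is formed from past update intervals, and each write is routed to a hotness stream. The class name, thresholds, and mean-interval predictor are assumptions for illustration; Shiro's actual predictor is a learned sequence model and its GC migration policy is RL-assisted.

```python
from collections import deque

class LifetimeSeparator:
    """Illustrative sketch (not Shiro's model): route writes to streams by
    predicted data lifetime so pages in one block tend to die together."""

    def __init__(self, history_len=64, boundaries=(1_000, 10_000, 100_000)):
        self.history = {}                 # lba -> deque of past update intervals
        self.boundaries = boundaries      # lifetime thresholds separating streams
        self.last_write = {}              # lba -> logical time of last write
        self.history_len = history_len

    def predict_lifetime(self, lba):
        h = self.history.get(lba)
        if not h:
            return float("inf")           # cold until proven hot
        return sum(h) / len(h)            # naive mean of past update intervals

    def on_write(self, lba, now):
        if lba in self.last_write:
            d = self.history.setdefault(lba, deque(maxlen=self.history_len))
            d.append(now - self.last_write[lba])
        self.last_write[lba] = now
        life = self.predict_lifetime(lba)
        # pick a write stream: shorter predicted lifetime -> hotter stream
        for stream, b in enumerate(self.boundaries):
            if life < b:
                return stream
        return len(self.boundaries)       # coldest stream

sep = LifetimeSeparator()
for t, lba in enumerate([1, 2, 1, 3, 1, 2, 1]):
    print(lba, "->", sep.on_write(lba, now=t * 100))
```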
{"title":"Shiro: Efficient and Accurate In-Storage Data Lifetime Separation for nand Flash SSDs","authors":"Penghao Sun;Shengan Zheng;Litong You;Wanru Zhang;Ruoyan Ma;Jie Yang;Feng Zhu;Shu Li;Linpeng Huang","doi":"10.1109/TCAD.2025.3586891","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3586891","url":null,"abstract":"The log-structured nature of <sc>nand</small> flash storage necessitates garbage collection (GC) in solid state drives (SSDs). GC is a major source of runtime write amplification (WA), leading to faster device wear out and interference with host I/Os. The key to mitigating this problem is separating data by lifetime so that data in the same flash block are invalidated within temporal proximity. For higher lifetime prediction accuracy and adaptibility, prior works proposed using machine learning (ML) algorithms for data separation. However, existing learning-based solutions perform data lifetime prediction at the host side, leading to several drawbacks. First, host-side prediction does not have knowledge of the internal data movement inside the SSD during GC, and thus fails to leverage the opportunity to further separate GC writes, resulting in suboptimal WA reduction in the long term. Second, performing prediction at the host significantly prolongs the I/O critical path and consumes host resources that could otherwise be used for serving user applications. We present Shiro, a holistic flash translation layer (FTL) design that performs in-storage data separation for both user writes and GC writes for maximal long-term WA reduction. For user writes, Shiro uses a sequence model to accurately predict data lifetime by learning lifetime distribution from long historical access patterns. For GC writes, Shiro incorporates a reinforcement learning-assisted page migration strategy that takes direct feedback from long-term WA to further improve data separation efficacy. To address the challenges posed by performing fine-grained and real-time ML decisions inside the resource-constrained SSD, we propose a suite of enabling techniques to keep computation and storage overhead low. Extensive evaluation of Shiro on real-world traces shows that Shiro can deliver 29%–68% lower WA compared with conventional FTL and state-of-the-art in-storage data separation schemes. Furthermore, thanks to lower data migration overhead during GC, Shiro achieves significantly higher steady-state I/O performance.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"1028-1041"},"PeriodicalIF":2.9,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-08 | DOI: 10.1109/TCAD.2025.3586889
Dipesh C. Monga;Gaurav Singh;Omar Numan;Kazybek Adam;Martin Andraud;Kari A. I. Halonen
In-memory computing (IMC) has emerged as one of the most promising architectures for efficiently computing artificial intelligence tasks on hardware, particularly deep neural networks (DNNs). IMC can make use of analog computation principles alongside emerging nonvolatile memory (eNVM) technologies, potentially offering several orders of magnitude higher energy efficiency than general-purpose processing units. Yet, the use of analog circuitry, potentially integrated with emerging technologies post-processed on top of silicon wafers, increases the susceptibility of the hardware to a large spectrum of variations, for instance, manufacturing variations, noise, or temperature sensitivity. This susceptibility can hamper the large-scale deployment of IMC circuits in the market. To address the reliability of analog resistive IMC circuits under temperature variations, this article presents TRIM, an on-chip thermal auto-compensation method aimed at fully calibrating first-order temperature effects. TRIM is designed to maintain the computational accuracy of IMC cores in DNN applications over a wide temperature range while being highly scalable and adaptable. In essence, the temperature compensation is realized through a complementary-to-absolute-temperature (CTAT) voltage reference integrated inside a voltage regulator and applied at the zero-reference node of a multiplying digital-to-analog converter (MDAC), eliminating the need for external circuits or look-up tables. The proposed methodology is demonstrated on a proof-of-concept 65 nm CMOS resistive IMC column. Measurement results show that the proof-of-concept auto-compensation system significantly enhances the inference and multiply-and-accumulate (MAC) operation accuracy of any first-order resistive crossbar column, achieving 100% inference accuracy recovery over a temperature range of –20 °C to 60 °C and a 91.3% improvement in MAC operation accuracy, with an area overhead of 2% and a power overhead of less than 0.02%.
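To show why a first-order compensation can work, here is a toy model in which each cell's resistance drifts linearly with temperature and a temperature-dependent reference scales the applied voltage to cancel that drift. All coefficients are assumed values, and the cancellation mechanism here is a generic illustration rather than the paper's CTAT/MDAC regulator circuit.

```python
import numpy as np

# Toy first-order model (illustrative only; values assumed, not from the paper):
# a resistive crossbar column computes I = sum(V_in / R_cell).  Each cell's
# resistance drifts linearly with temperature, R(T) = R0 * (1 + alpha*(T - T0)).
# Scaling the applied voltage by the same factor cancels the drift, which is
# the essence of first-order thermal compensation.

alpha, T0 = 2e-3, 25.0             # assumed temperature coefficient (1/degC) and reference temp
R0 = np.array([10e3, 20e3, 40e3])  # nominal cell resistances (ohms)
V_in = np.array([0.3, 0.5, 0.2])   # input voltages encoding activations (V)

def column_current(T, compensated):
    R = R0 * (1 + alpha * (T - T0))
    V = V_in * (1 + alpha * (T - T0)) if compensated else V_in
    return np.sum(V / R)

for T in (-20.0, 25.0, 60.0):
    print(f"T={T:6.1f}C  raw={column_current(T, False)*1e6:7.3f}uA  "
          f"compensated={column_current(T, True)*1e6:7.3f}uA")
```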
{"title":"TRIM: Thermal Auto-Compensation for Resistive In-Memory Computing","authors":"Dipesh C. Monga;Gaurav Singh;Omar Numan;Kazybek Adam;Martin Andraud;Kari A. I. Halonen","doi":"10.1109/TCAD.2025.3586889","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3586889","url":null,"abstract":"in-memory computing (IMC) has emerged as one of the most promising architectures to efficiently compute artificial intelligence tasks on hardware, particularly deep neural networks (DNNs). IMC can make use of analog computation principles alongside emerging nonvolatile memories (eNVM) technologies, potentially offering several orders of magnitude increased energy efficiency compared to generic processing units. Yet, the use of analog circuitry, potentially integrated with emerging technologies post-processed on top of silicon wafers, increases the susceptibility of hardware to a large spectrum of variations, for instance manufacturing, noise or temperature sensitivity. Hence, this susceptibility can hamper the large-scale deployment of IMC circuits into the market. To tackle the reliability of analog resistive-based IMC circuits regarding temperature variations, this article presents TRIM, a thermal on-chip auto-compensation method aimed at fully calibrating first-order temperature effects. TRIM is designed to maintain the computational accuracy of IMC cores in DNN applications over a wide temperature range, while being highly scalable and adaptable. In essence, the temperature compensation is realized through a complementary-to-absolute-temperature (CTAT) voltage reference integrated inside a voltage regulator and applied at the zero reference node of a multiplying digital-to-analog converter (MDAC), eliminating the need for external circuits or look-up table. The proposed methodology is demonstrated on a proof-of-concept 65 nm CMOS resistive IMC column. Measurement results showcase that the proof-of-concept auto-compensation system significantly enhances inference and multiply-and-accumulate (MAC) operation accuracy of any first-order resistive crossbar column, achieving inference accuracy recovery of 100% over a temperature range of –20 °C to 60 °C and a 91.3% improvement in MAC operation accuracy, with an area overhead of 2% and power overhead of ¡ 0.02%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"943-954"},"PeriodicalIF":2.9,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11073135","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CMOS image sensors (CIS) are integral to both human and computer vision tasks, necessitating continuous improvements in key performance metrics such as latency, power, and noise. While experienced designers can make informed design decisions, novice designers and system architects face challenges due to the complex and expansive CIS design space. This article introduces a systematic methodology that elucidates the tradeoffs among CIS performance metrics and enables efficient design space exploration (DSE). Specifically, we propose a first-principles-based CIS modeling method. By exposing low-level circuit parameters, our modeling method explicitly reveals the impact of design changes on high-level metrics. Based on the modeling method, we propose a DSE process that swiftly evaluates and identifies the optimal CIS design, capable of exploring over $10^{9}$ designs in under a minute without the need for time-consuming SPICE simulations. Our approach is validated through a case study and comparisons with real-world designs, demonstrating its practical utility in guiding early-stage CIS design.
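The sketch below shows the flavor of such an analytic DSE loop: enumerate low-level parameters, evaluate closed-form metric models, and keep the Pareto-optimal designs. The metric formulas and parameter ranges are placeholders, not the paper's first-principles models.

```python
import itertools

# Hypothetical first-principles-style DSE sketch: enumerate low-level CIS
# parameters, evaluate closed-form metric models (placeholders, not the paper's
# equations), and keep the Pareto-optimal designs.  Because each evaluation is
# analytic, very large spaces can be screened without SPICE.

def evaluate(pixel_pitch_um, adc_bits, col_parallel, fps):
    read_noise = 2.0 / pixel_pitch_um + 0.3 * adc_bits        # e-, assumed model
    power_mw   = 0.05 * fps * (2 ** adc_bits) / col_parallel  # mW, assumed model
    latency_ms = 1e3 / fps
    return read_noise, power_mw, latency_ms

space = itertools.product([1.0, 2.0, 3.0],      # pixel pitch (um)
                          [8, 10, 12],          # ADC resolution (bits)
                          [1, 4, 16],           # column-parallel ADC lanes
                          [30, 60, 120])        # frame rate (fps)

def dominates(a, b):                            # smaller is better on all metrics
    return all(x <= y for x, y in zip(a, b)) and a != b

points = [(cfg, evaluate(*cfg)) for cfg in space]
pareto = [(cfg, m) for cfg, m in points
          if not any(dominates(m2, m) for _, m2 in points)]
for cfg, m in pareto[:5]:
    print(cfg, [round(x, 2) for x in m])
```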
{"title":"Systematic Methodology of Modeling and Design Space Exploration for CMOS Image Sensors","authors":"Tianrui Ma;Zhe Gao;Zhe Chen;Ramakrishna Kakarala;Charles Shan;Weidong Cao;Xuan Zhang","doi":"10.1109/TCAD.2025.3585753","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3585753","url":null,"abstract":"CMOS image sensors (CIS) are integral to both human and computer vision tasks, necessitating continuous improvements in key performance metrics, such as latency, power, and noise. Despite experienced designers being able to make informed design decisions, novice designers and system architects face challenges due to the complex and expansive design space of CIS. This article introduces a systematic methodology that elucidates the tradeoffs among CIS performance metrics and enables efficient design space exploration (DSE). Specifically, we propose a first-principle-based CIS modeling method. By exposing low-level circuit parameters, our modeling method explicitly reveals the impacts of design changes on high-level metrics. Based on the modeling method, we propose a DSE process that swiftly evaluates and identifies the optimal CIS design, capable of exploring over <inline-formula> <tex-math>$10^{9}$ </tex-math></inline-formula> designs in under a minute without the need for time-consuming SPICE simulations. Our approach is validated through a case study and comparisons with real-world designs, demonstrating its practical utility in guiding early-stage CIS design.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"1047-1060"},"PeriodicalIF":2.9,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-02 | DOI: 10.1109/TCAD.2025.3585078
Chang Liu;Zhouyang Li;Haixia Wang;Pengfei Qiu;Gang Qu;Dongsheng Wang
ARM CPUs are widely used in both embedded systems and personal computers, where security considerations are becoming increasingly important. Vulnerabilities in hardware components such as the cache and the translation lookaside buffer are well documented, but other components, especially those in the CPU backend, have received far less study, largely because their design and implementation details are unavailable. To address this gap, we present the first in-depth reverse engineering analysis of the memory disambiguation unit (MDU) in the backend of ARM CPUs. Across four microarchitectures from ARM and Apple CPUs, we identify two different MDU designs: switch-based and counter-based. We then analyze the state machine, selection mechanism, and organization of these MDU designs. We further propose new side channels and covert channels, which we call ARMeD channels, that exploit the ARM MDU to leak information. We demonstrate three attacks using ARMeD channels: 1) a cross-process covert channel; 2) website fingerprinting; and 3) a new implementation of the Spectre attack. Finally, we present a defense strategy against ARMeD channels with less than 3% degradation in the MDU’s prediction accuracy.
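For context, the sketch below models a generic counter-based memory-dependence predictor of the kind the paper's "counter-based" MDU design refers to: a table of saturating counters decides whether a load may speculatively bypass older stores. The table size, counter width, and update increments are textbook-style assumptions, not the reverse-engineered ARM state machine.

```python
# Conceptual model of a counter-based memory-dependence predictor.  This is a
# generic illustration, not the ARM MDU state machine identified in the paper.

class CounterMDU:
    def __init__(self, bits=4, threshold=8, entries=64):
        self.max_val = (1 << bits) - 1
        self.threshold = threshold
        self.table = [0] * entries        # saturating counters indexed by load PC

    def _idx(self, load_pc):
        return load_pc % len(self.table)

    def predict_bypass(self, load_pc):
        """True -> speculate that the load does not alias an older store."""
        return self.table[self._idx(load_pc)] < self.threshold

    def update(self, load_pc, violated):
        i = self._idx(load_pc)
        if violated:                      # misspeculation: flush, train toward "wait"
            self.table[i] = min(self.max_val, self.table[i] + 4)
        else:                             # correct speculation: decay toward "bypass"
            self.table[i] = max(0, self.table[i] - 1)

mdu = CounterMDU()
for outcome in [True, True, False, False, False]:
    print(mdu.predict_bypass(0x400), end=" ")
    mdu.update(0x400, violated=outcome)
```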
{"title":"Exploiting ARMeD Channels By Reverse Engineering ARM Memory Disambiguation Unit","authors":"Chang Liu;Zhouyang Li;Haixia Wang;Pengfei Qiu;Gang Qu;Dongsheng Wang","doi":"10.1109/TCAD.2025.3585078","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3585078","url":null,"abstract":"ARM CPUs are widely used in both embedded systems and personal computers where security considerations are becoming important. Evidently, vulnerabilities on hardware components such as cache and translation look-aside buffer are well-documented. But there are much less studies on other components, especially those in the CPU backend, largely due to the unavailability of their design and implementation details. To address this gap, we present the first in-depth reverse engineering analysis of the memory disambiguation unit (MDU) in the backend of ARM CPUs. Across four microarchitectures from ARM and Apple CPUs, we identify two different MDU designs, switch-based and counter-based. We then analyze the state machine, selection mechanism, and organization of these MDU designs. We further propose new side channels and covert channels, which we call ARMeD channels, that exploit ARM MDU to leak information. We demonstrate with three attacks using ARMeD channels: 1) a cross-process covert channel; 2) website fingerprinting; and 3) a new implementation of the spectre attack. Finally, we present a defense strategy against ARMeD Channels with less than 3% degradation on the MDU’s prediction accuracy.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"1075-1088"},"PeriodicalIF":2.9,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advancements in machine learning offer the potential for faster and more robust optimization approaches for analog circuit design automation. However, fully automated, fast, and process, voltage, and temperature (PVT)-robust sizing algorithms are still lacking, as even the most recent methods require extensive simulations or domain-specific circuit expertise. In this article, we present a PVT-robust analog circuit sizing method, called AnaCraft, that is the first to introduce an adversarial training scheme of multiagent reinforcement learning (RL) for robust circuit design automation. We adopt the soft actor–critic (SAC) agent for circuit sizing, which outperforms other actor–critic agents in stability and robustness. We then introduce a duel-play scheme to address PVT robustness, in which sizing agents cooperate to find optimal circuit parameters while competing with an adversarial PVT agent. We combine this approach with model-based policy optimization: an ensemble of probabilistic models is trained and used to extract many short rollouts of generated data for updating the sizing agents. We test our algorithm on the sizing of operational amplifiers in a 45-nm CMOS technology, as well as on a complex data receiver circuit in a predictive 7-nm FinFET technology, demonstrating the approach's ability to find PVT-robust, power-area-optimal sizes for advanced technologies and circuits. Our method achieves a higher figure of merit with up to $\sim 3\times$ fewer circuit simulations and $\sim 2\times$ less runtime compared to existing state-of-the-art methods.
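The toy loop below conveys the duel-play idea: a sizing agent proposes device parameters while an adversarial PVT agent picks the corner that minimizes the reward, pushing the optimum toward worst-case robustness. Random search stands in for the SAC agents, and the figure-of-merit model and corner list are assumptions for illustration.

```python
import random

# Toy duel-play loop: a "sizing agent" proposes circuit parameters while an
# adversarial "PVT agent" picks the corner that hurts it most.  Random search
# replaces the SAC actor-critic agents; the merit function is a placeholder.

PVT_CORNERS = [("ss", 0.9, 125), ("tt", 1.0, 25), ("ff", 1.1, -40)]

def figure_of_merit(width_um, bias_ua, corner):
    proc, vdd, temp = corner
    gain = width_um * bias_ua * vdd * (0.8 if proc == "ss" else 1.0)
    power = bias_ua * vdd
    return gain / (power + 1e-9) - 0.01 * abs(temp)   # higher is better (assumed)

def worst_case_fom(params):
    # adversarial PVT agent: pick the corner minimizing the sizing agent's reward
    return min(figure_of_merit(*params, corner) for corner in PVT_CORNERS)

best, best_fom = None, float("-inf")
for _ in range(2000):                                  # sizing agent: random proposals
    params = (random.uniform(1, 50), random.uniform(1, 100))
    fom = worst_case_fom(params)
    if fom > best_fom:
        best, best_fom = params, fom
print("best (W_um, I_ua):", [round(p, 2) for p in best],
      "worst-case FoM:", round(best_fom, 2))
```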
{"title":"AnaCraft: Duel-Play Probabilistic-Model-Based Reinforcement Learning for Sample-Efficient PVT-Robust Analog Circuit Sizing Optimization","authors":"Mohsen Ahmadzadeh;Jan Lappas;Norbert Wehn;Georges Gielen","doi":"10.1109/TCAD.2025.3582175","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3582175","url":null,"abstract":"Recent advancements in machine learning offer the potential for finding faster and robust optimization approaches for analog circuit design automation. However, fully automated yet fast and process, voltage, and temperature (PVT)-robust sizing algorithms are still lacking as even the most recent methods continue to require extensive simulations or domain-specific circuit expertise. In this article, we present a PVT-robust analog circuit sizing method, called AnaCraft, that is the first to introduce an adversarial training scheme of multiagent reinforcement learning (RL) for robust circuit design automation. We adopt the soft actor–critic (SAC) agent for circuit sizing, which outperforms other actor–critic agents in stability and robustness. Then, we introduce a duel-play scheme to address PVT-robustness, where sizing agents cooperate to find optimal circuit parameters while competing with an adversarial PVT agent. We combine this approach with the model-based policy optimization method: an ensemble of probabilistic models is trained and used to extract many short rollouts of generated data for updating the sizing agents. We test our algorithm on the sizing of operational amplifiers in a 45-nm CMOS technology, as well as on a complex data receiver circuit in a predictive 7-nm FinFET technology. This demonstrates our approach’s ability to find PVT-robust power-area-optimal sizes for advanced technologies and circuits. Our proposed method achieves a higher figure of merit with up to <inline-formula> <tex-math>$sim 3times $ </tex-math></inline-formula> fewer circuit simulations and <inline-formula> <tex-math>$sim 2times $ </tex-math></inline-formula> less runtime compared to existing state-of-the-art methods.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"901-914"},"PeriodicalIF":2.9,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The scale of large language models (LLMs) has steadily increased over time, leading to enhanced performance in multimodal understanding and complex reasoning, but at a significant execution cost on hardware. Quantization is a promising approach to reduce the computation and memory overhead of LLM deployment. However, maintaining accuracy and efficiency simultaneously is challenging due to the presence of outliers. Moreover, low-bit quantization tends to deteriorate accuracy due to its limited precision. Existing outlier-aware quantization/hardware co-design methods split the sparse outliers from the normal values with dedicated encoding schemes. However, such separation produces a nonuniform data format for normal values and outliers, leading to additional hardware design and inefficient memory access. This article presents Oiso, an outlier-isolated data format for low-bit LLM quantization. Oiso is a unified representation for both outliers and normal values. It isolates the normal values from the outliers, which reduces the impact of outliers on the normal values during quantization. Taking advantage of the uniform format, Oiso arithmetic can be performed using a homogeneous computational unit, and Oiso values can be stored in a standardized format. Hierarchical block encoding with a subblock alignment scheme is introduced to reduce the encoding cost and the hardware overhead. We introduce the Oiso architecture, equipped with Oiso processing elements and encoders tailored for Oiso arithmetic, realizing efficient low-bit LLM inference. Oiso quantization pushes the limits of low-bit LLM quantization, and the Oiso accelerator outperforms the state-of-the-art outlier-aware accelerator design with a $1.26\times$ performance improvement and a 25% energy reduction.
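To illustrate why isolating outliers helps low-bit quantization, the sketch below excludes the largest-magnitude values of a block when computing the quantization scale for the remaining "normal" values, then quantizes the outliers with their own scale. The thresholding and two-scale scheme here are illustrative assumptions, not Oiso's actual hierarchical block encoding.

```python
import numpy as np

def quantize_block_outlier_isolated(x, bits=4, outlier_frac=0.02):
    """Illustrative (not Oiso's encoding): isolate the largest-magnitude values
    in a block so they do not inflate the quantization scale of the rest."""
    k = max(1, int(len(x) * outlier_frac))
    order = np.argsort(np.abs(x))
    outlier_idx, normal_idx = order[-k:], order[:-k]

    qmax = 2 ** (bits - 1) - 1
    scale_n = np.abs(x[normal_idx]).max() / qmax          # scale set by normal values only
    scale_o = np.abs(x[outlier_idx]).max() / qmax         # separate scale for the outliers

    q = np.empty_like(x)
    q[normal_idx] = np.round(x[normal_idx] / scale_n) * scale_n
    q[outlier_idx] = np.round(x[outlier_idx] / scale_o) * scale_o
    return q

rng = np.random.default_rng(0)
x = rng.normal(size=256); x[[3, 77]] = [40.0, -55.0]       # inject outliers
naive_scale = np.abs(x).max() / 7                           # single-scale 4-bit baseline
naive = np.round(x / naive_scale) * naive_scale
print("naive MSE   :", np.mean((x - naive) ** 2))
print("isolated MSE:", np.mean((x - quantize_block_outlier_isolated(x)) ** 2))
```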
{"title":"Oiso: Outlier-Isolated Data Format for Low-Bit Large Language Model Quantization","authors":"Lancheng Zou;Shuo Yin;Mingjun Li;Mingzi Wang;Chen Bai;Wenqian Zhao;Bei Yu","doi":"10.1109/TCAD.2025.3585023","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3585023","url":null,"abstract":"The scale of large language models (LLMs) has steadily increased over time, leading to enhanced performance in multimodal understanding and complex reasoning, but with significant execution overhead on hardware. Quantization is a promising approach to reduce computation and memory overhead for LLM deployment. However, maintaining accuracy and efficiency simultaneously is challenging due to the presence of outliers. Moreover, low-bit quantization tends to deteriorate accuracy due to its limited precision. Existing outlier-aware quantization/hardware co-design methods split the sparse outliers from the normal values with dedicated encoding schemes. However, such separation produces a nonuniform data format for normal values and outliers, leading to additional hardware design and inefficient memory access. This article presents an outlier-isolated data format (Oiso) for low-bit LLM quantization called Oiso. Oiso is a unified representation for both outliers and normal values. It isolates the normal values from the outliers, which can reduce the impact of outliers on the normal values during the quantization process. Taking advantage of the uniform format, Oiso arithmetic can be performed using a homogeneous computational unit, and Oiso values can be stored in a standardized format. Hierarchical block encoding with a subblock alignment scheme is introduced to reduce the encoding cost and the hardware overhead. We introduce the Oiso architecture, equipped with Oiso processing elements and encoders tailored for Oiso arithmetic, realizing efficient low-bit LLM inference. Oiso quantization can push the limits of low-bit LLM quantization, and the Oiso accelerator outperforms the state-of-the-art outlieraware accelerator design with <inline-formula> <tex-math>$1.26times $ </tex-math></inline-formula> performance improvement and 25% energy reduction.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 2","pages":"929-942"},"PeriodicalIF":2.9,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146006925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}