Pub Date : 2025-06-12 | DOI: 10.1109/TCAD.2025.3579326
Sriparna Mandal;Surajeet Ghosh
Next-generation sequencing must cope with exponential growth in sequence databases; the primary challenge is aligning short-read sequences in a time-efficient manner. Despite numerous efforts in contemporary research, existing approaches face tradeoffs among time, power consumption, and resource constraints. A hardware accelerator is presented that uses the Burrows-Wheeler Transformation without any sequence terminator to perform short-read alignment at hardware speed, avoiding the additional storage, operations, and power consumption the terminator incurs. Further, a hardware-based binary search scheme is introduced to reduce the power consumption of the accelerator. As an alternative, a parallel searching mechanism is introduced that completes the search in a single clock cycle. The accelerator is evaluated for 64-to-256 nucleotide reference sequences and 32-to-56 nucleotide query sequences. The parallel search scheme takes ≈11% less time than the binary search-based scheme while consuming ≈1.6%–3.7% more resources and ≈4.5%–23% more power. Compared with the with-terminator method, the accelerator achieves a ≈31.01%–33.13% gain in processing time, ≈31.28%–34.47% saving in hardware resources, ≈33.08%–33.29% saving in storage, and ≈14.03%–50.79% gain in power consumption. Finally, the accelerator exhibits a ≈52× gain in throughput over state-of-the-art architectures without involving any terminator or external memory.
{"title":"Hardware Accelerator for Short-Read DNA Sequence Alignment Using Burrows-Wheeler Transformation","authors":"Sriparna Mandal;Surajeet Ghosh","doi":"10.1109/TCAD.2025.3579326","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3579326","url":null,"abstract":"Next-generation sequencing deals with exponential growth in sequence databases; the primary challenge is aligning short-read sequences in a time-efficient manner. Despite numerous efforts in contemporary research, they unfortunately face tradeoff issues related to time, power consumption, and resource constraints. A hardware accelerator is presented utilizing the Burrows-Wheeler Transformation without involving any sequence terminator to perform short-read sequencing at hardware speed, which eases additional storage, operations, and power consumption. Further, a hardware-based binary search scheme is introduced to reduce power consumption of the accelerator. As an alternative, a parallel searching mechanism is introduced to accomplish the searching operation in a single clock-cycle. The accelerator is evaluated for 64-to-256 nucleotide reference sequences and 32-to-56 nucleotide query sequences. The parallel search scheme consumes ≈11% less time than the binary search-based scheme, consuming ≈1.6–3.7% and ≈4.5%–23% more resources and power. While comparing the accelerator with the with-terminator method, it achieves ≈31.01%–33.13% gain in processing time, ≈31.28%–34.47% saving in hardware resource, ≈33.08%–33.29% saving in storage, and ≈14.03%–50.79% gain in power consumption. Finally, this accelerator exhibits a gain of <inline-formula> <tex-math>$approx 52times $ </tex-math></inline-formula> in throughput without involving any terminator and external memory compared to state-of-the-art architectures.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"547-551"},"PeriodicalIF":2.9,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-10 | DOI: 10.1109/TCAD.2025.3578328
Chen-Yu Hsieh;Yu-En Lin;Yi-Yu Liu
With the increasing number of I/O pins in highly integrated semiconductor products, semiconductor packaging has become an essential yet complex part of integrated circuit (IC) design. The substrate plays an important role in advanced semiconductor packaging, providing the chip with electrical connections and heat dissipation. While numerous studies have addressed the substrate routing problem, only one state-of-the-art work provides a customized routing flow specifically designed for wire-bonding packages with fine-pitch ball grid arrays (FBGAs), which are more widely used than advanced packaging due to their maturity and lower cost. However, the existing router suffers from unsatisfactory routability due to its simplistic implementation and insufficient consideration of finger connections. Therefore, this article proposes several optimization heuristics, such as finger accessibility enhancement, progressive rerouting, and half-grid rerouting, to further improve the overall routing completion rate. Experimental results show that the proposed heuristics avoid routing resource wastage, achieve better routing quality, and eliminate design-rule violations.
{"title":"Optimization Heuristics for Grid-Based Integer Linear Programming Package Substrate Router","authors":"Chen-Yu Hsieh;Yu-En Lin;Yi-Yu Liu","doi":"10.1109/TCAD.2025.3578328","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3578328","url":null,"abstract":"With the increasing number of I/O pins in highly integrated semiconductor products, semiconductor packaging has become an essential yet complex part of integrated circuit (IC) design. The substrate plays an important role in advanced semiconductor packaging and provides the chip with electrical connections and heat dissipation. While numerous studies have addressed the substrate routing problem, only one state-of-the-art work provides a customized routing flow specifically designed for packages with wire-bonding style and fine-pitch ball grid arrays (FBGA), which are more widely used than advanced packaging due to their maturity and lower cost. However, the existing router suffers from unsatisfactory routability due to its simplistic implementation and lack of necessary consideration for finger connections. Therefore, this article proposes several optimization heuristics, such as finger accessibility enhancement, progressive rerouting, and half-grid rerouting techniques, to further improve the overall routing completion rate. Experimental results show that the proposed heuristics are capable of avoiding routing resource wastage, achieving better routing quality, and eliminating design-rule violations.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"552-556"},"PeriodicalIF":2.9,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-09 | DOI: 10.1109/TCAD.2025.3577971
Siyuan Liang;Zhen Zhuang;Kai-Yuan Chao;Bei Yu;Tsung-Yi Ho
Recently, the challenge of integrating an increasing number of transistors on a single die to adhere to Moore's Law has spurred the need for innovative packaging solutions. Power/ground planes are integral to packages, and designers typically strive to maximize their size: large planes provide shielding and maintain constant impedance for adjacent high-speed signal wires, benefiting signal integrity, and they also help reduce DC IR drops, enhancing power integrity. However, the necessity for multiple power/ground nets, each requiring independent power/ground planes within a package, makes the optimal allocation of limited free space a complex task. This article introduces a game-theoretic optimization method aimed at evenly mitigating DC IR drops across multilayer package power/ground planes. By formulating the ideal power/ground plane design as a game, we enhance the use of package space and realize a design with evenly distributed DC IR drops across all power/ground planes; this is accomplished by adjusting strategies until the allocation of free space reaches a Nash equilibrium. Additionally, we propose a rapid multilayer power/ground plane DC IR drop evaluation and a power/ground plane legalization method to support our optimization approach.
{"title":"Multilayer Package Power/Ground Planes Synthesis With Balanced DC IR Drops: A Game-Theoretic Optimization Approach","authors":"Siyuan Liang;Zhen Zhuang;Kai-Yuan Chao;Bei Yu;Tsung-Yi Ho","doi":"10.1109/TCAD.2025.3577971","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3577971","url":null,"abstract":"Recently, the challenge of integrating an increasing number of transistors on a single die to adhere to Moore’s Law has spurred the need for innovative packaging solutions. Power/ground planes are integral to packages, and designers typically strive to maximize their size. This provides shielding and maintains constant impedance for adjacent high-speed signal wires, benefiting signal integrity. Additionally, large power/ground planes help reduce DC IR drops, enhancing power integrity. However, the necessity for multiple power/ground nets, each requiring independent power/ground planes within a package, makes the optimal allocation of limited free space a complex task. This article introduces a game-theoretic optimization method aimed at evenly mitigating DC IR drops across the multilayer package power/ground planes. In the formulated game of achieving the ideal power/ground plane design, we can enhance the use of package space and realize a design with evenly distributed DC IR drops across all power/ground planes. This is accomplished by adjusting strategies and reaching a state of Nash equilibrium in the allocation of free space. Additionally, we propose a rapid multilayer power/ground plane DC IR drop evaluation and a power/ground plane legalization method to bolster our optimization method.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"453-465"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11028916","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-09 | DOI: 10.1109/TCAD.2025.3578297
Cristiana Bolchini;Alberto Bosio;Luca Cassano;Antonio Miele;Salvatore Pappalardo;Dario Passarello;Annachiara Ruospo;Ernesto Sanchez;Matteo Sonza Reorda;Vittorio Turco
The reliability assessment of systems powered by artificial intelligence (AI) is becoming a crucial step prior to their deployment in safety- and mission-critical systems. Recently, many efforts have been made to develop sophisticated techniques to evaluate and improve the resilience of AI models against random hardware faults. However, due to the intrinsic nature of such models, comparing the results obtained in state-of-the-art works is difficult, as common reference models are missing. Moreover, resilience is strongly influenced by the training process, the adopted framework, the data representation, and so on. To provide a common ground for future research on convolutional neural network (CNN) resilience analysis/hardening, this work proposes a first benchmark suite of deep learning (DL) models commonly adopted in this context, providing the models, the training/test data, and the resilience-related information (fault list, coverage, etc.) that can be used as a baseline for fair comparison. To this end, this research identifies a set of axes that have an impact on resilience and classifies several popular CNN models, in both PyTorch and TensorFlow, along these axes. Some final considerations are drawn, showing the relevance of a benchmark suite tailored to the resilience context.
{"title":"Benchmark Suite for Resilience Assessment of Deep Learning Models","authors":"Cristiana Bolchini;Alberto Bosio;Luca Cassano;Antonio Miele;Salvatore Pappalardo;Dario Passarello;Annachiara Ruospo;Ernesto Sanchez;Matteo Sonza Reorda;Vittorio Turco","doi":"10.1109/TCAD.2025.3578297","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3578297","url":null,"abstract":"The reliability assessment of systems powered by artificial intelligence (AI) is becoming a crucial step prior to their deployment in safety and mission-critical systems. Recently, many efforts have been made to develop sophisticated techniques to evaluate and improve the resilience of AI models against the occurrence of random hardware faults. However, due to the intrinsic nature of such models, the comparison of the results obtained in state-of-the-art works is crucial, as reference models are missing. Moreover, their resilience is strongly influenced by the training process, the adopted framework and data representation, and so on. To enable a common ground for future research targeting convolutional neural networks (CNNs) resilience analysis/hardening, this work proposes a first benchmark suite of deep learning (DL) models commonly adopted in this context, providing the models, the training/test data, and the resilience-related information (fault list, coverage, etc.) that can be used as a baseline for fair comparison. To this end, this research identifies a set of axes that have an impact on the resilience and classifies some popular CNN models, in both PyTorch and TensorFlow. Some final considerations are drawn, showing the relevance of a benchmark suite tailored for the resilience context.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"418-427"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029030","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-06 | DOI: 10.1109/TCAD.2025.3577539
Lian Yao;Jigang Wu;Peng Liu;Siew-Kei Lam
This article presents a comprehensive synthesis framework, named DAGSIS, for memristor-aided logic (MAGIC)-based in-memory computing systems. DAGSIS addresses the limitations of prior works, such as overlooking the benefits of MAGIC's high fan-in capability and the impact of global netlist properties on the scheduling of the computation sequence (CS). DAGSIS performs optimization in two synthesis stages. In the technology-independent optimization stage, DAGSIS encourages the merging of nodes in the network to reduce circuit size by exploiting equivalent transformations of multiplexers (MUXes). In the CS scheduling stage, DAGSIS introduces two schemes for optimizing area overhead and latency, respectively. For area optimization, DAGSIS maximizes the utilization of memristive cells by erasing expired data as early as possible. For latency optimization, DAGSIS aims to minimize erasing operations by maximizing the number of erased cells in each epoch of filling the memory. To achieve better CS scheduling, DAGSIS introduces two design rules that fully consider global attributes of the circuit, such as critical paths and high fan-out nodes. Experimental results show that DAGSIS reduces circuit size by 6.69% on the ISCAS'85 benchmarks compared to ABC, an open-source logic synthesis framework. Compared to state-of-the-art works, DAGSIS achieves reductions of 40.68% and 12.67% in area overhead and erasing operations, respectively, on the ISCAS'85 and EPFL benchmarks. These improvements further translate into a reduction in energy consumption of up to 13.7%.
{"title":"DAGSIS: A DAG-Aware MAGIC-Based Synthesis Framework for In-Memory Computing","authors":"Lian Yao;Jigang Wu;Peng Liu;Siew-Kei Lam","doi":"10.1109/TCAD.2025.3577539","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3577539","url":null,"abstract":"This article presents a comprehensive synthesis framework, named DAGSIS, for memristor-aided logic (MAGIC)-based in-memory computing system. DAGSIS addresses the limitations of prior works, such as overlooking the benefits of MAGIC’s high fan-in capability and the impact of global properties of netlists on the scheduling of computation sequence (CS). DAGSIS achieves the optimization in two synthesis stages. In the technology-independent optimization stage, DAGSIS encourages the merging of nodes in the network to reduce circuit size, by utilizing equivalent transformation of multiplexer (MUX). In the CS scheduling stage, DAGSIS introduces two schemes for optimizing area overhead and latency, respectively. For area optimization, DAGSIS maximizes the utilization of memristive cells by erasing the expired data as early as possible. For latency optimization, DAGSIS aims to minimize erasing operations, by maximizing the number of erased cells in each epoch of filling the memory. To achieve better CS scheduling, DAGSIS introduces two design rules to guide CS scheduling, which fully considers the global attributes of circuit design, such as critical path and high fan-out nodes. Experiment results show that DAGSIS reduces the circuit size by 6.69% on ISCAS’85 benchmarks compared to ABC tool, an open-source logic synthesis framework. Compared to the state-of-the-art works, DAGSIS achieves a reduction of 40.68% and 12.67% in area overhead and erasing operations, respectively, on ISCAS’85 and EPFL benchmarks. The improvements are further translated into the reduction in energy consumption by up to 13.7%.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"373-386"},"PeriodicalIF":2.9,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-05 | DOI: 10.1109/TCAD.2025.3577018
Alex Goulet;Roni Khazaka
A parallel non-Monte Carlo transient noise analysis method for efficient general nonlinear analysis is presented. The proposed method extends a previous method to include flicker noise. The implementation of the proposed method in a SPICE-like circuit simulator is described. Additional practical considerations are discussed. Higher parallel efficiency is achieved by balancing the parallel loads. The optimal number of processors is automatically selected as part of load balancing. A new time domain flicker noise circuit representation that increases the computational efficiency of the proposed method and the underlying serial method is presented. Three examples of transient noise analysis are provided: a low-noise amplifier circuit, a mixer circuit, and a distributed amplifier circuit.
{"title":"Parallel Non-Monte Carlo Transient Noise Simulation With Flicker Noise","authors":"Alex Goulet;Roni Khazaka","doi":"10.1109/TCAD.2025.3577018","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3577018","url":null,"abstract":"A parallel non-Monte Carlo transient noise analysis method for efficient general nonlinear analysis is presented. The proposed method extends a previous method to include flicker noise. The implementation of the proposed method in a SPICE-like circuit simulator is described. Additional practical considerations are discussed. Higher parallel efficiency is achieved by balancing the parallel loads. The optimal number of processors is automatically selected as part of load balancing. A new time domain flicker noise circuit representation that increases the computational efficiency of the proposed method and the underlying serial method is presented. Three examples of transient noise analysis are provided: a low-noise amplifier circuit, a mixer circuit, and a distributed amplifier circuit.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"323-334"},"PeriodicalIF":2.9,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-04 | DOI: 10.1109/TCAD.2025.3576314
Jaeyoung Joung;Sangjun Lee;Jongho Park;Jaehyun Kim;Laesang Jung;Sungho Kang
Scan is one of the representative design-for-testability (DFT) techniques for testing sequential circuits. However, the additional hardware overhead and performance degradation caused by scan insertion can be unacceptable in specific designs. Partial scan has been applied as an alternative to full scan to balance these issues. However, previous cell selection algorithms incur high computational complexity that depends on the number of circuit components, including flip-flops, and do not sufficiently consider the analysis of large-scale circuits. In this article, a graph theory-based partial scan approach is proposed to effectively address the issues caused by scan insertion and reduce the load of structural analysis. The proposed algorithm partitions the circuit into multiple portions using graph clustering. Scan cells are selected from each subgraph to reduce sequential test generation complexity and improve testability. By analyzing the circuit partially, the proposed approach not only addresses the complexity problem of structural analysis in large-scale circuits but can also be applied generally, regardless of circuit size or the number of components. The experimental results show that the proposed algorithm achieves significantly reduced processing time (in seconds) and reduces scan cells by approximately 11.47% with only 0.21% test coverage loss on average compared to a full scan design.
{"title":"CLAPS: A Graph Clustering-Based Approach for Partial Scan Design","authors":"Jaeyoung Joung;Sangjun Lee;Jongho Park;Jaehyun Kim;Laesang Jung;Sungho Kang","doi":"10.1109/TCAD.2025.3576314","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3576314","url":null,"abstract":"Scan is one of the representative design for testability (DFT) techniques designed to test sequential circuits. However, the additional hardware overhead and performance degradation caused by scan insertion can be unacceptable in specific designs. Partial scan has been applied as an alternative to the scan to balance these issues. However, previous cell selection algorithms accompany high computational complexity depending on the number of circuit components, including flip-flops, and do not sufficiently consider the analysis of large-scale circuits. In this article, a graph theory-based partial scan approach is proposed to effectively address the issues caused by scan insertion and reduce the load of structural analysis. The proposed algorithm partitions the circuit into multiple portions using graph clustering. Scan cells are selected from each subgraph to reduce sequential test generation complexity and improve testability. By partially analyzing the circuit, the proposed approach not only addresses the complexity problem of structural analysis in large-scale circuits but also can be generally applied regardless of circuit size or the number of components. The experimental results show that the proposed algorithm achieves significantly reduced processing time in seconds and reduces scan cells by approximately 11.47% with only 0.21% of test coverage loss on average compared to full scan design.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"396-406"},"PeriodicalIF":2.9,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-04 | DOI: 10.1109/TCAD.2025.3576332
Weiran Chen;Zaitian Chen;Bei Yu;Song Chen;Yi Kang;Qi Xu
Recently, resistive switching random access memory (ReRAM)-based hardware accelerators have demonstrated unprecedented performance compared to digital accelerators. However, due to limitations in the manufacturing process and large-scale integration, real ReRAM-based crossbar arrays typically suffer from several significant nonideal effects, including IR-drop, stuck-at faults, and device noise. These nonideal effects degrade signal integrity and performance, particularly in the crossbar structures used to build high-density ReRAMs. Therefore, a fast and efficient software solution that can predict the effects of IR-drop without involving expensive hardware is highly desirable. In this work, addressing the main limitations of existing simulation methods, such as slow speed and high resource costs, we propose an efficient analysis of large-scale ReRAM crossbar arrays and the corresponding nonideal factors based on sparse matrix modeling. We classify nonideal factors into linear (e.g., IR-drop) and nonlinear (e.g., shot noise) categories. Linear factors are solved with supernodal sparse LU factorizations. The array-level results show that, compared to SPICE simulation, our method achieves a numerical solution accuracy of $10^{-15}$ while running $506.8\sim 1253.3\times$ faster and using $17.46\sim 42934.3\times$ less memory. For nonlinear factors, we propose two solutions based on different requirements. In one method, we obtain an approximate initial solution by solving a linear system while disregarding the nonlinear contributions and subsequently apply an extended Anderson acceleration method to solve the nonlinear equation, which is suitable for high-precision solutions. The other method simplifies the nonlinear equation into an equivalent linear form. Theoretical validation confirms the effectiveness of this method, which significantly enhances simulation speed while maintaining accuracy. Moreover, we build a high-precision ReRAM accelerator architecture with real-time compensation. Experimental results demonstrate that the proposed architecture effectively mitigates accuracy loss caused by nonideal factors.
{"title":"Real-Time Compensation Framework for Large-Scale ReRAM-Based Sparse LU Factorization","authors":"Weiran Chen;Zaitian Chen;Bei Yu;Song Chen;Yi Kang;Qi Xu","doi":"10.1109/TCAD.2025.3576332","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3576332","url":null,"abstract":"Recently, resistive switching random access memory (ReRAM)-based hardware accelerators have demonstrated unprecedented performance compared to digital accelerators. However, due to limitations in the manufacturing process and large-scale integration, several significant nonideal effects, including IR-Drop, stuck-at-fault, and device noises in real ReRAM-based crossbar arrays, are typically incurred. These nonideal effects degrade signal integrity and performance, particularly in crossbar structures used for building high-density ReRAMs. Therefore, finding a fast and efficient software solution that can predict the effects of IR-drop without involving expensive hardware is highly desirable. In this work, addressing the main limitations of existing simulation methods, such as slow speed and high-resource costs, we propose an efficient analysis of large-scale ReRAM crossbar arrays and the corresponding nonideal factors based on sparse matrix modeling. We classify nonideal factors into linear (e.g., IR-drop) and nonlinear categories (e.g., shot noise). For linear factors, super-nodal sparse LU factorizations are used to solve. The array-level results show that compared to SPICE simulation, our method achieves a numerical solution accuracy of <inline-formula> <tex-math>$10^{-15}$ </tex-math></inline-formula> with <inline-formula> <tex-math>$506.8 sim 1253.3times $ </tex-math></inline-formula> faster and <inline-formula> <tex-math>$17.46 sim 42934.3times $ </tex-math></inline-formula> reduced memory usage. For nonlinear factors, we propose two solutions based on different requirements. In one method, we obtain an approximate initial solution by solving a linear system while disregarding the nonlinear contributions and subsequently apply an extended Anderson acceleration method to solve the nonlinear equation, which is suitable for high-precision solutions. Another method simplifies the nonlinear equation into an equivalent linear form. Theoretical validation confirms the effectiveness of this method, significantly enhancing simulation speed while maintaining accuracy. Moreover, we build a high-precision ReRAM accelerator architecture with real-time compensation. Experimental results demonstrate that the proposed architecture effectively mitigates accuracy loss caused by nonideal factors.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"309-322"},"PeriodicalIF":2.9,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-04 | DOI: 10.1109/TCAD.2025.3576320
Yu Li;Biao Huang;Jinyin Hu;Cheng Zhuo
Dynamic deep neural networks, particularly multiexit networks, are increasingly recognized for their efficiency in edge-cloud scenarios. However, they are vulnerable to latency attacks that can degrade performance by increasing computation time. Current attack strategies often require white-box access to the model or lead to significant drops in inference accuracy, making them easily detectable. This article introduces SPLAT, a novel approach for executing stealthy and practical latency attacks on dynamic multiexit models under black-box conditions. SPLAT employs a two-stage mechanism: the first stage generates coarse-grained attack inputs using a functional surrogate model, while the second stage refines these perturbations through an efficient query strategy to enhance stealthiness and effectiveness. Extensive experiments validate that SPLAT significantly outperforms existing methods across various models and datasets.
{"title":"SPLAT: Revisiting Latency Attack on Dynamic Neural Networks","authors":"Yu Li;Biao Huang;Jinyin Hu;Cheng Zhuo","doi":"10.1109/TCAD.2025.3576320","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3576320","url":null,"abstract":"Dynamic deep neural networks, particularly multiexit networks, are increasingly recognized for their efficiency in edge-cloud scenarios. However, they are vulnerable to latency attacks that can degrade performance by increasing computation time. Current attack strategies often require white-box access to the model or lead to significant drops in inference accuracy, making them easily detectable. This article introduces SPLAT, a novel approach for executing stealthy and practical latency attacks on dynamic multiexit models under black-box conditions. SPLAT employs a two-stage mechanism: the first stage generates coarse-grained attack inputs using a functional surrogate model, while the second stage refines these perturbations through an efficient query strategy to enhance stealthiness and effectiveness. Extensive experiments validate that SPLAT significantly outperforms existing methods across various models and datasets.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"506-518"},"PeriodicalIF":2.9,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-04 | DOI: 10.1109/TCAD.2025.3576333
Yang Liu;Shuyang Li;Yu Li;Ruiqi Chen;Shun Li;Jun Yu;Kun Wang
Nonlinear activation plays an essential role in neural networks (NNs) for their generalization ability. However, implementing such intricate mathematical operations on hardware platforms, including field-programmable gate arrays (FPGAs), presents significant challenges. Prior works based on piecewise functions or look-up tables (LUTs) have struggled to balance precision requirements against hardware overhead and often necessitate complex manual intervention. To address these issues, this article proposes DIF-LUT Pro, an automated tool for simple yet scalable approximation of various nonlinear activations on FPGA. Specifically, the proposed algorithm achieves a self-adaptive hardware design oriented toward a target precision, using piecewise linear matching to roughly fit the function derivative and a range-addressable LUT to offset the difference. Moreover, DIF-LUT Pro integrates the algorithm into an automated tool, allowing users to configure the customized interface and generate the corresponding hardware description language (HDL) code with a single click. Experimental results show that 1) DIF-LUT Pro features robust automation and fair generality, capable of generating equitable hardware designs under various user configurations across different FPGA platforms and 2) DIF-LUT Pro produces approximations that are simple yet effective, achieving competitive performance compared to previous expert-crafted designs. Furthermore, two detailed case studies demonstrate the efficient application of DIF-LUT Pro to NeRF and SEResnet, proving its practical value. Our source code is open-source and available at https://github.com/AdrianLiu00/DIF-LUT-Tool.
{"title":"DIF-LUT Pro: An Automated Tool for Simple yet Scalable Approximation of Nonlinear Activation on FPGA","authors":"Yang Liu;Shuyang Li;Yu Li;Ruiqi Chen;Shun Li;Jun Yu;Kun Wang","doi":"10.1109/TCAD.2025.3576333","DOIUrl":"https://doi.org/10.1109/TCAD.2025.3576333","url":null,"abstract":"Nonlinear activation plays an essential role in neural networks (NNs) for their generalization ability. However, implementing intricate mathematical operations on hardware platforms, including field-programmable gate arrays (FPGAs), presents significant challenges. Prior works based on piecewise functions or look-up table (LUT) have encountered difficulties in balancing precision requirements with fair hardware overhead and often necessitating complex manual interventions. To address these issues, this article proposes DIF-LUT Pro, an automated tool for simple yet scalable approximation for various nonlinear activations on FPGA. Specifically, the proposed algorithm achieves self-adaptive hardware design oriented toward target precision, by piecewise linear matching to fit the function derivative roughly and range addressable LUT to offset the difference. Moreover, DIF-LUT Pro integrates the algorithm into an automated tool, allowing users to configure the customized interface and generate the corresponding hardware description language (HDL) code with a single click. Experimental results show that 1) DIF-LUT Pro features robust automation and fair generality, capable of generating equitable hardware designs under various user configurations across different FPGA platforms and 2) DIF-LUT Pro produces approximations that are simple yet effective, achieving competitive performance compared to previous expert-crafted designs. Furthermore, two detailed case studies demonstrate the efficient application of DIF-LUT Pro on NeRF and SEResnet, proving its practical value. Our source code is open-source and available at <uri>https://github.com/AdrianLiu00/DIF-LUT-Tool</uri>.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"45 1","pages":"295-308"},"PeriodicalIF":2.9,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145904285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}