Pub Date: 2025-07-25; DOI: 10.1109/TVLSI.2025.3587928
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information","authors":"","doi":"10.1109/TVLSI.2025.3587928","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3587928","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"C2-C2"},"PeriodicalIF":2.8,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11096975","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144705035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-25; DOI: 10.1109/TVLSI.2025.3587930
{"title":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information","authors":"","doi":"10.1109/TVLSI.2025.3587930","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3587930","url":null,"abstract":"","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 8","pages":"C3-C3"},"PeriodicalIF":2.8,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11096974","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144702106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-24; DOI: 10.1109/TVLSI.2025.3590597
Hui Chen;Lianghua Quan;Weiqiang Liu;Zhonghai Lu
Binary and decimal logarithms (BDLs) are commonly used in science and engineering. This brief presents a theory of the radix-4 generalized hyperbolic coordinate rotation digital computer (GH-CORDIC) that computes them directly. Compared with the traditional hyperbolic CORDIC (TH-CORDIC), both logarithms can be calculated without extra dividers or multipliers. Compared with the existing GH-CORDIC, the proposed method requires fewer iterations for the same high precision. Theoretical derivation and software simulation show that the calculation accuracy reaches the order of $10^{-7}$ and that the number of iterations is reduced by more than 50%. Synthesis results of the hardware implementation show that the proposed architecture reduces area by 53.44% and power consumption by 46.36% compared with the latest radix-2 GH-CORDIC method.
{"title":"High-Precision Low-Latency Method and Architecture for Computing Binary and Decimal Logarithms","authors":"Hui Chen;Lianghua Quan;Weiqiang Liu;Zhonghai Lu","doi":"10.1109/TVLSI.2025.3590597","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3590597","url":null,"abstract":"Binary and decimal logarithms (BDLs) are commonly used in science and engineering. This brief presents a theory of the radix-4 generalized hyperbolic coordinate rotation digital computer (GH-CORDIC) to compute them directly. Compared with traditional hyperbolic CORDIC (TH-CORDIC), the two logarithms can be calculated without extra dividers or multipliers. Compared with the GH-CORDIC, this theory has low iterations under the same high precision. Through theoretical derivation and software simulation, we can find that the calculation accuracy can reach the magnitude of $10^{-7}$, and the number of iterations can be reduced by more than 50%. Through hardware implementation, the synthesis report shows that the proposed architecture can save 53.44% area and 46.36% power consumption compared with the latest radix-2 GH-CORDIC method.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 11","pages":"3186-3190"},"PeriodicalIF":3.1,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145398669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
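To make the "no extra multiplier" point concrete, here is a minimal radix-2 hyperbolic CORDIC sketch. It assumes the generalized idea can be illustrated by pre-scaling the elementary-angle table by 2/ln(b), so that vectoring mode accumulates log_b(v) directly instead of producing ln(v) and then multiplying by a constant. This is not the brief's radix-4 GH-CORDIC; the function names and the iteration count are illustrative assumptions, and floating point stands in for fixed-point hardware.

```python
import math


def hyperbolic_schedule(n_iter):
    """Shift indices 1, 2, 3, 4, 4, 5, ..., 13, 13, ...; 4, 13, 40, ... repeat for convergence."""
    schedule, repeat, i = [], 4, 1
    while len(schedule) < n_iter:
        schedule.append(i)
        if i == repeat and len(schedule) < n_iter:
            schedule.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return schedule


def cordic_log(v, base, n_iter=40):
    """log_base(v) by hyperbolic CORDIC vectoring with a pre-scaled angle table.

    Converges for roughly 0.107 < v < 9.35; wider ranges need argument reduction.
    """
    if v <= 0:
        raise ValueError("logarithm needs a positive argument")
    schedule = hyperbolic_schedule(n_iter)
    # Built offline: atanh(2^-i) scaled by 2/ln(base), so no run-time multiply is needed.
    table = [2.0 * math.atanh(2.0 ** -i) / math.log(base) for i in schedule]
    x, y, z = v + 1.0, v - 1.0, 0.0
    for i, angle in zip(schedule, table):
        sigma = -1.0 if y > 0 else 1.0  # rotate so that y is driven toward 0
        x, y = x + sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * angle  # z accumulates 2*atanh((v-1)/(v+1))/ln(base) = log_base(v)
    return z
```

Calling `cordic_log(8.0, 2)` should give approximately 3 and `cordic_log(2.0, 10)` approximately 0.30103. Only the angle table depends on the base, which is why one datapath can serve both logarithms in this sketch.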
Pub Date: 2025-07-24; DOI: 10.1109/TVLSI.2025.3590317
Anmol Singh Narwariya;Srisubha Kalanadhabhatta;Amit Acharyya
The presence of counterfeit recycled ICs (CRICs) in the global semiconductor supply chain is a major concern today. CRICs are less reliable and pose a serious threat when employed in safety-critical systems. Accurate age prediction for integrated circuits (ICs) is crucial for implementing preventive and mitigation strategies that avoid unexpected failures in the field. Precise age estimates improve the reliability and performance of electronic systems, because maintenance and replacements can be scheduled proactively, reducing the risk of sudden breakdowns. Accurate age prediction also helps extend the lifespan of electronic devices, which in turn minimizes electronic waste. This reduces environmental impact and supports the broader goal of green computing by promoting more sustainable and resource-efficient technology practices. In this article, we introduce a method for detecting a CRIC and estimating its age using the existing input-output (IO) pad structures of sensorless chips. The proposed methodology measures the voltage drop across the protection diodes in the IO pad structure and applies it to the proposed polynomial regression model to estimate age. Because it requires no additional sensing circuitry, the methodology incurs no area overhead, and since no dedicated on-chip sensor is needed, it can be applied to ICs in production. The proposed polynomial regression model achieves a mean squared error (MSE) of 1.77 h, an improvement of at least 99.7% over state-of-the-art methodologies.
{"title":"Leveraging IO Pad Protection Diodes for Recycled IC Detection and Age Estimation Using Polynomial Regression","authors":"Anmol Singh Narwariya;Srisubha Kalanadhabhatta;Amit Acharyya","doi":"10.1109/TVLSI.2025.3590317","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3590317","url":null,"abstract":"The presence of counterfeit recycled ICs (CRICs) in the global semiconductor supply chain is a major concern in the present-day world. These CRICs are less reliable and have become a serious threat to the ICs employed in safety–critical systems. Accurate age prediction for integrated circuits (ICs) is crucial for implementing preventative and mitigation strategies to avoid unexpected failures in the field. By precisely estimating the age of an IC, electronic systems can benefit from improved reliability and performance, as maintenance and replacements can be scheduled proactively, and reducing the risk of sudden breakdowns. Furthermore, accurate age prediction plays a vital role in extending the lifespan of electronic devices, which in turn helps to minimize electronic waste. This not only reduces the environmental impact but also supports the broader goal of green computing by promoting more sustainable and resource-efficient technology practices. In this article, we introduce a method for detecting a CRIC and estimating its age by utilizing the existing input-output (IO) pad structures targeting sensorless chips. The proposed methodology estimates age by measuring the voltage drop across the protection diodes present in the IO pad structure and applying this voltage drop to the proposed polynomial regression model. This methodology requires no additional sensory circuit, resulting in no area overhead. As there is no requirement for a special on-chip sensor, the proposed methodology can be used to detect the age of an IC in production. Our proposed polynomial regression model achieves a mean squared error (MSE) of 1.77 h, with a minimum improvement of 99.7% over the state-of-the-art methodologies.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 11","pages":"3166-3175"},"PeriodicalIF":3.1,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145405345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
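The age-estimation step maps a measured protection-diode voltage drop to an age through a fitted polynomial. A minimal, dependency-free sketch of that kind of model follows: an ordinary least-squares polynomial fit via the normal equations, plus the MSE metric. The polynomial degree, the choice of the voltage shift relative to a fresh chip as the input, and all function names are illustrative assumptions; the article's actual features, degree, and training data are not reproduced here.

```python
def fit_polynomial(xs, ys, degree):
    """Ordinary least-squares fit; returns coeffs with coeffs[j] multiplying x**j."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y with A[i][j] = xs[i] ** j.
    ata = [[sum(x ** (r + c) for x in xs) for c in range(n)] for r in range(n)]
    aty = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, n):
            f = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    # Back substitution on the upper-triangular system.
    coeffs = [0.0] * n
    for r in reversed(range(n)):
        acc = aty[r] - sum(ata[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = acc / ata[r][r]
    return coeffs


def predict_age(coeffs, x):
    """Evaluate the fitted polynomial at input x (e.g., a diode-voltage shift)."""
    return sum(c * x ** j for j, c in enumerate(coeffs))


def mse(coeffs, xs, ys):
    """Mean squared error of the model over a labeled set."""
    return sum((predict_age(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

In use, `xs` would hold measured voltage shifts and `ys` known stress durations from aged reference chips; the normal-equation approach is only adequate for low degrees and well-scaled inputs, which is why centering the voltage (using the shift rather than the absolute drop) is assumed here.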
Pub Date: 2025-07-24; DOI: 10.1109/TVLSI.2025.3587502
Kevin Vicuña;Massimo Vatalaro;Frédéric Amiel;Felice Crupi;Lionel Trojman
This work introduces a novel 128-bit transient effect ring oscillator (TERO)-based physically unclonable function (PUF) designed for Intel MAX 10 field-programmable gate arrays (FPGAs). It is a reliable PUF solution for security applications that targets high stability and area efficiency. The proposed cell consists of two cross-coupled reconfigurable ring oscillators (ROs) designed to achieve zero observed instability both at golden-key (GK) conditions and under temperature variations. Unlike conventional application-specific integrated circuit (ASIC) approaches, which use the mean cycles to collapse (CTC), the calibration here is based on the CTC standard deviation extracted at GK conditions, namely, 1.2 V and $25~^{\circ}$C. Experimental results demonstrate that, after calibration and with 1.64% of the bits masked, the proposed solution shows a bit error rate (BER) lower than $\mathbf{1.56\times 10^{-4}\%}$ (the minimum observable value for the adopted statistical set) across the entire analyzed temperature range. The solution also shows an excellent uniqueness of 49.78%, close to the ideal value of 50%. This is achieved at a cost of two logic array blocks (LABs) per bit.
{"title":"Highly Stable Reconfigurable TERO PUF Architecture for Hardware Security Applications","authors":"Kevin Vicuña;Massimo Vatalaro;Frédéric Amiel;Felice Crupi;Lionel Trojman","doi":"10.1109/TVLSI.2025.3587502","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3587502","url":null,"abstract":"This work introduces a novel 128-bit transient effect ring oscillator (TERO)-based physically unclonable function (PUF) designed for Intel MAX 10 field-programmable gate arrays (FPGAs). A reliable PUF solution suitable for security applications targeting high stability and area efficiency is presented. The proposed cell consists of two cross-coupled reconfigurable ring oscillators (ROs) aiming to achieve zero-observed instability at both golden key (GK) and under temperature variations. Conversely to the conventional application-specific integrated circuits (ASIC) approaches, which use the mean cycles to collapse (CTC), here the calibration process was performed by considering the CTC standard deviation extracted at GK conditions, namely, 1.2 V and $25~^{\\circ}$C. The experimental results demonstrate that after the calibration process and considering a 1.64% of masked bits, the proposed solution shows a bit error rate (BER) lower than $\\mathbf{1.56\\times 10^{-4}\\%}$, the minimum observable quantity for the adopted statistical set across the entire analyzed temperature range. Further, the solution also shows an excellent uniqueness of 49.78%, close to the ideal value of 50%. This is achieved at the cost of two logic array blocks (LABs) per bit.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 10","pages":"2873-2882"},"PeriodicalIF":3.1,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11095825","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
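Uniqueness and BER are the two PUF figures of merit quoted in this abstract. A short sketch of their textbook definitions follows: uniqueness as the mean pairwise inter-device Hamming distance, and BER as the fraction of unmasked bits that differ from the golden-key response. Responses are lists of 0/1, and a truthy `mask[i]` marks an unstable bit excluded from the key, analogous to the masked bits mentioned above. These are standard definitions assumed for illustration, not the article's evaluation code.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def uniqueness(responses):
    """Average pairwise inter-device Hamming distance, in percent (ideal: 50%)."""
    n_bits = len(responses[0])
    pairs = list(combinations(responses, 2))
    return 100.0 * sum(hamming(a, b) for a, b in pairs) / (len(pairs) * n_bits)


def bit_error_rate(golden, readouts, mask=None):
    """Percent of unmasked bits, over all readouts, that differ from the golden key."""
    keep = [i for i in range(len(golden)) if not (mask and mask[i])]
    errors = sum(r[i] != golden[i] for r in readouts for i in keep)
    return 100.0 * errors / (len(readouts) * len(keep))
```

For the 128-bit design, `responses` would hold one golden key per FPGA and `readouts` repeated evaluations of one device across temperatures; a uniqueness near 50% means two devices disagree on about half their bits.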
Pub Date: 2025-07-22; DOI: 10.1109/TVLSI.2025.3585747
Jincheng Wang;Yuhao Shu;Lintao Lan;Yifei Li;Bin Ning;Yuxin Zhou;Hongtu Zhang;Yajun Ha
With the growth of big data, there is increasing demand for high-density searching, and content-addressable memory (CAM) is an attractive solution because it performs parallel searches. However, density is constrained by the difficulty of further shrinking the SRAM cells commonly used in traditional CAM implementations. To address this issue, we propose a novel CAM built on a compact five-transistor-zero-capacitor (5T0C) embedded dynamic random access memory (eDRAM) cell for high-density searching and logic-in-memory applications. First, we propose a 5T0C eDRAM gain cell featuring a 3T0C write port and a decoupled 2T read port to support data storage and search operations. Second, we present a reconfigurable sense amplifier (RSA) with two different reference voltages that reduces the area overhead of the peripheral circuits and supports logic operations. Together, these allow the 5T0C eDRAM-based CAM to perform both high-density searching and logic operations. We have validated the eDRAM-based CAM array in a 40-nm CMOS process. Postlayout simulation results show that our design achieves over 15% higher memory density than state-of-the-art 6T SRAM. Additionally, it supports maximum frequencies of 637 and 658 MHz for binary CAM (BCAM) searching and logic operations, respectively, while consuming 0.91 and 27.47 fJ/bit at 1.1 V, respectively.
{"title":"A 5T0C eDRAM-Based Content Addressable Memory for High-Density Searching and Logic-in-Memory","authors":"Jincheng Wang;Yuhao Shu;Lintao Lan;Yifei Li;Bin Ning;Yuxin Zhou;Hongtu Zhang;Yajun Ha","doi":"10.1109/TVLSI.2025.3585747","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585747","url":null,"abstract":"With the development of big data, there is an increasing demand for high-density searching, where content-addressable memory (CAM) presents an attractive solution for its ability to perform parallel searches. However, this goal is constrained by the difficulty of further reducing the area of SRAM cells, which is commonly used in traditional CAM implementations. To address this issue, we propose a novel CAM with a compact five-transistor-zero-capacitor (5T0C)-embedded dynamic random access memory (eDRAM) for high-density searching and logic-in-memory applications. First, we propose the 5T0C eDRAM gain cell featuring a 3T0C write port and a decoupled read port of 2T to achieve data storage and searching operations. Second, we present a reconfigurable sense amplifier (RSA) design with two different reference voltages to optimize the area overhead of peripheral circuits and support logic operations. Moreover, the 5T0C eDRAM-based CAM can be employed to achieve high-density searching and logic operations. We have validated the eDRAM-based CAM array in the 40-nm CMOS process. The postlayout simulation results show that our design achieves over 15% higher memory density compared to the state-of-the-art 6T SRAM. Additionally, it supports a maximum frequency of 637 and 658 MHz for binary CAM (BCAM) searching and logic operations, while consuming 0.91 and 27.47 fJ/bit at 1.1 V, respectively.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2497-2507"},"PeriodicalIF":3.1,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
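Functionally, a BCAM compares a search key against every stored word at once and raises one match line per row, while a sense amplifier with two references can turn a simultaneous two-row read into Boolean results. The behavioral sketch below assumes the common dual-reference scheme (bitline level proportional to the number of stored 1s, one reference between 0 and 1 and another between 1 and 2); the article's actual RSA thresholds and cell-level signaling may differ, and the reference values are hypothetical normalized levels.

```python
def bcam_search(rows, key):
    """Match-line vector: True where a stored word equals the search key.

    Hardware compares all rows in parallel; this loop is only a functional model.
    """
    return [row == key for row in rows]


def sense_two_rows(a, b, ref_low=0.5, ref_high=1.5):
    """Dual-reference sensing of two bits read onto one bitline at the same time.

    The idealized bitline level equals the number of stored 1s (0, 1, or 2);
    comparing it with a low and a high reference yields OR and AND directly.
    """
    level = a + b
    or_bit = int(level > ref_low)
    and_bit = int(level > ref_high)
    return {"AND": and_bit, "OR": or_bit, "XOR": or_bit & (1 - and_bit)}
```

The design point this illustrates is reuse: the same sensing path serves search and logic, with only the selected reference changing, which is one plausible reading of how a reconfigurable sense amplifier saves peripheral area.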
Pub Date: 2025-07-16; DOI: 10.1109/TVLSI.2025.3586902
Kari Hepola;Tharaka Ranasinghe Arachchige;Joonas Multanen;Pekka Jääskeläinen
Custom instruction (CI) set extensions can increase performance and energy efficiency for a set of target applications. For rapid prototyping of such application-specific processors, designers use hardware (HW)/software (SW) co-design to create hardware implementations and to retarget the compiler from a high-level description of the instruction set extension. Ideally, the architecture description should be flexible enough to support both hardware generation and compiler retargeting from the same description format. The challenge with these methods lies in coupling the hardware extensions to the processor core, because microarchitecture-specific interfaces lead to low design reuse and increased verification effort. To mitigate these challenges, we introduce a HW/SW co-design toolset that adapts to a user-defined architecture description capturing the semantics of the instruction set extension. From this description, the toolset both retargets the compiler and generates co-processors that connect through the Core-V eXtension interface (CV-X-IF) and the Rocket custom co-processor interface (RoCC), two widely used standard interfaces for RISC-V processors. To demonstrate our methods, we integrate the co-processors with two different variations of the CVA6 and the Rocket core. For the CVA6, the execution time is reduced by up to 40% on average, with an 8% area overhead. For the Rocket core, the execution time is reduced by 27% with a 6% area overhead.
{"title":"Automatically Retargeting Hardware and Code Generation for RISC-V Custom Instructions","authors":"Kari Hepola;Tharaka Ranasinghe Arachchige;Joonas Multanen;Pekka Jääskeläinen","doi":"10.1109/TVLSI.2025.3586902","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3586902","url":null,"abstract":"Custom instruction (CI) set extensions are beneficial for increasing performance and energy efficiency in a set of target applications. For rapid prototyping of these types of application-specific processors, designers leverage hardware (HW)/software (SW) co-design to create hardware implementations and retarget the compiler using a high-level description of the instruction set extension. Ideally, the architecture description should be flexible enough to support both hardware generation and compiler retargeting from the same description format. The challenge with these methods lies in coupling hardware extensions with the processor core, because using microarchitecture-specific interfaces leads to low design reuse and increased verification effort. To mitigate these challenges, we introduce a HW/SW co-design toolset capable of adapting to a user-defined architecture description that captures the instruction set extension semantics. Based on the architecture description, the toolset can both retarget the compiler and generate co-processors interfacing with the Core-V eXtension interface (CV-X-IF) and Rocket custom co-processor interface (RoCC) protocols that are widely used standard interfaces for RISC-V processors. To demonstrate our methods, we integrate the co-processors with two different variations of CVA6 and Rocket core. The resulting execution time reduction is up to 40% on average, with an area overhead of 8% for the CVA6. For the Rocket core, the execution time reduction is 27% with a 6% area overhead.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 10","pages":"2852-2861"},"PeriodicalIF":3.1,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11082109","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
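The single-description idea can be illustrated with a small model in which one record drives both the compiler-side encoder and the hardware-side decoder and reference semantics. The `custom-0` major opcode (0b0001011) and the R-type field layout come from the RISC-V base ISA; the `CustomInstr` class and the example `hdist` instruction are hypothetical, and this is not the toolset's actual description format.

```python
from dataclasses import dataclass
from typing import Callable

CUSTOM_0 = 0b0001011  # major opcode the RISC-V base ISA reserves for custom extensions
MASK32 = 0xFFFFFFFF


@dataclass(frozen=True)
class CustomInstr:
    """Hypothetical single-source description of an R-type custom instruction."""
    name: str
    funct3: int
    funct7: int
    semantics: Callable[[int, int], int]  # reference behavior on 32-bit operands

    def encode(self, rd: int, rs1: int, rs2: int) -> int:
        """Compiler/assembler view: build the 32-bit R-type instruction word."""
        for reg in (rd, rs1, rs2):
            if not 0 <= reg < 32:
                raise ValueError("register index out of range")
        return ((self.funct7 << 25) | (rs2 << 20) | (rs1 << 15)
                | (self.funct3 << 12) | (rd << 7) | CUSTOM_0)

    def matches(self, word: int) -> bool:
        """Hardware decode view: does this instruction word select this extension?"""
        return ((word & 0x7F) == CUSTOM_0
                and (word >> 12) & 0x7 == self.funct3
                and (word >> 25) & 0x7F == self.funct7)

    def execute(self, a: int, b: int) -> int:
        """Golden model usable when verifying the generated co-processor."""
        return self.semantics(a & MASK32, b & MASK32) & MASK32


# Hypothetical example instruction: Hamming distance between two registers.
HDIST = CustomInstr("hdist", funct3=0b000, funct7=0b0000000,
                    semantics=lambda a, b: bin(a ^ b).count("1"))
```

Because encoder, decoder, and semantics all read the same fields, a change to the description cannot leave the compiler and the hardware disagreeing on the encoding, which is the reuse argument the abstract makes; an interface such as CV-X-IF would carry words that the core itself does not decode.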
Pub Date: 2025-07-16; DOI: 10.1109/TVLSI.2025.3585027
Christina Dilopoulou;Yiorgos Tsiatouhas
SRAM-based in-memory computing (IMC) is a promising approach to overcoming the bottleneck of traditional von Neumann architectures, which suffer from data transfer delay and energy inefficiency. As in conventional SRAM arrays, aging phenomena and process variations are serious reliability and lifetime concerns for SRAM-based IMC array architectures. Bias temperature instability (BTI) is a dominant aging mechanism that degrades transistor performance and negatively affects the analog nature of IMC computations. In this work, we present a simulation framework for the joint analysis of the influence of aging and process variations on reliable IMC operation. Through SPICE simulations, we perform an extensive BTI aging analysis of differential input SRAM-based IMC array architectures under different operating conditions and considering process variations. The simulation results show a substantial impact of aging on their reliability. Furthermore, we present an aging mitigation technique that maintains reliability and extends the lifetime of these circuits. Aging mitigation is achieved by periodically reconfiguring the active current paths in the IMC cells, at negligible cost in throughput and power consumption. The simulation results show that, depending on the operating conditions, up to 68% of the IMC circuits can lose accuracy after three years of operation. The aging mitigation technique reduces the percentage of circuits that lose accuracy by up to 72% and decreases their degradation rate, extending their reliable lifetime by more than $9.3\times$.
{"title":"BTI Aging Analysis and Mitigation for Differential Input In-Memory Computing SRAMs","authors":"Christina Dilopoulou;Yiorgos Tsiatouhas","doi":"10.1109/TVLSI.2025.3585027","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585027","url":null,"abstract":"SRAM-based in-memory computing (IMC) is a promising approach to overcome the bottleneck of traditional Von Neumann architectures that suffer from data transfer delay and energy inefficiency. Aging phenomena and process variations are a serious reliability and lifetime concern that may impact SRAM-based IMC array architectures, similar to conventional SRAM arrays. Bias temperature instability (BTI) is a dominant aging mechanism that degrades transistor performance and negatively affects the analog nature of the IMC computations. In this work, we present a simulation framework for the joined analysis of aging and process variation influence on IMC reliable operation. We perform, through SPICE simulations, an extensive BTI aging analysis on differential input SRAM-based IMC array architectures under different operating conditions and considering process variations. The simulation results show a substantial impact of aging on their reliability. Furthermore, we present an aging mitigation technique to maintain reliability and extend the lifetime of these circuits. Aging mitigation is achieved by periodically reconfiguring the active current paths in the IMC cells, with negligible cost on throughput and power consumption. The simulation results show that up to 68% of the IMC circuits can lose accuracy after three operating years, depending on the operating conditions. The aging mitigation technique effectively reduces the percentage of circuits that lose accuracy by up to 72% and decreases their degradation rate, essentially extending by more than $9.3\\times$ their reliable lifetime.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2570-2579"},"PeriodicalIF":3.1,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
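The mitigation idea, spreading stress across alternative current paths so that each transistor conducts for only part of the time, can be illustrated with the widely used power-law approximation ΔVth ≈ A·(duty·t)^n. The prefactor A, the exponent n = 1/6, and the failure threshold in the usage note are hypothetical placeholders. The toy model ignores BTI recovery and the accuracy criterion used in the article, so it does not reproduce the reported 9.3× figure; it only shows the direction of the effect.

```python
def delta_vth(t_years, duty, a=0.02, n=1 / 6):
    """Toy power-law BTI threshold-voltage shift (V) after t_years at a stress duty cycle."""
    return a * (duty * t_years) ** n


def years_to_limit(max_shift, duty, a=0.02, n=1 / 6):
    """Invert the toy model: operating years until delta_vth reaches max_shift."""
    return (max_shift / a) ** (1 / n) / duty


def path_duties(n_paths):
    """Round-robin reconfiguration: each of n_paths conducts 1/n_paths of the time."""
    return [1.0 / n_paths] * n_paths
```

With two alternating paths, `delta_vth(3.0, path_duties(2)[0])` is smaller than the static `delta_vth(3.0, 1.0)` by a factor of 2^(1/6), and `years_to_limit(0.03, 0.5)` is twice `years_to_limit(0.03, 1.0)`; larger gains such as those reported would require effects (recovery, margin to the accuracy threshold) that this sketch leaves out.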
Pub Date: 2025-07-10; DOI: 10.1109/TVLSI.2025.3578450
Narges Mohammadi Sarband;Oksana Moryakova;Håkan Johansson;Oscar Gustafsson
Implementation techniques and results are presented for a recently proposed real-time reconfigurable low-pass equalizer (RLPE) consisting of a variable bandwidth (VBW) filter and a variable equalizer (VE). Both components use fixed finite-length impulse response (FIR) filters combined with a few general multipliers, which results in lower area and power consumption than a general FIR filter despite requiring more multiplications. This is because the constant multipliers in the fixed FIR filters of the RLPE can be optimized for implementation. An additional advantage is that the proposed RLPE does not require online design. Various implementation alternatives for the fixed FIR filters, including ways to increase the operating frequency, are evaluated to optimize the implementation of the RLPE. Several versions of the proposed RLPE, and a general FIR filter for comparison, are implemented using a 28-nm fully depleted silicon-on-insulator (FD-SOI) standard cell library. The results demonstrate that the baseline RLPE design requires less power and area than the general equalizer, and although the baseline implementation reaches a lower frequency, the design can reach the same frequency while still requiring significantly less power and area. Furthermore, an approach is introduced that breaks the chain in the polynomial section of the VBW filter using fewer additional registers than standard pipelining; instead, it reformulates the constant multiplication problem so that the results remain correct. For the considered case, using the proposed RLPE instead of a general FIR filter reduces power consumption by 49% to 70%, depending on the frequency, and area by 64% to 67%.
{"title":"Low-Complexity Implementation of Real-Time Reconfigurable Low-Pass Equalizers","authors":"Narges Mohammadi Sarband;Oksana Moryakova;Håkan Johansson;Oscar Gustafsson","doi":"10.1109/TVLSI.2025.3578450","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3578450","url":null,"abstract":"Implementation techniques and results for a recently proposed real-time reconfigurable low-pass equalizer (RLPE) consisting of a variable bandwidth (VBW) filter and a variable equalizer (VE) are presented. Both components utilize fixed finite-length impulse response (FIR) filters combined with a few general multipliers, resulting in lower area and power consumption compared to a general FIR filter, despite requiring more multiplications. This is because the constant multipliers in the fixed FIR filters of the RLPE can be optimized for implementation. An additional advantage is that the proposed RLPE does not require online design. Various implementation alternatives for fixed FIR filters, including ways to increase the frequency, are evaluated to optimize the implementation of the RLPE. Several versions of the proposed RLPE and a general FIR filter for comparison are implemented using a 28-nm fully depleted silicon on insulator (FD-SOI) standard cell library. The results demonstrate that the RLPE baseline design requires less power and area than the general equalizer, and although the frequency of the baseline implementation is lower, the design can reach the same frequency while still having significantly less power and area. Furthermore, an approach is introduced to break the chain in the polynomial section of the VBW filter by using fewer additional registers compared to standard pipelining. Instead, this method reformulates the constant multiplication problem to produce correct results. For the considered case, the power consumption is reduced between 49% and 70% for different frequencies, with an area decrease in the range of 64%–67%, by using the proposed RLPE compared to a general FIR filter.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 9","pages":"2462-2473"},"PeriodicalIF":3.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11074767","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
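Fixed FIR subfilters combined with a few general multipliers are characteristic of Farrow-type variable filters. Assuming the VBW filter follows that general form, y = Σ_k d^k (G_k * x), the sketch below evaluates it with Horner's rule: each G_k has fixed coefficients (constant multipliers that can be optimized in hardware), and only the Horner chain multiplies by the run-time parameter d. A multiplier chain of this kind is presumably what the "polynomial section" refers to, but the exact RLPE formulation may differ, and the function names are illustrative.

```python
def fir(coeffs, x):
    """Direct-form FIR with zero initial state: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h * x[n - k] for k, h in enumerate(coeffs) if n - k >= 0)
            for n in range(len(x))]


def farrow_filter(subfilters, d, x):
    """Variable filter y = sum_k d**k * (G_k * x), evaluated with Horner's rule.

    Each G_k uses fixed coefficients; only the Horner chain needs general
    multipliers by the run-time parameter d.
    """
    branch = [fir(g, x) for g in subfilters]
    y = branch[-1]
    for b in reversed(branch[:-1]):
        y = [yi * d + bi for yi, bi in zip(y, b)]
    return y
```

Changing `d` retunes the response immediately, with no new coefficient set to compute, which matches the "no online design" advantage stated above. The Horner loop also shows why the chain is a timing bottleneck: each stage depends on the previous multiply-add.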
Pub Date: 2025-07-10; DOI: 10.1109/TVLSI.2025.3585360
Xie He;Dongxu Li
With the increasing demand for high energy efficiency in multiply-accumulate (MAC) operations within deep learning accelerators, computing-in-memory (CIM) has gained significant attention. Time-domain (TD) CIM eliminates the need for analog-to-digital converters (ADCs), but single-bit delay units suffer from low computational efficiency. To address this issue, this work presents a TD multibit-per-unit CIM macro that leverages a precision-configurable time-to-digital converter (TDC) to enable accuracy configurability. Experimental results show that the proposed design realizes a 3-bit delay unit as a multibit CIM unit, with an overall 3-byte weight precision and 8-bit input precision. Compared with three 1-bit/unit CIM delay units combined with an adder, it achieves linearity with a linear offset of less than 3%. In addition, a bias voltage (600 to 900 mV) adjusts the frequency and precision of the circuit, enabling a minimum delay step of 0.11 ns. The system achieves a maximum energy efficiency of 268 TOPS/W across different VDD values, making it a promising solution for always-on edge AI applications.
{"title":"A 3-bit/Unit Time-Domain Compute-In-Memory Macro With Adjustable Unit Delay","authors":"Xie He;Dongxu Li","doi":"10.1109/TVLSI.2025.3585360","DOIUrl":"https://doi.org/10.1109/TVLSI.2025.3585360","url":null,"abstract":"With the increasing demand for high-energy efficiency in multiply-accumulate (MAC) operations within deep learning accelerators, computing-in-memory (CIM) has gained significant attention. Time-domain (TD) CIM eliminates the need for analog-to-digital converters (ADCs), but single-bit delay units suffer from low computational efficiency. To address these issues, this work presents a TD multibit-per-unit CIM macro that leverages a precision-configurable time-to-digital converter (TDC) to enable accuracy configurability. Experimental results show that the proposed design achieves a 3-bit delay unit as a multibit CIM unit and an overall of 3-byte weight precision and 8-bit input precision. Compared to using three 1-bit/unit CIM delay units with an adder, it achieves a linearity with linear offset less than 3%. Besides, bias voltage adjusts the frequency and precision of the circuit (from 600 to 900 mV), enabling a minimum delay step of 0.11 ns. This system achieves a maximum energy efficiency of 268 TOPS/W under different VDD, making it a promising solution for always-on edge AI applications.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 10","pages":"2897-2901"},"PeriodicalIF":3.1,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
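A behavioral model helps show how a time-domain MAC works: each delay unit adds a delay proportional to its 3-bit weight when its input bit is 1, and the TDC converts the accumulated delay back into a digital count. The sketch assumes bit-serial application of the 8-bit inputs with digital shift-and-add, an ideal noiseless delay line, and integer picoseconds; the 110 ps step echoes the 0.11 ns minimum delay step quoted above, while `dropped_lsbs` is a hypothetical stand-in for the precision-configurable TDC.

```python
def delay_chain_ps(input_bits, weights, step_ps):
    """Total chain delay: unit i adds weights[i] * step_ps when its input bit is 1."""
    for w in weights:
        if not 0 <= w <= 7:
            raise ValueError("each unit stores a 3-bit weight (0..7)")
    return sum(w * step_ps for bit, w in zip(input_bits, weights) if bit)


def tdc(delay_ps, step_ps, dropped_lsbs=0):
    """Idealized TDC: count whole steps of step_ps * 2**dropped_lsbs (coarser = cheaper)."""
    return delay_ps // (step_ps << dropped_lsbs)


def td_mac(inputs, weights, step_ps=110, input_bits=8):
    """Bit-serial MAC: one delay-chain pass per input bit, shift-added digitally."""
    acc = 0
    for b in range(input_bits):
        bits = [(x >> b) & 1 for x in inputs]
        acc += tdc(delay_chain_ps(bits, weights, step_ps), step_ps) << b
    return acc
```

With a full-resolution TDC the model reproduces the exact dot product, for example `td_mac([200, 17, 255], [3, 7, 5])` equals 200·3 + 17·7 + 255·5; raising `dropped_lsbs` trades accuracy for a coarser, cheaper conversion, which is the kind of precision configurability the abstract describes.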