
IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Articles

IEEE Transactions on Very Large Scale Integration (VLSI) Systems Publication Information
IF 2.8 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-25 DOI: 10.1109/TVLSI.2025.3587928
Vol. 33, No. 8, pp. C2-C2. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11096975
Citations: 0
IEEE Transactions on Very Large Scale Integration (VLSI) Systems Society Information
IF 2.8 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-25 DOI: 10.1109/TVLSI.2025.3587930
Vol. 33, No. 8, pp. C3-C3. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11096974
Citations: 0
High-Precision Low-Latency Method and Architecture for Computing Binary and Decimal Logarithms
IF 3.1 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-24 DOI: 10.1109/TVLSI.2025.3590597
Hui Chen;Lianghua Quan;Weiqiang Liu;Zhonghai Lu
Binary and decimal logarithms (BDLs) are commonly used in science and engineering. This brief presents a theory of the radix-4 generalized hyperbolic coordinate rotation digital computer (GH-CORDIC) to compute them directly. Compared with the traditional hyperbolic CORDIC (TH-CORDIC), the two logarithms can be calculated without extra dividers or multipliers. Compared with the GH-CORDIC, this theory needs fewer iterations at the same high precision. Theoretical derivation and software simulation show that the calculation accuracy reaches the order of $10^{-7}$ and that the number of iterations is reduced by more than 50%. In hardware, the synthesis report shows that the proposed architecture saves 53.44% area and 46.36% power consumption compared with the latest radix-2 GH-CORDIC method.
Vol. 33, No. 11, pp. 3186-3190.
Citations: 0
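As a point of reference for the logarithm brief above, the following is a minimal Python sketch of the textbook radix-2 hyperbolic CORDIC in vectoring mode. It illustrates the traditional (TH-CORDIC) baseline, not the authors' radix-4 GH-CORDIC, whose details are not given in the abstract. The function names, the default of 24 iterations, and the use of floating-point arithmetic are assumptions made for illustration. Note the final scaling by a constant, which is the kind of extra multiplication the abstract says the proposed method avoids.

```python
import math

def hyperbolic_cordic_ln(w, iterations=24):
    """Approximate ln(w) with radix-2 hyperbolic CORDIC (vectoring mode).

    Uses the identity ln(w) = 2 * atanh((w - 1) / (w + 1)).
    Intended for w roughly in [0.5, 2], well inside the convergence range.
    """
    x, y, z = w + 1.0, w - 1.0, 0.0
    i = 1            # current shift index
    repeat_at = 4    # indices 4, 13, 40, ... must be executed twice
    done = 0
    while done < iterations:
        d = 1.0 if y < 0 else -1.0          # rotate so that y is driven to 0
        shift = 2.0 ** -i
        x, y, z = (x + d * y * shift,
                   y + d * x * shift,
                   z - d * math.atanh(shift))
        done += 1
        if i == repeat_at:
            repeat_at = 3 * repeat_at + 1   # schedule the next repeated index
        else:
            i += 1
    return 2.0 * z

def cordic_log2(w, iterations=24):
    # Extra constant multiplication needed by the traditional approach.
    return hyperbolic_cordic_ln(w, iterations) / math.log(2.0)

def cordic_log10(w, iterations=24):
    return hyperbolic_cordic_ln(w, iterations) / math.log(10.0)
```

In hardware the `atanh` values would come from a small lookup table and the multiplications by `2**-i` would be shifts; the sketch keeps them as library calls for readability.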
Leveraging IO Pad Protection Diodes for Recycled IC Detection and Age Estimation Using Polynomial Regression
IF 3.1 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-24 DOI: 10.1109/TVLSI.2025.3590317
Anmol Singh Narwariya;Srisubha Kalanadhabhatta;Amit Acharyya
The presence of counterfeit recycled ICs (CRICs) in the global semiconductor supply chain is a major concern today. CRICs are less reliable than new parts and have become a serious threat to the ICs employed in safety-critical systems. Accurate age prediction for integrated circuits (ICs) is crucial for implementing preventative and mitigation strategies that avoid unexpected failures in the field. By precisely estimating the age of an IC, electronic systems can benefit from improved reliability and performance, because maintenance and replacements can be scheduled proactively, which reduces the risk of sudden breakdowns. Furthermore, accurate age prediction plays a vital role in extending the lifespan of electronic devices, which in turn helps to minimize electronic waste. This not only reduces the environmental impact but also supports the broader goal of green computing by promoting more sustainable and resource-efficient technology practices. In this article, we introduce a method for detecting a CRIC and estimating its age by utilizing the existing input-output (IO) pad structures, targeting sensorless chips. The proposed methodology estimates age by measuring the voltage drop across the protection diodes present in the IO pad structure and applying this voltage drop to the proposed polynomial regression model. This methodology requires no additional sensing circuit, resulting in no area overhead. As there is no requirement for a special on-chip sensor, the proposed methodology can be used to detect the age of an IC in production. Our proposed polynomial regression model achieves a mean squared error (MSE) of 1.77 h, with a minimum improvement of 99.7% over the state-of-the-art methodologies.
Vol. 33, No. 11, pp. 3166-3175.
Citations: 0
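The regression step described in the abstract above (voltage drop in, estimated age out) can be illustrated with a small least-squares polynomial fit. This is only a sketch under stated assumptions: the polynomial degree, the helper names, and all data values below are hypothetical and synthetic, and they are not taken from the article.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns coefficients c with y ~ c[0] + c[1]*x + ..."""
    n = degree + 1
    # Normal equations A c = b with A[j][k] = sum(x^(j+k)), b[j] = sum(y * x^j).
    a = [[sum(x ** (j + k) for x in xs) for k in range(n)] for j in range(n)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, n))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs

def predict(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))

def mean_squared_error(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic, hypothetical calibration data: shift in diode voltage drop (mV)
# versus known stress time (hours), generated from 2 + 3*x + 0.5*x^2.
drop_shift_mv = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
age_hours = [2.0 + 3.0 * x + 0.5 * x * x for x in drop_shift_mv]

model = fit_polynomial(drop_shift_mv, age_hours, degree=2)
estimated_age = predict(model, 2.5)  # age estimate for an unseen measurement
```

A real flow would calibrate on measured devices and validate the error on held-out parts; the pure-Python solver is used here only to keep the sketch dependency-free.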
Highly Stable Reconfigurable TERO PUF Architecture for Hardware Security Applications
IF 3.1 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-24 DOI: 10.1109/TVLSI.2025.3587502
Kevin Vicuña;Massimo Vatalaro;Frédéric Amiel;Felice Crupi;Lionel Trojman
This work introduces a novel 128-bit transient effect ring oscillator (TERO)-based physically unclonable function (PUF) designed for Intel MAX 10 field-programmable gate arrays (FPGAs). A reliable PUF solution suitable for security applications targeting high stability and area efficiency is presented. The proposed cell consists of two cross-coupled reconfigurable ring oscillators (ROs) aiming to achieve zero observed instability both at golden key (GK) conditions and under temperature variations. In contrast to conventional application-specific integrated circuit (ASIC) approaches, which use the mean cycles to collapse (CTC), here the calibration process was performed by considering the CTC standard deviation extracted at GK conditions, namely, 1.2 V and $25\,^{\circ}$C. The experimental results demonstrate that, after the calibration process and with 1.64% of the bits masked, the proposed solution shows a bit error rate (BER) lower than $\mathbf{1.56\times 10^{-4}\,\%}$, the minimum observable quantity for the adopted statistical set, across the entire analyzed temperature range. Further, the solution also shows an excellent uniqueness of 49.78%, close to the ideal value of 50%. This is achieved at the cost of two logic array blocks (LABs) per bit.
Vol. 33, No. 10, pp. 2873-2882. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11095825
Citations: 0
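The two figures of merit quoted in the PUF abstract above, uniqueness and bit error rate, are commonly computed from Hamming distances. The sketch below uses those usual definitions (average inter-device distance for uniqueness, average distance between repeated readouts and the golden key for BER); whether the article uses exactly these formulas is an assumption, and the function names and toy 4-bit responses are hypothetical.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def uniqueness_percent(responses):
    """Mean pairwise inter-device Hamming distance, as a percentage of bits.

    `responses` holds one golden-key bit list per device; 50% is ideal.
    """
    n_bits = len(responses[0])
    pairs = list(combinations(responses, 2))
    total = sum(hamming_distance(a, b) for a, b in pairs)
    return 100.0 * total / (len(pairs) * n_bits)

def bit_error_rate_percent(golden, readouts):
    """Fraction of bits that differ from the golden key over repeated readouts."""
    errors = sum(hamming_distance(golden, r) for r in readouts)
    return 100.0 * errors / (len(readouts) * len(golden))

# Toy example with 4-bit responses from three hypothetical devices.
devices = [[0, 0, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]]
golden_key = [1, 0, 1, 1]
repeated = [[1, 0, 1, 1], [1, 0, 0, 1]]  # second readout has one flipped bit

toy_uniqueness = uniqueness_percent(devices)
toy_ber = bit_error_rate_percent(golden_key, repeated)
```

With a finite number of compared bits, the smallest nonzero BER that can be observed is one error divided by the total number of bits compared, which is why a "minimum observable quantity" bounds a zero-error measurement.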
A 5T0C eDRAM-Based Content Addressable Memory for High-Density Searching and Logic-in-Memory
IF 3.1 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-22 DOI: 10.1109/TVLSI.2025.3585747
Jincheng Wang;Yuhao Shu;Lintao Lan;Yifei Li;Bin Ning;Yuxin Zhou;Hongtu Zhang;Yajun Ha
With the development of big data, there is an increasing demand for high-density searching, where content-addressable memory (CAM) presents an attractive solution for its ability to perform parallel searches. However, this goal is constrained by the difficulty of further reducing the area of SRAM cells, which are commonly used in traditional CAM implementations. To address this issue, we propose a novel CAM with a compact five-transistor-zero-capacitor (5T0C) embedded dynamic random access memory (eDRAM) for high-density searching and logic-in-memory applications. First, we propose the 5T0C eDRAM gain cell, featuring a 3T0C write port and a decoupled 2T read port, to achieve data storage and searching operations. Second, we present a reconfigurable sense amplifier (RSA) design with two different reference voltages to optimize the area overhead of peripheral circuits and support logic operations. Moreover, the 5T0C eDRAM-based CAM can be employed to achieve high-density searching and logic operations. We have validated the eDRAM-based CAM array in a 40-nm CMOS process. The postlayout simulation results show that our design achieves over 15% higher memory density compared to the state-of-the-art 6T SRAM. Additionally, it supports maximum frequencies of 637 and 658 MHz for binary CAM (BCAM) searching and logic operations, while consuming 0.91 and 27.47 fJ/bit at 1.1 V, respectively.
Vol. 33, No. 9, pp. 2497-2507.
Citations: 0
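To make the two operating modes in the CAM abstract above concrete, here is a purely behavioral Python model of a binary CAM: a search compares the key against every stored row (in hardware, all rows at once) and reports the matching rows, and a logic-in-memory operation combines two stored rows bitwise. The class name, method names, and the particular set of logic operations are assumptions for illustration; the abstract does not list which operations the macro supports, and nothing here models the 5T0C cell or the sense amplifier.

```python
class BinaryCam:
    """Behavioral model of a binary content-addressable memory."""

    def __init__(self, width):
        self.width = width   # word width in bits
        self.rows = []       # stored words, one integer per row

    def write(self, word):
        """Store a word and return its row index."""
        if not 0 <= word < (1 << self.width):
            raise ValueError("word does not fit in the configured width")
        self.rows.append(word)
        return len(self.rows) - 1

    def search(self, key):
        """Return the indices of all rows whose content equals the key.

        Hardware evaluates every match line in parallel; the loop here is
        only a sequential stand-in for that behavior.
        """
        return [i for i, word in enumerate(self.rows) if word == key]

    def logic(self, row_a, row_b, op):
        """Bitwise operation between two stored rows (assumed operation set)."""
        a, b = self.rows[row_a], self.rows[row_b]
        mask = (1 << self.width) - 1
        results = {
            "and": a & b,
            "or": a | b,
            "xor": a ^ b,
            "nor": ~(a | b) & mask,
        }
        return results[op]

cam = BinaryCam(width=8)
cam.write(0b10101010)
cam.write(0b11110000)
cam.write(0b10101010)
matches = cam.search(0b10101010)   # rows holding the searched pattern
```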
Automatically Retargeting Hardware and Code Generation for RISC-V Custom Instructions
IF 3.1 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-16 DOI: 10.1109/TVLSI.2025.3586902
Kari Hepola;Tharaka Ranasinghe Arachchige;Joonas Multanen;Pekka Jääskeläinen
Custom instruction (CI) set extensions are beneficial for increasing performance and energy efficiency in a set of target applications. For rapid prototyping of these types of application-specific processors, designers leverage hardware (HW)/software (SW) co-design to create hardware implementations and retarget the compiler using a high-level description of the instruction set extension. Ideally, the architecture description should be flexible enough to support both hardware generation and compiler retargeting from the same description format. The challenge with these methods lies in coupling hardware extensions with the processor core, because using microarchitecture-specific interfaces leads to low design reuse and increased verification effort. To mitigate these challenges, we introduce a HW/SW co-design toolset capable of adapting to a user-defined architecture description that captures the instruction set extension semantics. Based on the architecture description, the toolset can both retarget the compiler and generate co-processors that interface with the Core-V eXtension interface (CV-X-IF) and Rocket custom co-processor interface (RoCC) protocols, which are widely used standard interfaces for RISC-V processors. To demonstrate our methods, we integrate the co-processors with two different variations of CVA6 and Rocket core. The resulting execution time reduction is up to 40% on average, with an area overhead of 8% for the CVA6. For the Rocket core, the execution time reduction is 27% with a 6% area overhead.
Vol. 33, No. 10, pp. 2852-2861. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11082109
Citations: 0
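For readers unfamiliar with how a custom instruction reaches a co-processor, the sketch below packs an R-type RISC-V instruction word in the custom-0 major opcode space, which the base ISA reserves for extensions. It only illustrates the standard R-type field layout; the register numbers and function codes are hypothetical, and the code says nothing about what the toolset in the article above actually emits.

```python
CUSTOM_0 = 0b0001011  # RISC-V "custom-0" major opcode (0x0B)

def encode_r_type(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack an R-type instruction: funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    fields = [
        ("opcode", opcode, 7),
        ("rd", rd, 5),
        ("funct3", funct3, 3),
        ("rs1", rs1, 5),
        ("rs2", rs2, 5),
        ("funct7", funct7, 7),
    ]
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

# Hypothetical custom operation: rd = x10, rs1 = x11, rs2 = x12.
word = encode_r_type(CUSTOM_0, rd=10, funct3=0, rs1=11, rs2=12, funct7=0)
word_hex = f"{word:08x}"
```

A compiler retargeted for the extension would emit such words directly; without compiler support, the same encoding is typically injected through inline assembly.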
BTI Aging Analysis and Mitigation for Differential Input In-Memory Computing SRAMs
IF 3.1 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-16 DOI: 10.1109/TVLSI.2025.3585027
Christina Dilopoulou;Yiorgos Tsiatouhas
SRAM-based in-memory computing (IMC) is a promising approach to overcome the bottleneck of traditional Von Neumann architectures, which suffer from data transfer delay and energy inefficiency. Aging phenomena and process variations are a serious reliability and lifetime concern that may impact SRAM-based IMC array architectures, similar to conventional SRAM arrays. Bias temperature instability (BTI) is a dominant aging mechanism that degrades transistor performance and negatively affects the analog nature of the IMC computations. In this work, we present a simulation framework for the joint analysis of the influence of aging and process variation on reliable IMC operation. We perform, through SPICE simulations, an extensive BTI aging analysis on differential input SRAM-based IMC array architectures under different operating conditions and considering process variations. The simulation results show a substantial impact of aging on their reliability. Furthermore, we present an aging mitigation technique to maintain reliability and extend the lifetime of these circuits. Aging mitigation is achieved by periodically reconfiguring the active current paths in the IMC cells, with negligible cost on throughput and power consumption. The simulation results show that up to 68% of the IMC circuits can lose accuracy after three operating years, depending on the operating conditions. The aging mitigation technique effectively reduces the percentage of circuits that lose accuracy by up to 72% and decreases their degradation rate, essentially extending their reliable lifetime by more than $9.3\times$.
Vol. 33, No. 9, pp. 2570-2579.
Citations: 0
Low-Complexity Implementation of Real-Time Reconfigurable Low-Pass Equalizers
IF 3.1 Zone 2 Engineering & Technology Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-10 DOI: 10.1109/TVLSI.2025.3578450
Narges Mohammadi Sarband;Oksana Moryakova;Håkan Johansson;Oscar Gustafsson
Implementation techniques and results for a recently proposed real-time reconfigurable low-pass equalizer (RLPE) consisting of a variable bandwidth (VBW) filter and a variable equalizer (VE) are presented. Both components utilize fixed finite-length impulse response (FIR) filters combined with a few general multipliers, resulting in lower area and power consumption compared to a general FIR filter, despite requiring more multiplications. This is because the constant multipliers in the fixed FIR filters of the RLPE can be optimized for implementation. An additional advantage is that the proposed RLPE does not require online design. Various implementation alternatives for fixed FIR filters, including ways to increase the frequency, are evaluated to optimize the implementation of the RLPE. Several versions of the proposed RLPE and a general FIR filter for comparison are implemented using a 28-nm fully depleted silicon on insulator (FD-SOI) standard cell library. The results demonstrate that the RLPE baseline design requires less power and area than the general equalizer, and although the frequency of the baseline implementation is lower, the design can reach the same frequency while still having significantly less power and area. Furthermore, an approach is introduced to break the chain in the polynomial section of the VBW filter by using fewer additional registers compared to standard pipelining. Instead, this method reformulates the constant multiplication problem to produce correct results. For the considered case, the power consumption is reduced between 49% and 70% for different frequencies, with an area decrease in the range of 64%–67%, by using the proposed RLPE compared to a general FIR filter.
介绍了一种由可变带宽(VBW)滤波器和可变均衡器(VE)组成的实时可重构低通均衡器(RLPE)的实现技术和结果。这两种元件都使用固定的有限长度脉冲响应(FIR)滤波器与一些通用乘法器相结合,尽管需要更多的乘法器,但与通用FIR滤波器相比,其面积和功耗更低。这是因为RLPE的固定FIR滤波器中的常数乘法器可以优化实现。另一个优点是,RLPE不需要在线设计。评估了固定FIR滤波器的各种实现方案,包括提高频率的方法,以优化RLPE的实现。几个版本的RLPE和一个通用FIR滤波器进行比较,使用28纳米完全耗尽绝缘体上硅(FD-SOI)标准电池库实现。结果表明,RLPE基准设计比一般均衡器需要更少的功率和面积,尽管基准实现的频率较低,但设计可以在功耗和面积显著减少的情况下达到相同的频率。此外,与标准流水线相比,引入了一种方法,通过使用更少的额外寄存器来打破VBW滤波器多项式部分的链。相反,这种方法重新表述了常数乘法问题,以产生正确的结果。对于所考虑的情况,与一般FIR滤波器相比,使用所提出的RLPE,不同频率的功耗降低了49%至70%,面积减少了64%至67%。
Narges Mohammadi Sarband; Oksana Moryakova; Håkan Johansson; Oscar Gustafsson. "Low-Complexity Implementation of Real-Time Reconfigurable Low-Pass Equalizers." IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 9, pp. 2462-2473. IF 3.1. Pub Date: 2025-07-10. DOI: https://doi.org/10.1109/TVLSI.2025.3578450. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11074767
Citations: 0
A 3-bit/Unit Time-Domain Compute-In-Memory Macro With Adjustable Unit Delay
IF 3.1 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2025-07-10 DOI: 10.1109/TVLSI.2025.3585360
Xie He;Dongxu Li
With the increasing demand for high energy efficiency in multiply-accumulate (MAC) operations within deep learning accelerators, computing-in-memory (CIM) has gained significant attention. Time-domain (TD) CIM eliminates the need for analog-to-digital converters (ADCs), but single-bit delay units suffer from low computational efficiency. To address this, this work presents a TD multibit-per-unit CIM macro that uses a precision-configurable time-to-digital converter (TDC) to make accuracy configurable. Experimental results show that the proposed design realizes a 3-bit delay unit as a multibit CIM unit, with an overall 3-byte weight precision and 8-bit input precision. Compared to using three 1-bit/unit CIM delay units with an adder, it achieves linearity with a linear offset of less than 3%. In addition, a bias voltage (from 600 to 900 mV) adjusts the frequency and precision of the circuit, enabling a minimum delay step of 0.11 ns. The system achieves a maximum energy efficiency of 268 TOPS/W under different VDD, making it a promising solution for always-on edge AI applications.
Xie He; Dongxu Li. "A 3-bit/Unit Time-Domain Compute-In-Memory Macro With Adjustable Unit Delay." IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 10, pp. 2897-2901. IF 3.1. Pub Date: 2025-07-10. DOI: https://doi.org/10.1109/TVLSI.2025.3585360
Citations: 0