Pub Date: 2024-09-24 DOI: 10.1109/TVLSI.2024.3455091
Dingyang Zou;Gaoche Zhang;Xu Zhang;Meiqi Wang;Zhongfeng Wang
Driven by the demand for high energy efficiency in deep neural network (DNN) accelerators, computing-in-memory (CIM) has become increasingly popular in recent years. However, current CIM designs suffer from high latency and insufficient flexibility. To address these issues, this brief proposes a Booth-multiplication-based CIM macro (BCIM) with a modified Booth encoding and partial product (PP) generation method designed specifically for CIM architectures. In addition, a methodology is presented for designing precision-reconfigurable digital CIM macros. We also optimize the precision-reconfigurable shift adder in the macro by cutting down carry connections. The design attains a performance of 2048 GOPS and a peak energy efficiency of 79.15 TOPS/W in signed INT4 mode at a frequency of 500 MHz.
Title: An Efficient and Precision-Reconfigurable Digital CIM Macro for DNN Accelerators. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 563-567.
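The Booth-based partial product generation named in the abstract above can be illustrated with textbook radix-4 Booth recoding. The sketch below is a software model only. It assumes the standard recoding rather than the modified, CIM-specific encoding of the BCIM macro, which the abstract does not detail, and the function names are hypothetical.

```python
def booth_radix4_digits(multiplier: int, bits: int) -> list[int]:
    """Recode a signed two's-complement multiplier into radix-4 Booth digits.

    Returns bits/2 digits, least significant first, each in {-2, -1, 0, 1, 2}.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even for radix-4 recoding")
    if not -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")

    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    padded = pattern << 1                     # implicit 0 below the LSB
    digits = []
    for i in range(0, bits, 2):
        window = (padded >> i) & 0b111        # bits y[i+1], y[i], y[i-1]
        y_hi = (window >> 2) & 1
        y_mid = (window >> 1) & 1
        y_lo = window & 1
        digits.append(-2 * y_hi + y_mid + y_lo)
    return digits


def booth_multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Sum one partial product per Booth digit, weighted by 4**k."""
    product = 0
    for k, digit in enumerate(booth_radix4_digits(multiplier, bits)):
        partial_product = digit * multiplicand  # 0, +/-x, or +/-2x
        product += partial_product * (4 ** k)
    return product


if __name__ == "__main__":
    print(booth_radix4_digits(-3, 4))  # [1, -1]
    print(booth_multiply(7, -3, 4))    # -21
```

For a signed INT4 multiplier the recoding yields two digits, so two partial products are summed instead of four. As a side calculation, 2048 GOPS at 500 MHz corresponds to 2048e9 / 500e6 = 4096 operations per clock cycle.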
Pub Date: 2024-09-24 DOI: 10.1109/TVLSI.2024.3458872
Yazheng Tu;Shi Bai;Jinjun Xiong;Jiafeng Xie
Postquantum cryptography (PQC) based on Nth-degree truncated polynomial ring units (NTRU) has drawn significant attention from the research community; for example, the National Institute of Standards and Technology (NIST) PQC standardization process selected the NTRU-based algorithm Fast Fourier lattice-based compact (Falcon). Following this trend, efficient hardware accelerator design for polynomial multiplication (an important component of NTRU-based PQC) is crucial. Unlike the commonly used number theoretic transform (NTT) method, this article presents a novel SChoolbook-Originated Polynomial multiplication accElerators (SCOPE) design framework. Overall, we formulate the schoolbook-based method in an innovative way to implement the targeted polynomial multiplication, first through a schoolbook-variant version and then through a Toeplitz matrix-vector product (TMVP)-based approach. Four layers of coherent and interdependent efforts have been carried out: 1) a novel lookup table (LUT)-based point-wise multiplier is proposed along with a related modular reduction technique to obtain optimal implementation; 2) a new hardware accelerator is introduced for the targeted polynomial multiplication, deploying the proposed point-wise multiplier; 3) the proposed architecture is extended to a TMVP-based polynomial multiplication accelerator; and 4) the efficiency of the proposed accelerators is demonstrated through implementation and comparison. Finally, the proposed design strategy is also extended to another NTRU-based scheme and to other schoolbook- and Toom-Cook-based polynomial multiplications (used in other PQC schemes), where it obtains the same superior performance. We hope that the outcome of this research can impact the ongoing NIST PQC standardization process and related full-hardware implementation work for schemes like Falcon.
Title: SCOPE: Schoolbook-Originated Novel Polynomial Multiplication Accelerators for NTRU-Based PQC. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 408-420.
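The relationship between schoolbook polynomial multiplication and the Toeplitz matrix-vector product (TMVP) mentioned in the abstract above can be sketched in software. The example assumes the ring Z_q[x]/(x^n + 1) with q = 12289 (the modulus used by Falcon); the abstract itself does not state the ring, and the function names are hypothetical. It is a reference model, not the SCOPE hardware architecture.

```python
def schoolbook_negacyclic(a: list[int], b: list[int], q: int) -> list[int]:
    """Multiply a and b in Z_q[x]/(x^n + 1) with the O(n^2) schoolbook method."""
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length")
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                # x^n = -1, so terms that wrap around change sign
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c


def toeplitz_matrix(a: list[int], q: int) -> list[list[int]]:
    """Build the n x n Toeplitz matrix T such that T * b equals a * b in the ring."""
    n = len(a)
    return [
        [a[r - col] % q if r >= col else (-a[n + r - col]) % q for col in range(n)]
        for r in range(n)
    ]


def tmvp_multiply(a: list[int], b: list[int], q: int) -> list[int]:
    """Compute the same product as a Toeplitz matrix-vector product."""
    n = len(a)
    t = toeplitz_matrix(a, q)
    return [sum(t[r][col] * b[col] for col in range(n)) % q for r in range(n)]


if __name__ == "__main__":
    q = 12289
    # (1 + x)^2 = 1 + 2x + x^2 = 2x in Z_q[x]/(x^2 + 1)
    print(schoolbook_negacyclic([1, 1], [1, 1], q))  # [0, 2]
    print(tmvp_multiply([1, 1], [1, 1], q))          # [0, 2]
```

Each entry of the matrix depends only on the difference between its row and column index, which is the Toeplitz structure that TMVP-based approaches exploit.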
Pub Date: 2024-09-23 DOI: 10.1109/TVLSI.2024.3454431
Henri Lunnikivi;Roni Hämäläinen;Timo D. Hämäläinen
The integration of large-scale systems-on-chip warrants thorough verification both at the level of the individual component and at the system level. In this article, we address the automated testing of system-level memory maps. The golden reference is the IEEE 1685/IP-XACT hardware description, which includes implementation-agnostic definitions for the global memory map. The IP-XACT description is used as a specification for implementing the registers and memory regions in a register-transfer-level (RTL) language, and for implementing the corresponding hardware-dependent software. The challenge is that hardware design changes might not always propagate to firmware and application developers, which causes errors and faults. We present a method and a tool called Keelhaul, which takes as input the CMSIS-SVD format commonly used for firmware development and generates automated software tests that attempt to access all available memory-mapped input/output registers. During development of a large-scale research-focused multiprocessor system-on-chip, we ran a total of 32 automatically generated test suites per pipeline comprising 882 test cases for each of its two CPU subsystems. A total of 15 distinct issues were found by the tool in the lead-up to tapeout. Another research-focused SoC was validated post-tapeout with 984 test cases generated for each core, resulting in the discovery of four distinct issues. Keelhaul can be used with any IP-XACT or CMSIS-SVD-based systems-on-chip that include processors for accessing implemented registers and memory regions.
Title: Keelhaul: Processor-Driven Chip Connectivity and Memory Map Metadata Validator for Large Systems-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 12, pp. 2269-2280.
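The idea of deriving register access tests from a CMSIS-SVD description, as outlined in the abstract above, can be sketched as follows. This is not Keelhaul's implementation: it is a minimal illustration that assumes a simplified SVD file (no `derivedFrom`, clusters, or register arrays), treats a missing `<access>` element as read-write, and uses a made-up device and hypothetical function names.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified CMSIS-SVD fragment for illustration only.
SAMPLE_SVD = """<device>
  <name>DEMO_SOC</name>
  <peripherals>
    <peripheral>
      <name>UART0</name>
      <baseAddress>0x40001000</baseAddress>
      <registers>
        <register>
          <name>DATA</name>
          <addressOffset>0x0</addressOffset>
          <access>read-write</access>
        </register>
        <register>
          <name>STATUS</name>
          <addressOffset>0x4</addressOffset>
          <access>read-only</access>
        </register>
      </registers>
    </peripheral>
  </peripherals>
</device>"""


def enumerate_register_cases(svd_text: str) -> list[tuple[str, int, str]]:
    """List (qualified name, absolute address, access) for every register."""
    root = ET.fromstring(svd_text)
    cases = []
    for peripheral in root.iterfind("./peripherals/peripheral"):
        peripheral_name = peripheral.findtext("name")
        base = int(peripheral.findtext("baseAddress"), 0)  # accepts 0x... or decimal
        for register in peripheral.iterfind("./registers/register"):
            offset = int(register.findtext("addressOffset"), 0)
            # Assumption: default to read-write when no access is given here.
            access = register.findtext("access", default="read-write")
            name = f"{peripheral_name}.{register.findtext('name')}"
            cases.append((name, base + offset, access))
    return cases


if __name__ == "__main__":
    for name, address, access in enumerate_register_cases(SAMPLE_SVD):
        print(f"{name}: 0x{address:08X} ({access})")
```

Each tuple is one candidate test case: a generated test would read (and, where the access type allows, write) the address on the target processor and report any access that faults.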
Pub Date: 2024-09-20 DOI: 10.1109/TVLSI.2024.3453946
Yuxin Ji;Yuhang Zhang;Changyan Chen;Jian Zhao;Fakhrul Zaman Rokhani;Yehea Ismail;Yongfu Li
Data-retention flip-flops (DR-FFs) efficiently maintain data during sleep mode and retain state during transitions between active and sleep modes. This brief proposes an ultralow-power DR-FF design with an improved autonomous data-retention (ADR) latch that operates at supply voltages down to the near-/subthreshold region, achieving a sleep-mode leakage power of 12.2 pW, $1.4\times$–$3.8\times$ less than prior CMOS DR-FFs. The proposed DR-FFs achieve the best active-mode switching efficiency of 36.5 fJ/step, $1.2\times$–$4\times$ lower than prior works, and a comparable transition efficiency of 1.9 fJ/step. Furthermore, the proposed DR-FFs require minimal control signals, logic gates, and switches, significantly reducing design complexity and avoiding the drawbacks of nonvolatile data-retention FFs (NV-FFs).
Title: A 0.4 V, 12.2 pW Leakage, 36.5 fJ/Step Switching Efficiency Data Retention Flip-Flop in 22 nm FDSOI. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 573-577.
Pub Date: 2024-09-19 DOI: 10.1109/TVLSI.2024.3448374
Andres Ayes;Eby G. Friedman
Power delivery in three-dimensional (3-D) integrated systems poses several challenges, such as high current densities, large voltage drops due to multiple levels of resistive vertical interconnect, and significant switching noise originating from transient currents within different layers. Voltage stacking is a power delivery technique that is highly compatible with 3-D integration due to the physical proximity between layers, enabling the efficient transfer of recycled current. Power noise in clock networks is, however, not inherently addressed by 3-D voltage stacking. In this brief, a quasi-adiabatic technique between multiple clock networks within 3-D voltage stacked systems is proposed. The technique exploits the proximity of the clock networks to enable mutual charging and discharging when the clock signals transition to the same voltage. During this transition, the clock distribution networks are isolated from the power grid, reducing simultaneous switching noise and current load. The maximum current is reduced by an additional 13% compared to voltage stacking alone, the maximum voltage noise is reduced by up to 72% when the clock networks are isolated from the power grids, and the clock networks pull nearly 50% less charge from the source. The proposed technique is evaluated on a 7 nm predictive technology model.
Title: Quasi-Adiabatic Clock Networks in 3-D Voltage Stacked Systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 12, pp. 2394-2397.
High-resolution successive approximation register (SAR) analog-to-digital converters (ADCs) commonly need to calibrate their bit weights. Due to the nonidealities of the calibration circuits, the calibrated bit weights carry errors, and these errors can propagate during the calibration procedure. Because of the high precision requirement of these ADCs, such residue error commonly becomes the signal-to-noise-and-distortion ratio (SNDR) bottleneck of the overall ADC. This article presents an analysis of the residue error from bit weight self-calibration methods of high-resolution SAR ADCs. The major sources contributing to this error and the error reduction methods are quantitatively analyzed. A statistical analysis of the noise-induced random error is developed. Our statistical model finds that the noise-induced random error follows a chi-square distribution. In practice, this random error is commonly reduced by repetitively measuring and averaging the calibrated bit weights. Our statistical model quantifies this bit weight error and leads to a clearer understanding of the error mechanism and design trade-offs. Following our chi-square model, the SNDR degradation due to circuit noise during calibration can be easily estimated without going through the time-consuming traditional transistor-level design and simulation process. The required repetition time can also be calculated. The bit-weight error models derived in this article are verified with measurements on a 16-bit SAR ADC design in a 180-nm CMOS process. Results from our model match both simulations and measurements well.
Title: The Error Analysis of Bit Weight Self-Calibration Methods for High-Resolution SAR ADCs. Authors: Yanhang Chen;Siji Huang;Qifeng Huang;Yifei Fan;Jie Yuan. Pub Date: 2024-09-19. DOI: 10.1109/TVLSI.2024.3458071. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 11, pp. 1983-1992.
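The effect of repeated measurement and averaging described in the abstract above can be illustrated with a small Monte Carlo experiment. The sketch assumes a single bit weight whose measurement is corrupted by zero-mean Gaussian noise of standard deviation `sigma`. This is an illustrative assumption, not the article's full error model, and the function names are hypothetical. Under that assumption the averaged error has variance sigma^2 / M for M repetitions, and its squared value normalized by that variance follows a chi-square distribution with one degree of freedom (mean 1).

```python
import random
import statistics


def averaged_weight_error(sigma: float, repeats: int, rng: random.Random) -> float:
    """Error left in a bit-weight estimate after averaging `repeats` noisy measurements."""
    return statistics.fmean(rng.gauss(0.0, sigma) for _ in range(repeats))


def mean_squared_error(sigma: float, repeats: int, trials: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of the mean squared residual error."""
    rng = random.Random(seed)
    return statistics.fmean(
        averaged_weight_error(sigma, repeats, rng) ** 2 for _ in range(trials)
    )


if __name__ == "__main__":
    sigma = 1.0
    for repeats in (1, 4, 16):
        mse = mean_squared_error(sigma, repeats)
        normalized = mse / (sigma ** 2 / repeats)  # expected to be close to 1
        print(f"M={repeats:2d}  mse={mse:.4f}  normalized={normalized:.3f}")
```

In this simplified picture, every fourfold increase in the number of repetitions lowers the residual noise power by about 6 dB, which is the kind of trade-off between calibration time and SNDR that the abstract refers to.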
Pub Date: 2024-09-18 DOI: 10.1109/TVLSI.2024.3439231
Duy-Thanh Nguyen;Abhiroop Bhattacharjee;Abhishek Moitra;Priyadarshini Panda
AI chips commonly employ SRAM as buffers for its reliability and speed, which contribute to high performance. However, SRAM is expensive and demands significant area and energy. Previous studies have explored replacing SRAM with emerging technologies, such as nonvolatile memory, which offers fast read access and a small cell area. Despite these advantages, nonvolatile memory's slow write access and high write energy consumption prevent it from surpassing SRAM performance in AI applications with extensive memory access requirements. Some research has also investigated embedded dynamic random access memory (eDRAM) as an area-efficient on-chip memory with access times similar to those of SRAM. Still, refresh power remains a concern, leaving the trade-off among performance, area, and power consumption unresolved. To address this issue, this article presents a novel mixed CMOS cell memory design that balances performance, area, and energy efficiency for AI memory by combining SRAM and eDRAM cells. We consider a ratio of one SRAM cell to seven eDRAM cells in the memory to achieve area reduction using mixed CMOS cell memory. In addition, we capitalize on the characteristics of deep neural network (DNN) data representation and integrate asymmetric eDRAM cells to lower energy consumption. To validate our proposed MCAIMem solution, we conduct extensive simulations and benchmarking against traditional SRAM. Our results demonstrate that the MCAIMem significantly outperforms these alternatives in terms of area and energy efficiency. Specifically, our MCAIMem can reduce the area by 48% and energy consumption by $3.4\times$