
IEEE Transactions on Very Large Scale Integration (VLSI) Systems: Latest Publications

An Efficient and Precision-Reconfigurable Digital CIM Macro for DNN Accelerators
IF 2.8 | CAS Zone 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-24 | DOI: 10.1109/TVLSI.2024.3455091
Dingyang Zou;Gaoche Zhang;Xu Zhang;Meiqi Wang;Zhongfeng Wang
Due to the demand for high energy efficiency in deep neural network (DNN) accelerators, computing-in-memory (CIM) is becoming increasingly popular in recent years. However, current CIM designs suffer from high latency and insufficient flexibility. To address the issues, this brief proposes a Booth-multiplication-based CIM macro (BCIM) with modified Booth encoding and partial product (PP) generation method specially designed for CIM architecture. In addition, a methodology is presented for designing precision-reconfigurable digital CIM macros. We also optimize the precision-reconfigurable shift adder in the macro based on the cutting down carry connection method. The design attains a performance of 2048 GOPS and a peak energy efficiency of 79.15 TOPS/W in the signed INT4 mode at a frequency of 500 MHz.
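The modified Booth encoding that BCIM builds on halves the number of partial products relative to a bit-serial schoolbook multiply. As a software illustration of standard radix-4 Booth encoding only (the brief's CIM-specific PP generation is not reproduced here), a minimal sketch:

```python
def booth_radix4_partial_products(multiplier: int, multiplicand: int, bits: int = 8):
    """Radix-4 (modified) Booth encoding: scan overlapping 3-bit windows of the
    multiplier, mapping each window to a signed digit in {-2, -1, 0, 1, 2}.
    Works for two's-complement multipliers of width `bits`."""
    # Append a zero bit below the LSB so the first window is (b1, b0, 0).
    m = (multiplier & ((1 << bits) - 1)) << 1
    encode = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
              0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    digits = [encode[(m >> i) & 0b111] for i in range(0, bits, 2)]
    # Digit k carries weight 4**k; each partial product is digit * multiplicand.
    pps = [d * multiplicand * (4 ** k) for k, d in enumerate(digits)]
    return digits, pps

digits, pps = booth_radix4_partial_products(0b01101101, 3)  # 109 * 3
assert sum(pps) == 109 * 3
```

Each overlapping 3-bit window maps to one signed digit, so an 8-bit multiplier yields four partial products instead of eight, which is what makes the encoding attractive for a latency-sensitive CIM array.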
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 563-567.
Citations: 0
SCOPE: Schoolbook-Originated Novel Polynomial Multiplication Accelerators for NTRU-Based PQC
IF 2.8 | CAS Zone 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-24 | DOI: 10.1109/TVLSI.2024.3458872
Yazheng Tu;Shi Bai;Jinjun Xiong;Jiafeng Xie
The Nth-degree truncated polynomial ring units (NTRU)-based postquantum cryptography (PQC) has drawn significant attention from the research community; e.g., the National Institute of Standards and Technology (NIST) PQC standardization process selected the fast Fourier lattice-based compact (Falcon) algorithm. Following this research trend, efficient hardware accelerator design for polynomial multiplication (an important component of NTRU-based PQC) is crucial. Unlike the commonly used number theoretic transform (NTT) method, this article presents a novel SChoolbook-Originated Polynomial multiplication accElerators (SCOPE) design framework. Overall, we propose the schoolbook-based method in an innovative format to implement the targeted polynomial multiplication, first through a schoolbook-variant version and then through a Toeplitz matrix-vector product (TMVP)-based approach. Four layers of coherent and interdependent efforts have been carried out: 1) a novel lookup table (LUT)-based point-wise multiplier is proposed along with a related modular reduction technique to obtain an optimal implementation; 2) a new hardware accelerator is introduced for the targeted polynomial multiplication, deploying the proposed point-wise multiplier; 3) the proposed architecture is extended to a TMVP-based polynomial multiplication accelerator; and 4) the efficiency of the proposed accelerators is demonstrated through implementation and comparison. Finally, the proposed design strategy is also extended to another NTRU-based scheme and to other schoolbook- and Toom-Cook-based polynomial multiplications (used in other PQC), obtaining the same superior performance. We hope that the outcome of this research can impact the ongoing NIST PQC standardization process and related full-hardware implementation work for schemes like Falcon.
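As background for the schoolbook baseline that SCOPE reorganizes, a minimal software model of polynomial multiplication in the cyclic ring Z_q[x]/(x^n - 1) used by NTRU-style schemes (toy parameters for illustration, not the framework's hardware data path):

```python
def schoolbook_polymul(a, b, n, q):
    """Schoolbook polynomial multiplication in Z_q[x]/(x^n - 1): n^2 coefficient
    products with cyclic wrap-around of exponents. The same cyclic convolution
    can be read as a circulant (a special Toeplitz) matrix-vector product,
    which is what a TMVP-based formulation exploits."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            # x^(i+j) wraps to x^((i+j) mod n) because x^n == 1 in this ring.
            c[(i + j) % n] = (c[(i + j) % n] + a[i] * b[j]) % q
    return c

# Toy check: x * (x + 1) = x^2 + x in Z_17[x]/(x^3 - 1).
assert schoolbook_polymul([0, 1, 0], [1, 1, 0], 3, 17) == [0, 1, 1]
```

The quadratic loop above is exactly the work that an NTT avoids; SCOPE's premise is that the schoolbook/TMVP form, despite its asymptotic cost, maps well onto parallel hardware.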
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 408-420.
Citations: 0
Keelhaul: Processor-Driven Chip Connectivity and Memory Map Metadata Validator for Large Systems-on-Chip
IF 2.8 | CAS Zone 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-23 | DOI: 10.1109/TVLSI.2024.3454431
Henri Lunnikivi;Roni Hämäläinen;Timo D. Hämäläinen
The integration of large-scale systems-on-chip warrants thorough verification both at the level of the individual component and at the system level. In this article, we address the automated testing of system-level memory maps. The golden reference is the IEEE 1685/IP-XACT hardware description, which includes implementation-agnostic definitions for the global memory map. The IP-XACT description is used as a specification for implementing the registers and memory regions in a register transfer-level (RTL) language, and for implementing the corresponding hardware-dependent software. The challenge is that hardware design changes might not always propagate to firmware and application developers, which causes errors and faults. We present a method and a tool called Keelhaul, which takes as input the CMSIS-SVD format commonly used for firmware development and generates automated software tests that attempt to access all available memory-mapped input/output registers. During development of a large-scale research-focused multiprocessor system-on-chip, we ran a total of 32 automatically generated test suites per pipeline, comprising 882 test cases for each of its two CPU subsystems. A total of 15 distinct issues were found by the tool in the lead-up to tapeout. Another research-focused SoC was validated post-tapeout with 984 test cases generated for each core, resulting in the discovery of four distinct issues. Keelhaul can be used with any IP-XACT- or CMSIS-SVD-based system-on-chip that includes processors for accessing implemented registers and memory regions.
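The core test-generation idea can be sketched as: parse a register description and emit one access test per memory-mapped register. The register entries and emitted C fragments below are hypothetical stand-ins for illustration, not Keelhaul's actual input schema or output:

```python
# Hypothetical register entries, loosely modeled on CMSIS-SVD fields
# (name, address, access, reset value). Not Keelhaul's real input format.
REGISTERS = [
    {"name": "UART0_CTRL",   "addr": 0x40000000, "access": "read-write", "reset": 0x0},
    {"name": "UART0_STATUS", "addr": 0x40000004, "access": "read-only",  "reset": 0x1},
]

def generate_tests(registers):
    """Emit a C-style test fragment per register: read each register through a
    volatile pointer, and check the documented reset value where one exists."""
    lines = []
    for reg in registers:
        lines.append(f"// test {reg['name']}")
        lines.append(f"value = *(volatile uint32_t *)0x{reg['addr']:08X}U;")
        if reg["reset"] is not None and "read" in reg["access"]:
            lines.append(f"assert(value == 0x{reg['reset']:X}U);")
    return "\n".join(lines)

print(generate_tests(REGISTERS))
```

A real generator additionally has to respect access side effects (write-only fields, read-to-clear status bits) and bus errors on unmapped addresses, which is where the metadata validation value lies.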
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 12, pp. 2269-2280.
Citations: 0
A 0.4 V, 12.2 pW Leakage, 36.5 fJ/Step Switching Efficiency Data Retention Flip-Flop in 22 nm FDSOI
IF 2.8 | CAS Zone 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-20 | DOI: 10.1109/TVLSI.2024.3453946
Yuxin Ji;Yuhang Zhang;Changyan Chen;Jian Zhao;Fakhrul Zaman Rokhani;Yehea Ismail;Yongfu Li
Data-retention flip-flops (DR-FFs) efficiently maintain data during sleep mode and retain state during transitions between active and sleep mode. This brief proposes an ultralow-power DR-FF design with an improved autonomous data-retention (ADR) latch operating over a supply voltage range down to near/subthreshold, achieving a sleep-mode leakage power of 12.2 pW, 1.4×–3.8× lower than prior CMOS DR-FFs. Our proposed DR-FFs achieve the lowest active-mode switching energy of 36.5 fJ/step, 1.2×–4× lower than prior works, with a comparable transition efficiency of 1.9 fJ/step. Furthermore, our proposed DR-FFs require minimal control signals, logic gates, and switches, significantly reducing design complexity and avoiding the drawbacks of nonvolatile data-retention FFs (NV-FFs).
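The brief's headline figures can be folded into a back-of-envelope energy budget for one flip-flop over a sleep/wake duty cycle; the duty-cycle numbers in the example are illustrative assumptions, not measurements from the brief:

```python
def dr_ff_energy_j(sleep_s, toggles, mode_transitions,
                   leak_w=12.2e-12, e_switch_j=36.5e-15, e_trans_j=1.9e-15):
    """Total energy (joules) for one DR-FF: sleep-mode leakage plus active-mode
    switching plus sleep/wake transitions, using the brief's headline figures
    (12.2 pW leakage, 36.5 fJ/step switching, 1.9 fJ/step transition)."""
    return sleep_s * leak_w + toggles * e_switch_j + mode_transitions * e_trans_j

# Illustrative duty cycle: 1 s asleep, 1000 active toggles, one sleep/wake pair.
e = dr_ff_energy_j(1.0, 1000, 2)
```

At these assumed numbers, leakage (about 12 pJ) and switching (about 37 pJ) dominate, while the two mode transitions contribute only a few femtojoules, which is why low-overhead retention transitions matter mainly for very frequent power gating.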
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 573-577.
Citations: 0
Quasi-Adiabatic Clock Networks in 3-D Voltage Stacked Systems
IF 2.8 | CAS Zone 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-19 | DOI: 10.1109/TVLSI.2024.3448374
Andres Ayes;Eby G. Friedman
Power delivery in three-dimensional (3-D) integrated systems poses several challenges such as high current densities, large voltage drops due to multiple levels of resistive vertical interconnect, and significant switching noise originating from transient currents within different layers. Voltage stacking is a power delivery technique that is highly compatible with 3-D integration due to the physical proximity between layers, enabling the efficient transfer of recycled current. Power noise in clock networks is, however, not inherently addressed by 3-D voltage stacking. In this brief, a quasi-adiabatic technique between multiple clock networks within 3-D voltage stacked systems is proposed. The technique exploits the proximity of the clock networks to enable mutual charging and discharging when the clock signals transition to the same voltage. During this transition, the clock distribution networks are isolated from the power grid, reducing simultaneous switching noise and current load. The maximum current is reduced by an additional 13% as compared to only voltage stacking, the maximum voltage noise is reduced by up to 72% when the clock networks are isolated from the power grids, and the clock networks pull nearly 50% less charge from the source. The proposed technique is evaluated on a 7 nm predictive technology model.
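The reported near-50% reduction in source charge follows from elementary charge sharing: when two equal clock capacitances are briefly connected while one falls and the other rises, the rising net reaches half rail for free, and the supply only tops off the remaining half swing. A sketch with illustrative component values:

```python
def supply_charge_c(cap_f, vdd, charge_share=False):
    """Charge (coulombs) drawn from the supply to swing a clock net of
    capacitance cap_f up to vdd. With ideal charge sharing between two equal
    nets (one falling, one rising), the rising net starts its supply-driven
    phase from vdd/2 instead of 0."""
    v_start = vdd / 2 if charge_share else 0.0
    return cap_f * (vdd - v_start)

cap_f, vdd = 10e-12, 0.7  # 10 pF clock network, 0.7 V rail (illustrative values)
q_full = supply_charge_c(cap_f, vdd)
q_shared = supply_charge_c(cap_f, vdd, charge_share=True)
```

Here `q_shared` is half of `q_full`, matching the roughly 50% source-charge reduction the article reports when the clock networks exchange charge while isolated from the power grid; real nets fall short of the ideal because the capacitances are unequal and the sharing switch has finite resistance.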
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 12, pp. 2394-2397.
Citations: 0
The Error Analysis of Bit Weight Self-Calibration Methods for High-Resolution SAR ADCs
IF 2.8 | CAS Zone 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-19 | DOI: 10.1109/TVLSI.2024.3458071
Yanhang Chen;Siji Huang;Qifeng Huang;Yifei Fan;Jie Yuan
High-resolution successive approximation register (SAR) analog-to-digital converters (ADCs) commonly need to calibrate their bit weights. Due to the nonidealities of the calibration circuits, the calibrated bit weights carry errors. This error could propagate during the calibration procedure. Due to the high precision requirement of these ADCs, such residue error commonly becomes the signal-to-noise-and-distortion ratio (SNDR) bottleneck of the overall ADC. This article presents an analysis of the residue error from bit weight self-calibration methods of high-resolution SAR ADCs. The major sources contributing to this error and the error reduction methods are quantitatively analyzed. A statistical analysis of the noise-induced random error is developed. Our statistical model finds that the noise-induced random error follows the chi-square distribution. In practice, this random error is commonly reduced by repetitively measuring and averaging the calibrated bit weights. Our statistical model quantifies this bit weight error and leads to a clearer understanding of the error mechanism and design trade-offs. Following our chi-square model, the SNDR degradation due to the circuit noise during the calibration can be easily estimated without going through the time-consuming traditional transistor-level design and simulation process. The required repetition time can also be calculated. The bit-weight error models derived in this article are verified with measurements on a 16-bit SAR ADC design in a 180-nm CMOS process. Results from our model match both simulations and measurements well.
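The averaging step the article analyzes can be demonstrated with a small Monte Carlo experiment; the Gaussian measurement-noise model and parameter values below are illustrative assumptions for the sketch, not the article's statistical model (which derives a chi-square distribution for the resulting error):

```python
import random

def calibrated_weight(true_w, sigma, n_rep, rng):
    """One bit-weight calibration: each measurement adds zero-mean circuit
    noise; averaging n_rep repetitions shrinks the error variance by ~n_rep."""
    return sum(true_w + rng.gauss(0.0, sigma) for _ in range(n_rep)) / n_rep

rng = random.Random(0)  # fixed seed for reproducibility
true_w, sigma, trials = 1.0, 1e-3, 2000
err_1 = [calibrated_weight(true_w, sigma, 1, rng) - true_w for _ in range(trials)]
err_16 = [calibrated_weight(true_w, sigma, 16, rng) - true_w for _ in range(trials)]

def mean_sq(errs):
    return sum(x * x for x in errs) / len(errs)

ratio = mean_sq(err_1) / mean_sq(err_16)  # expect roughly 16
```

The 1/N variance scaling is what lets a designer trade calibration (repetition) time against residual bit-weight error, which the article's chi-square model turns into an SNDR estimate without transistor-level simulation.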
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 11, pp. 1983-1992.
Citations: 0
MCAIMem: A Mixed SRAM and eDRAM Cell for Area and Energy-Efficient On-Chip AI Memory
IF 2.8 | CAS Zone 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-18 | DOI: 10.1109/TVLSI.2024.3439231
Duy-Thanh Nguyen;Abhiroop Bhattacharjee;Abhishek Moitra;Priyadarshini Panda
AI chips commonly employ SRAM memory as buffers for their reliability and speed, which contribute to high performance. However, SRAM is expensive and demands significant area and energy consumption. Previous studies have explored replacing SRAM with emerging technologies, such as nonvolatile memory, which offers fast read memory access and a small cell area. Despite these advantages, nonvolatile memory's slow write memory access and high write energy consumption prevent it from surpassing SRAM performance in AI applications with extensive memory access requirements. Some research has also investigated embedded dynamic random access memory (eDRAM) as an area-efficient on-chip memory with similar access times as SRAM. Still, refresh power remains a concern, leaving the trade-off among performance, area, and power consumption unresolved. To address this issue, this article presents a novel mixed CMOS cell memory design that balances performance, area, and energy efficiency for AI memory by combining SRAM and eDRAM cells. We consider the proportion ratio of one SRAM and seven eDRAM cells in the memory to achieve area reduction using mixed CMOS cell memory. In addition, we capitalize on the characteristics of deep neural network (DNN) data representation and integrate asymmetric eDRAM cells to lower energy consumption. To validate our proposed MCAIMem solution, we conduct extensive simulations and benchmarking against traditional SRAM. Our results demonstrate that the MCAIMem significantly outperforms these alternatives in terms of area and energy efficiency. Specifically, our MCAIMem can reduce the area by 48% and energy consumption by 3.4× compared with SRAM designs, without incurring any accuracy loss.
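The area argument behind the 1:7 SRAM-to-eDRAM proportion reduces to simple cell-area bookkeeping. The cell areas in this sketch are illustrative placeholders (eDRAM cells are substantially smaller than 6T SRAM cells), not figures from the article:

```python
def buffer_area(n_bits, sram_cell, edram_cell, ratio=(1, 7)):
    """Area of a buffer mixing SRAM and eDRAM cells in the given proportion,
    versus an all-SRAM buffer of the same capacity (arbitrary area units)."""
    s, e = ratio
    frac_sram = s / (s + e)
    mixed = n_bits * (frac_sram * sram_cell + (1 - frac_sram) * edram_cell)
    all_sram = n_bits * sram_cell
    return mixed, all_sram

# Illustrative: eDRAM cell at 0.4x the SRAM cell area, 64 KiB buffer.
mixed, full = buffer_area(64 * 1024 * 8, 1.0, 0.4)
saving = 1 - mixed / full  # ~52% area saving under these assumed cell areas
```

With these assumed cell areas the mix saves roughly half the area, in the same ballpark as the article's reported 48%; the exact figure depends on the real cell layouts and peripheral overhead.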
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 11, pp. 2023-2036.
Citations: 0
Marmotini: A Weight Density Adaptation Architecture With Hybrid Compression Method for Spiking Neural Network
IF 2.8 | CAS Zone 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2024-09-18 | DOI: 10.1109/TVLSI.2024.3453897
Zilin Wang;Yi Zhong;Zehong Ou;Youming Yang;Shuo Feng;Guang Chen;Xiaoxin Cui;Song Jia;Yuan Wang
Brain-inspired spiking neural networks (SNNs) have recently attracted widespread interest owing to their event-driven nature and the relatively low-power hardware needed to transmit highly sparse binary spikes. To further improve energy efficiency, matrix compression algorithms are used for weight storage. However, the weight sparsity of different layers varies greatly, so in a multicore neuromorphic system it is difficult for a single compression algorithm to suit all the layers of an SNN model. In this work, we propose a weight density adaptation architecture with a hybrid compression method for SNNs, named Marmotini. It is a multicore heterogeneous design comprising three types of cores that handle computation at different weight sparsities. Benefiting from the hybrid compression method, Marmotini minimizes the waste of neurons and weights as much as possible. In addition, for better flexibility, a reconfigurable core that can be configured to compute convolutional or fully connected layers is proposed. Implemented on a Xilinx Kintex UltraScale XCKU115 field-programmable gate array (FPGA) board, Marmotini operates at a 150-MHz frequency, achieving 244.6-GSOP/s peak performance and 54.1-GSOP/W energy efficiency at 0% spike sparsity.
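The motivation for a per-layer choice of format can be seen with simple storage bookkeeping, using compressed sparse row (CSR) as a stand-in compression scheme (the article's actual formats may differ): compression only beats dense storage once a layer is sparse enough.

```python
def storage_bits(weights, weight_bits=8, index_bits=8):
    """Bits to store a weight matrix densely vs. in CSR form (one value plus
    one column index per nonzero, plus a 32-bit row pointer per row and one
    terminator). Bit widths here are illustrative assumptions."""
    rows, cols = len(weights), len(weights[0])
    nnz = sum(1 for row in weights for w in row if w != 0)
    dense = rows * cols * weight_bits
    csr = nnz * (weight_bits + index_bits) + (rows + 1) * 32
    return dense, csr

# A 32x32 layer with ~95% zeros: CSR comes out roughly 4x smaller.
m = [[1 if (i * 32 + j) % 20 == 0 else 0 for j in range(32)] for i in range(32)]
dense_bits, csr_bits = storage_bits(m)
```

For a nearly dense layer the index and pointer overhead makes CSR larger than the dense array, which is exactly why a heterogeneous design assigns layers of different sparsity to different core types.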
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 12, pp. 2293–2302.
A 22-nm 264-GOPS/mm2 6T SRAM and Proportional Current Compute Cell-Based Computing-in-Memory Macro for CNNs
IF 2.8 CAS Q2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-09-18 DOI: 10.1109/TVLSI.2024.3446045
Feiran Liu;Anran Yin;Chen Xue;Bo Wang;Zhongyuan Feng;Han Liu;Xiang Li;Hui Gao;Tianzhu Xiong;Xin Si
With the rise of artificial intelligence and big data applications, the general-purpose von Neumann architecture can no longer fulfill the requirements of these scenarios. The large number of parallelizable, repeatable multiply-and-accumulate (MAC) operations in deep neural networks opens the door to storage-computing integrated architectures. Current-based computation and quantization are employed to circumvent the signal-margin limitations imposed by the computing unit's supply voltage, thereby facilitating low-power design. The proposed design is a computing-in-memory (CIM) circuit based on current-sampling accumulation and applies a current-sensing analog-to-digital converter that is less sensitive to parasitic capacitance than voltage-based analog-to-digital converters. Its power consumption is proportional to the input current, yielding higher area efficiency and energy efficiency. The current-sampling CIM circuit is fabricated in a 22-nm FDSOI process with an area efficiency of 264 GOPS/mm2. The peak energy efficiency is 20.81 TOPS/W, and the inference accuracy reaches 92.11% when applied to VGG-16 on the CIFAR-10 dataset.
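A behavioral model of the current-domain MAC described above — cell currents proportional to input × weight summed on a shared line, then digitized by a current-sensing ADC — might look like the following sketch. The unit current, full-scale range, and ADC resolution here are placeholder parameters, not values from the paper:

```python
import numpy as np

def current_mode_mac(inputs, weights, i_unit=1.0, adc_bits=4, i_full_scale=15.0):
    """Behavioral sketch of a current-domain MAC with a current-sensing ADC.

    Each cell sources a current proportional to input * weight (the
    "proportional current compute cell" idea); currents summed on the
    line are then quantized. All electrical parameters are placeholders.
    """
    i_line = i_unit * float(np.dot(inputs, weights))  # currents add on the line
    lsb = i_full_scale / (2 ** adc_bits - 1)          # ADC step size
    code = int(round(i_line / lsb))
    return max(0, min(code, 2 ** adc_bits - 1))       # clamp to ADC code range
```

Because power scales with the summed line current, sparse or small-valued activations draw proportionally less energy — the property the abstract credits for the efficiency gains.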
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 12, pp. 2389–2393.
An Interpolation-Free Fractional Motion Estimation Algorithm and Hardware Implementation for VVC
IF 2.8 CAS Q2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-09-17 DOI: 10.1109/TVLSI.2024.3455374
Shushi Chen;Leilei Huang;Zhao Zan;Xiaoyang Zeng;Yibo Fan
Versatile video coding (VVC) introduces the multi-type tree (MTT) and larger coding tree units (CTUs) to improve compression efficiency over its predecessor, High Efficiency Video Coding (HEVC). This demands higher throughput from fractional motion estimation (FME) to meet real-time processing needs. In this context, this article proposes an interpolation-free algorithm based on an error surface to improve the throughput of FME hardware. The error surface is constructed from the rate-distortion costs (RDCs) of the integer motion vector (IMV) and its neighbors. To improve prediction accuracy, a hardware-friendly RDC estimation strategy is proposed for constructing the error surface. Experimental results show that the Bjontegaard Delta Bit Rate (BDBR) in the Random Access (RA), Low Delay P (LDP), and Low Delay B (LDB) configurations increases by only 0.358%, 0.479%, and 0.511%, respectively, compared with the VVC test model (VTM) 16.0. Compared with the default FME algorithms of VVC, the time cost of FME is reduced by 53.47%, 56.28%, and 54.23% in the RA, LDP, and LDB configurations, respectively. The algorithm is free of iteration and interpolation, which contributes to low-cost, high-throughput hardware. The proposed architecture supports FME of all coding units (CUs) in a CTU with one layer of MTT under the quadtree (QT), and the CU size can vary from 8×8 to 128×128. Synthesized in a GF 28-nm process, the architecture achieves 7680×4320@60 fps throughput at 800 MHz, with a gate count of 244 K and power consumption of 76.5 mW. The proposed architecture meets the real-time coding requirements of VVC.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 395–407.