
Latest publications from IEEE Transactions on Very Large Scale Integration (VLSI) Systems

Manipulated Lookup Table Method for Efficient High-Performance Modular Multiplier
IF 2.8, CAS Tier 2 (Engineering & Technology), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-12-04. DOI: 10.1109/TVLSI.2024.3505920
Anawin Opasatian;Makoto Ikeda
Modular multiplication is a fundamental operation in many cryptographic systems, with its efficiency playing a crucial role in the overall performance of these systems. Since many cryptographic systems operate with a fixed modulus, we propose an enhancement to the fixed-modulus lookup table (LuT) method used for modular reduction, which we refer to as the manipulated LuT (MLuT) method. Our approach applies to any modulus and has demonstrated performance comparable to that of specialized reduction algorithms designed for specific moduli. The strength of the proposed method in terms of circuit performance is shown by implementing it on Virtex-7 and Virtex UltraScale+ FPGAs as the LUT-based MLuT modular multiplier (LUT-MLuTMM), with generalized parallel counters (GPCs) used in the summation step. In one-stage implementations, our method achieves up to a 90% reduction in area and a 50% reduction in latency compared with the generic LuT method. In multistage implementations, it offers the best area-interleaved-time product, with improvements of 39%, 13%, and 29% over the current state of the art for ~256-bit, SIKE434, and BLS12-381 modular multipliers, respectively. These results demonstrate the potential of our method for high-performance cryptographic accelerators employing a fixed modulus.
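The generic fixed-modulus LuT reduction that MLuT improves on can be sketched as follows. This is a minimal illustrative model, not the authors' implementation: the 8-bit chunk width, table layout, and function names are assumptions.

```python
# Sketch of generic fixed-modulus lookup-table (LuT) modular reduction.
# Assumed scheme: the upper half of a 2k-bit product is split into w-bit
# chunks, and each chunk's contribution modulo N is read from a table.

def build_tables(N, k, w=8):
    """Precompute T[j][d] = (d << (k + j*w)) mod N for each w-bit chunk."""
    tables = []
    n_chunks = (k + w - 1) // w
    for j in range(n_chunks):
        shift = k + j * w
        tables.append([(d << shift) % N for d in range(1 << w)])
    return tables

def lut_modmul(a, b, N, k, tables, w=8):
    """Multiply, then reduce: keep the low k bits and replace each upper
    chunk by its precomputed residue; a final small correction remains."""
    x = a * b
    lo = x & ((1 << k) - 1)
    hi = x >> k
    acc = lo
    j = 0
    while hi:
        acc += tables[j][hi & ((1 << w) - 1)]
        hi >>= w
        j += 1
    return acc % N  # acc is only a few multiples of N above the result
```

With a fixed modulus the tables are computed once, so the runtime cost is table reads plus one small summation, which is what maps well to FPGA LUTs and GPC-based summation trees.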
Citations: 0
A 0.875–0.95-pJ/b 40-Gb/s PAM-3 Baud-Rate Receiver With One-Tap DFE
IF 2.8, CAS Tier 2 (Engineering & Technology), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-12-04. DOI: 10.1109/TVLSI.2024.3507714
Jhe-En Lin;Shen-Iuan Liu
This article presents a 40-Gb/s (25.6-GBaud) three-level pulse-amplitude-modulation (PAM-3) baud-rate receiver with a one-tap decision-feedback equalizer (DFE). A baud-rate phase detector (BRPD) that locks at the point of zero first postcursor is proposed. In addition, by reusing the BRPD's error samplers, a weighting-coefficient calibration is presented that selects the DFE weighting coefficient maximizing the top level of the eye diagram, thereby improving eye height across different channel losses. An inductorless continuous-time linear equalizer (CTLE) and a variable-gain amplifier (VGA) are also included. The VGA adjusts the output common-mode resistance to control data swing, reducing power consumption when the required swing is small. Furthermore, the modified summer-merged slicers reduce the capacitance presented by the slicers to the VGA. Finally, a digital clock/data recovery (CDR) circuit is presented, which includes a demultiplexer (DeMUX) with a short delay time to reduce the loop latency. The 40-Gb/s PAM-3 receiver is fabricated in 28-nm CMOS technology. For a 25.6-GBaud pseudorandom ternary sequence of $3^{7}-1$, the measured bit error rate (BER) is below $10^{-12}$ for channel losses of 9 and 17.5 dB. At 9-dB loss, total power consumption is 35 mW with a calculated FoM of 0.875 pJ/bit. At 17.5-dB loss, total power consumption is 38 mW with a calculated FoM of 0.95 pJ/bit.
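The one-tap DFE idea in this abstract can be illustrated with a small behavioral model. The PAM-3 levels, slicing thresholds, and postcursor weight below are illustrative assumptions, not the paper's circuit.

```python
# Behavioral sketch of a one-tap DFE for PAM-3, assuming ideal timing and a
# known first-postcursor weight w; levels {-1, 0, +1}, thresholds at +/-0.5.

def pam3_slice(v):
    """Slice an equalized sample to the nearest PAM-3 level."""
    if v > 0.5:
        return 1
    if v < -0.5:
        return -1
    return 0

def one_tap_dfe(samples, w):
    """Subtract w times the previous decision (the first-postcursor ISI)
    from each incoming sample before slicing."""
    prev = 0
    out = []
    for s in samples:
        d = pam3_slice(s - w * prev)  # feedback cancels one symbol of ISI
        out.append(d)
        prev = d
    return out
```

In the real receiver the same subtraction happens in analog/mixed-signal hardware per baud, which is why the feedback path's timing (one unit interval) is the critical constraint.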
Citations: 0
VSAGE: An End-to-End Automated VCO-Based ΔΣ ADC Generator
IF 2.8, CAS Tier 2 (Engineering & Technology), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-12-04. DOI: 10.1109/TVLSI.2024.3507567
Ken Li;Tian Xie;Tzu-Han Wang;Shaolan Li
This article presents VSAGE, an agile end-to-end automated voltage-controlled-oscillator (VCO)-based $\Delta\Sigma$ analog-to-digital converter (ADC) generator. It exploits time-domain architectures and a time-domain design mindset, so that the design flow is oriented around digital standard cells, in contrast to the transistor-level-focused approach of conventional analog design. Through this, it speeds up and simplifies both the synthesis and layout phases. Combined with an efficient knowledge- and machine-learning (ML)-guided synthesis flow, it can translate input specifications into a full system layout with reliable performance within minutes. This work also features a compact oscillator and system-modeling method that enables accurate computation and network training with light resource usage. The generator is verified with 12 design cases in 65-nm and 28-nm processes, proving its capability to generate competitive designs with good process portability.
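The VCO-based ΔΣ principle such a generator targets can be shown with a minimal behavioral model (assumed constants; the actual design is a circuit, not this model): the VCO integrates the input as phase, and first-differencing the sampled phase count yields a first-order noise-shaped digital code.

```python
# Behavioral sketch of a VCO-based first-order noise-shaping ADC.
# f0 and kvco are illustrative; inputs are normalized to [-1, 1].

def vco_adc(x, f0=8.0, kvco=4.0):
    """Return phase-count first differences for input samples x."""
    phase = 0.0
    prev_count = 0
    out = []
    for v in x:
        phase += f0 + kvco * v          # VCO: frequency integrates to phase
        count = int(phase)              # counter quantizes accumulated phase
        out.append(count - prev_count)  # first difference = output code
        prev_count = count
    return out
```

The quantization error cancels between successive samples except for its first difference, which is the first-order noise shaping that makes an all-digital (counter plus subtractor) readout viable in standard cells.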
Citations: 0
MCM-SR: Multiple Constant Multiplication-Based CNN Streaming Hardware Architecture for Super-Resolution
IF 2.8, CAS Tier 2 (Engineering & Technology), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-12-04. DOI: 10.1109/TVLSI.2024.3504513
Seung-Hwan Bae;Hyuk-Jae Lee;Hyun Kim
Convolutional neural network (CNN)-based super-resolution (SR) methods have become prevalent in display devices due to their superior image quality. However, the significant computational demands of CNN-based SR require hardware accelerators for real-time processing. Among hardware architectures, the streaming architecture can significantly reduce latency and power consumption by minimizing external dynamic random access memory (DRAM) access. Nevertheless, this architecture requires considerable hardware area, as each layer needs a dedicated processing engine, and achieving high hardware utilization with it demands substantial design expertise. In this article, we propose methods to reduce the hardware resources of CNN-based SR accelerators by applying the multiple constant multiplication (MCM) algorithm. We propose a loop-interchange method for the convolution (CONV) operation that reduces the logic area by 23%, and an adaptive per-layer loop-interchange method that considers the static random access memory (SRAM) and logic area simultaneously to reduce the SRAM size by 15%. In addition, when CONV weights are approximated to reduce hardware resources, beam search improves the MCM graph exploration speed by $5.4\times$ while maintaining SR quality.
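The MCM idea at the core of the proposal, realizing several constant multiplications of one input with shared shift-and-add terms instead of full multipliers, can be sketched as follows; the coefficients are illustrative, not taken from the paper.

```python
# Sketch of multiple constant multiplication (MCM): one input x multiplied
# by several fixed coefficients using only shifts/adds, sharing the
# intermediate term 5x = (x << 2) + x between two of the products.

def mcm_example(x):
    """Compute 7*x, 13*x, and 21*x with shared shift-add terms:
    7 = 8 - 1,  13 = 8 + 5,  21 = 16 + 5."""
    x5 = (x << 2) + x     # 5x, the shared subexpression
    y7 = (x << 3) - x     # 7x  = 8x - x
    y13 = (x << 3) + x5   # 13x = 8x + 5x
    y21 = (x << 4) + x5   # 21x = 16x + 5x
    return y7, y13, y21
```

In hardware, each shift is free wiring and each add is one adder, so sharing subexpressions across a layer's fixed CONV weights directly shrinks logic area.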
Citations: 0
A 0.09-pJ/Bit Logic-Compatible Multiple-Time Programmable (MTP) Memory-Based PUF Design for IoT Applications
IF 2.8, CAS Tier 2 (Engineering & Technology), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-27. DOI: 10.1109/TVLSI.2024.3496735
Shuming Guo;Yinyin Lin;Hao Wang;Yao Li;Chongyan Gu;Weiqiang Liu;Yijun Cui
The Internet of Things (IoT) allows devices to interact for real-time data transfer and remote control. However, IoT hardware devices have been shown to contain security vulnerabilities. Edge-device authentication, a crucial process in IoT systems, generates and uses unique IDs for secure data transmission. Conventional authentication techniques are computationally heavyweight and often infeasible on IoT devices with limited resources. Physical unclonable functions (PUFs), a lightweight hardware-based security primitive, have been proposed for resource-constrained applications. We propose a new PUF design for resource-constrained IoT devices based on low-cost logic-compatible multiple-time programmable (MTP) memory cells. The structure comprises an array of MTP differential memory cells and a PUF extraction circuit. The extraction method uses the random distribution of bit-line (BL) current after programming each memory cell in the logic-compatible MTP memory as the entropy source of the PUF. Responses are obtained by comparing the current values of two memory cells at the address selected by a challenge, forming challenge–response pairs (CRPs). This scheme adds no hardware overhead or circuit differences on edge devices and constitutes an intrinsic PUF. Finally, 200 PUF chips were fabricated by CSMC in a 0.153-$\mu$m MCU single-gate CMOS process. The performance of the logic-compatible MTP memory cell and its PUF was evaluated. A logic-compatible MTP cell offers good programming/erase efficiency, durability, and retention. The uniqueness of the proposed PUF is 50.29%, the uniformity is 51.82%, and the reliability is 93.61%.
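The challenge-response extraction described above can be illustrated with a toy model; the simulated cell currents, seed, and array size are placeholders, not measured silicon data.

```python
# Toy model of MTP-memory PUF response extraction: a challenge addresses a
# pair of differential cells, and the response bit is the sign of their
# bit-line current difference. Currents are simulated Gaussians (placeholders).

import random

def make_array(n_pairs, seed=42):
    """Simulate post-programming BL currents (arbitrary units) per cell pair."""
    rng = random.Random(seed)
    return [(rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)) for _ in range(n_pairs)]

def response(array, challenge):
    """Response is 1 if the first cell of the addressed pair sinks more current."""
    i_a, i_b = array[challenge % len(array)]
    return 1 if i_a > i_b else 0

def crp_table(array):
    """Enumerate all challenge-response pairs (CRPs) for the array."""
    return [(c, response(array, c)) for c in range(len(array))]
```

Because the current mismatch is fixed at manufacture, the same chip reproduces its responses (reliability), while independent chips disagree on roughly half the bits (uniqueness), which matches the ~50% figures reported.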
Citations: 0
An RISC-V PPA-Fusion Cooperative Optimization Framework Based on Hybrid Strategies
IF 2.8, CAS Tier 2 (Engineering & Technology), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-25. DOI: 10.1109/TVLSI.2024.3496858
Tianning Gao;Yifan Wang;Ming Zhu;Xiulong Wu;Dian Zhou;Zhaori Bi
The optimization of RISC-V designs, encompassing both microarchitecture and CAD tool parameters, is a great challenge due to an extensive and high-dimensional search space. Conventional optimization methods, such as case-specific approaches and black-box optimization approaches, often fall short of addressing the diverse and complex nature of RISC-V designs. To achieve optimal results across various RISC-V designs, we propose the cooperative optimization framework (COF) that integrates multiple black-box optimizers, each specializing in different optimization problems. The COF introduces the landscape knowledge exchange mechanism (LKEM) to direct the optimizers to share their knowledge of the optimization problem. Moreover, the COF employs the dynamic computational resource allocation (DCRA) strategies to dynamically allocate computational resources to the optimizers. The DCRA strategies are guided by the optimizer efficiency evaluation (OEE) mechanism and a time series forecasting (TSF) model. The OEE provides real-time performance evaluations. The TSF model forecasts the optimization progress made by the optimizers, given the allocated computational resources. In our experiments, the COF reduced the cycle per instruction (CPI) of the Berkeley out-of-order machine (BOOM) by 15.36% and the power of Rocket-Chip by 12.84% without constraint violation compared to the respective initial designs.
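The DCRA idea can be sketched as a budget split proportional to each optimizer's forecast improvement. The linear extrapolation below is a naive stand-in for the paper's TSF model, and all names and numbers are illustrative.

```python
# Sketch of dynamic computational resource allocation (DCRA): each optimizer
# gets a share of the next round's evaluation budget proportional to its
# forecast gain. forecast_gain() is a placeholder for the TSF model.

def forecast_gain(history):
    """Naive stand-in for the TSF model: extrapolate the last improvement
    in an optimizer's best-so-far score history."""
    if len(history) < 2:
        return 1.0
    return max(history[-1] - history[-2], 0.0)

def allocate(budget, histories):
    """Split `budget` evaluations across optimizers by forecast gain;
    fall back to an even split when no optimizer forecasts progress."""
    gains = [forecast_gain(h) for h in histories]
    total = sum(gains)
    if total == 0:
        return [budget // len(histories)] * len(histories)
    return [int(budget * g / total) for g in gains]
```

A real implementation would also fold in the OEE's real-time efficiency scores and keep a minimum allocation per optimizer so stalled optimizers can recover; this sketch shows only the proportional-split skeleton.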
Citations: 0
ArXrCiM: Architectural Exploration of Application-Specific Resonant SRAM Compute-in-Memory
IF 2.8, CAS Tier 2 (Engineering & Technology), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-25. DOI: 10.1109/TVLSI.2024.3502359
Dhandeep Challagundla;Ignatius Bezzam;Riadul Islam
While general-purpose computing follows von Neumann's architecture, the data movement between memory and processor elements dictates the processor's performance. The evolving compute-in-memory (CiM) paradigm tackles this issue by facilitating simultaneous processing and storage within static random-access memory (SRAM) elements. Numerous design decisions taken at different levels of hierarchy affect the figures of merit (FoMs) of SRAM, such as power, performance, area, and yield. The absence of a rapid assessment mechanism for the impact of changes at different hierarchy levels on global FoMs poses a challenge to accurately evaluating innovative SRAM designs. This article presents an automation tool designed to optimize the energy and latency of SRAM designs incorporating diverse implementation strategies for executing logic operations within the SRAM. The tool structure allows easy comparison across different array topologies and various design strategies to arrive at energy-efficient implementations. Our study involves a comprehensive comparison of over 6900 distinct design implementation strategies for École Polytechnique Fédérale de Lausanne (EPFL) combinational benchmark circuits on the energy-recycling resonant CiM (rCiM) architecture designed using Taiwan Semiconductor Manufacturing Company (TSMC) 28-nm technology. When provided with a combinational circuit, the tool generates an energy-efficient implementation strategy tailored to the specified input memory and latency constraints. By exploiting the parallel processing capability of rCiM caches ranging in size from 4 to 192 kB, the six-topology implementation reduces energy consumption by 80.9% on average across all benchmarks compared with the baseline single-macro topology.
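The tool's final selection step, choosing an implementation strategy under constraints, reduces to a feasibility-filtered minimization; the candidate tuples below are placeholders, not results from the paper.

```python
# Sketch of constraint-driven strategy selection: among candidate
# (name, energy, latency) strategies, pick the lowest-energy one that
# meets the latency bound. Candidate values are illustrative.

def pick_strategy(candidates, max_latency_ns):
    """candidates: list of (name, energy_pj, latency_ns) tuples."""
    feasible = [c for c in candidates if c[2] <= max_latency_ns]
    if not feasible:
        raise ValueError("no strategy meets the latency constraint")
    return min(feasible, key=lambda c: c[1])
```

The hard part the paper addresses is generating and evaluating the thousands of candidates quickly enough that a selection like this is meaningful; the selection itself is cheap.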
Citations: 0
Securet3d: An Adaptive, Secure, and Fault-Tolerant Aware Routing Algorithm for Vertically–Partially Connected 3D-NoC
IF 2.8, CAS Tier 2 (Engineering & Technology), Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE. Pub Date: 2024-11-25. DOI: 10.1109/TVLSI.2024.3500575
Alexandre Almeida da Silva;Lucas Nogueira;Alexandre Coelho;Jarbas A. N. Silveira;César Marcon
Multiprocessor systems-on-chip (MPSoCs) based on 3-D networks-on-chip (3D-NoCs) are crucial architectures for robust parallel computing, efficiently sharing resources across complex applications. To ensure the secure operation of these systems, it is essential to implement adaptive, fault-tolerant mechanisms capable of protecting sensitive data. This work proposes the Securet3d routing algorithm, which establishes secure data paths in fault-tolerant 3D-NoCs. Our approach enhances the Reflect3d algorithm by introducing a detailed scheme for mapping secure paths and improving the system’s ability to withstand faults. To validate its effectiveness, we compare Securet3d with three other fault-tolerant routing algorithms for vertically-partially connected 3D-NoCs. All algorithms were implemented in SystemVerilog and evaluated through simulation using ModelSim and hardware synthesis with Cadence’s Genus tool. Experimental results show that Securet3d reduces latency and enhances cost-effectiveness compared with other approaches. When implemented with a 28-nm technology library, Securet3d demonstrates minimal area and energy overhead, indicating scalability and efficiency. Under denial-of-service (DoS) attacks, Securet3d maintains essentially unchanged average packet latencies of 70, 90, and 29 clock cycles for uniform random, bit-complement, and shuffle traffic, significantly lower than those of algorithms without security mechanisms (5763, 4632, and 3712 clock cycles on average, respectively). These results highlight the superior security, scalability, and adaptability of Securet3d for complex communication systems.
{"title":"Securet3d: An Adaptive, Secure, and Fault-Tolerant Aware Routing Algorithm for Vertically–Partially Connected 3D-NoC","authors":"Alexandre Almeida da Silva;Lucas Nogueira;Alexandre Coelho;Jarbas A. N. Silveira;César Marcon","doi":"10.1109/TVLSI.2024.3500575","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3500575","url":null,"abstract":"Multiprocessor systems-on-chip (MPSoCs) based on 3-D networks-on-chip (3D-NoCs) are crucial architectures for robust parallel computing, efficiently sharing resources across complex applications. To ensure the secure operation of these systems, it is essential to implement adaptive, fault-tolerant mechanisms capable of protecting sensitive data. This work proposes the Securet3d routing algorithm, which establishes secure data paths in fault-tolerant 3D-NoCs. Our approach enhances the Reflect3d algorithm by introducing a detailed scheme for mapping secure paths and improving the system’s ability to withstand faults. To validate its effectiveness, we compare Securet3d with three other fault-tolerant routing algorithms for vertically-partially connected 3D-NoCs. All algorithms were implemented in SystemVerilog and evaluated through simulation using ModelSim and hardware synthesis with Cadence’s Genus tool. Experimental results show that Securet3d reduces latency and enhances cost-effectiveness compared with other approaches. When implemented with a 28-nm technology library, Securet3d demonstrates minimal area and energy overhead, indicating scalability and efficiency. Under denial-of-service (DoS) attacks, Securet3d maintains basically unaltered average packet latencies on 70, 90, and 29 clock cycles for uniform random, bit-complement, and shuffle traffic, significantly lower than those of other algorithms without including security mechanisms (5763, 4632, and 3712 clock cycles in average, respectively). These results highlight the superior security, scalability, and adaptability of Securet3d for complex communication systems.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"275-287"},"PeriodicalIF":2.8,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
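As context for the routing problem the Securet3d abstract addresses, here is a minimal sketch of dimension-ordered routing in a vertically-partially connected 3-D mesh, where a packet may only change layers at a router that actually has a vertical (TSV) link. The `next_hop` function and the single-TSV topology are hypothetical illustrations of the setting, not the Securet3d algorithm itself.

```python
# Toy XY-then-Z routing for a 3-D mesh where only some routers have TSVs.
# A packet needing a layer change first steers in-plane toward a TSV pillar.

def next_hop(cur, dst, tsv_routers):
    """cur, dst: (x, y, z) coordinates; tsv_routers: set of (x, y) with vertical links."""
    x, y, z = cur
    dx, dy, dz = dst
    if z != dz:
        if (x, y) in tsv_routers:
            return (x, y, z + (1 if dz > z else -1))  # take the vertical link here
        # otherwise head for the nearest TSV pillar in Manhattan distance
        tx, ty = min(tsv_routers, key=lambda p: abs(p[0] - x) + abs(p[1] - y))
    else:
        tx, ty = dx, dy
    if x != tx:
        return (x + (1 if tx > x else -1), y, z)
    return (x, y + (1 if ty > y else -1), z)

# Route from layer 0 to layer 1 when only router (2, 2) carries a TSV:
hop = next_hop((0, 0, 0), (3, 3, 1), {(2, 2)})
print(hop)  # -> (1, 0, 0): move in x toward the TSV pillar first
```

Fault tolerance in this setting amounts to recomputing the target pillar when a TSV is marked faulty; secure-path schemes like Securet3d additionally constrain which pillars and links sensitive traffic is allowed to use.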
FPGA-Based Low-Bit and Lightweight Fast Light Field Depth Estimation
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-19 DOI: 10.1109/TVLSI.2024.3496751
Jie Li;Chuanlun Zhang;Wenxuan Yang;Heng Li;Xiaoyan Wang;Chuanjun Zhao;Shuangli Du;Yiguang Liu
3-D vision computing is a key application in unmanned systems, satellites, and planetary rovers. Learning-based light field (LF) depth estimation is one of the major research directions in 3-D vision computing. However, conventional learning-based depth estimation methods involve a large number of parameters and floating-point operations, making it challenging to achieve low-power, fast, and high-precision LF depth estimation on a field-programmable gate array (FPGA). Motivated by this issue, an FPGA-based low-bit, lightweight LF depth estimation network (L³FNet) is proposed. First, a hardware-friendly network is designed, which has small weight parameters, low computational load, and a simple network architecture with minor accuracy loss. Second, we apply efficient hardware unit design and software-hardware collaborative dataflow architecture to construct an FPGA-based fast, low-bit acceleration engine. Experimental results show that, compared with state-of-the-art works with lower mean-square error (mse), L³FNet can reduce the computational load by more than 109 times and weight parameters by approximately 78 times. Moreover, on the ZCU104 platform, it requires 95.65% lookup tables (LUTs), 80.67% digital signal processors (DSPs), 80.93% BlockRAM (BRAM), 58.52% LUTRAM, and 9.493-W power consumption to achieve an efficient acceleration engine with a latency as low as 272 ns. The code and model of the proposed method are available at https://github.com/sansi-zhang/L3FNet.
{"title":"FPGA-Based Low-Bit and Lightweight Fast Light Field Depth Estimation","authors":"Jie Li;Chuanlun Zhang;Wenxuan Yang;Heng Li;Xiaoyan Wang;Chuanjun Zhao;Shuangli Du;Yiguang Liu","doi":"10.1109/TVLSI.2024.3496751","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496751","url":null,"abstract":"The 3-D vision computing is a key application in unmanned systems, satellites, and planetary rovers. Learning-based light field (LF) depth estimation is one of the major research directions in 3-D vision computing. However, conventional learning-based depth estimation methods involve a large number of parameters and floating-point operations, making it challenging to achieve low-power, fast, and high-precision LF depth estimation on a field-programmable gate array (FPGA). Motivated by this issue, an FPGA-based low-bit, lightweight LF depth estimation network (L\u0000<inline-formula> <tex-math>$^{3}text {FNet}$ </tex-math></inline-formula>\u0000) is proposed. First, a hardware-friendly network is designed, which has small weight parameters, low computational load, and a simple network architecture with minor accuracy loss. Second, we apply efficient hardware unit design and software-hardware collaborative dataflow architecture to construct an FPGA-based fast, low-bit acceleration engine. Experimental results show that compared with the state-of-the-art works with lower mean-square error (mse), L\u0000<inline-formula> <tex-math>$^{3}text {FNet}$ </tex-math></inline-formula>\u0000 can reduce the computational load by more than 109 times and weight parameters by approximately 78 times. Moreover, on the ZCU104 platform, it requires 95.65% lookup tables (LUTs), 80.67% digital signal processors (DSPs), 80.93% BlockRAM (BRAM), 58.52% LUTRAM, and 9.493-W power consumption to achieve an efficient acceleration engine with a latency as low as 272 ns. The code and model of the proposed method are available at \u0000<uri>https://github.com/sansi-zhang/L3FNet</uri>\u0000.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 1","pages":"88-101"},"PeriodicalIF":2.8,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142918178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
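Low-bit acceleration of the kind L³FNet performs starts from quantized weights: int8 codes plus a shared scale, dequantized (or folded into integer MACs) at compute time. The sketch below shows a generic symmetric per-tensor int8 quantizer; the scale choice and rounding mode are common illustrations, not the paper's exact scheme.

```python
# Generic symmetric per-tensor int8 weight quantization (illustrative, not
# the L3FNet-specific scheme): store int8 codes plus one float scale.
import numpy as np

def quantize_int8(w):
    """Map float weights to int8 codes with a shared symmetric scale."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([1.0, -0.4, 0.25, 0.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(q.tolist())  # -> [127, -51, 32, 0]
# Round-trip error is bounded by half a quantization step (scale / 2).
print(float(np.max(np.abs(w - w_hat))) <= s / 2)  # -> True
```

The hardware payoff is that each weight shrinks from 32 bits to 8 (roughly the parameter reduction such designs exploit) and multiplies become integer operations, which map far more cheaply onto FPGA LUTs and DSPs than floating point.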
A 22-nm All-Digital Time-Domain Neural Network Accelerator for Precision In-Sensor Processing
IF 2.8 Zone 2 (Engineering & Technology) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-19 DOI: 10.1109/TVLSI.2024.3496090
Ahmed M. Mohey;Jelin Leslin;Gaurav Singh;Marko Kosunen;Jussi Ryynänen;Martin Andraud
Deep neural network (DNN) accelerators are increasingly integrated into sensing applications, such as wearables and sensor networks, to provide advanced in-sensor processing capabilities. Given wearables’ strict size and power requirements, minimizing the area and energy consumption of DNN accelerators is a critical concern. In that regard, computing DNN models in the time domain is a promising architecture, taking advantage of both technology scaling friendliness and efficiency. Yet, time-domain accelerators are typically not fully digital, limiting the full benefits of time-domain computation. In this work, we propose an all-digital time-domain accelerator with a small size and low energy consumption to target precision in-sensor processing like human activity recognition (HAR). The proposed accelerator features a simple and efficient architecture without dependencies on analog nonidealities such as leakage and charge errors. An eight-neuron layer (core computation layer) is implemented in 22-nm FD-SOI technology. The layer occupies 70 × 70 μm while supporting multibit inputs (8-bit) and weights (8-bit) with signed accumulation up to 18 bits. The power dissipation of the computation layer is 576 μW at a 0.72-V supply and 500-MHz clock frequency, achieving an average area efficiency of 24.74 GOPS/mm² (up to 544.22 GOPS/mm²), an average energy efficiency of 0.21 TOPS/W (up to 4.63 TOPS/W), and a normalized energy efficiency of 13.46 1b-TOPS/W (up to 296.30 1b-TOPS/W).
{"title":"A 22-nm All-Digital Time-Domain Neural Network Accelerator for Precision In-Sensor Processing","authors":"Ahmed M. Mohey;Jelin Leslin;Gaurav Singh;Marko Kosunen;Jussi Ryynänen;Martin Andraud","doi":"10.1109/TVLSI.2024.3496090","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3496090","url":null,"abstract":"Deep neural network (DNN) accelerators are increasingly integrated into sensing applications, such as wearables and sensor networks, to provide advanced in-sensor processing capabilities. Given wearables’ strict size and power requirements, minimizing the area and energy consumption of DNN accelerators is a critical concern. In that regard, computing DNN models in the time domain is a promising architecture, taking advantage of both technology scaling friendliness and efficiency. Yet, time-domain accelerators are typically not fully digital, limiting the full benefits of time-domain computation. In this work, we propose an all-digital time-domain accelerator with a small size and low energy consumption to target precision in-sensor processing like human activity recognition (HAR). The proposed accelerator features a simple and efficient architecture without dependencies on analog nonidealities such as leakage and charge errors. An eight-neuron layer (core computation layer) is implemented in 22-nm FD-SOI technology. The layer occupies \u0000<inline-formula> <tex-math>$70 times ,70,mu $ </tex-math></inline-formula>\u0000m while supporting multibit inputs (8-bit) and weights (8-bit) with signed accumulation up to 18 bits. The power dissipation of the computation layer is 576\u0000<inline-formula> <tex-math>$mu $ </tex-math></inline-formula>\u0000W at 0.72-V supply and 500-MHz clock frequency achieving an average area efficiency of 24.74 GOPS/mm2 (up to 544.22 GOPS/mm2), an average energy efficiency of 0.21 TOPS/W (up to 4.63 TOPS/W), and a normalized energy efficiency of 13.46 1b-TOPS/W (up to 296.30 1b-TOPS/W).","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 12","pages":"2220-2231"},"PeriodicalIF":2.8,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
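The figures of merit quoted in the time-domain accelerator abstract can be cross-checked against each other: with the stated layer area (70 × 70 μm) and power (576 μW), the average area efficiency (24.74 GOPS/mm²) and the average energy efficiency (0.21 TOPS/W) should imply the same raw throughput. The sketch below is pure arithmetic on the numbers quoted above, with no new data.

```python
# Consistency check of the reported FoMs: both efficiency metrics should
# imply the same throughput for the same layer.
area_mm2 = 70e-3 * 70e-3            # 70 um x 70 um = 0.0049 mm^2
power_w = 576e-6                    # 576 uW

gops_from_area = 24.74 * area_mm2             # GOPS implied by GOPS/mm^2
gops_from_energy = 0.21e12 * power_w / 1e9    # GOPS implied by TOPS/W

# Both come out near 0.121 GOPS, so the reported metrics agree.
print(round(gops_from_area, 3), round(gops_from_energy, 3))  # -> 0.121 0.121
```

The same arithmetic run against the peak numbers (544.22 GOPS/mm² and 4.63 TOPS/W) gives a correspondingly higher, and likewise mutually consistent, peak throughput.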