Content-addressable memory (CAM) is a specialized type of memory that facilitates massively parallel comparison of a search pattern against its entire content. State-of-the-art (SOTA) CAM solutions are either fast but power-hungry (NOR CAM) or slower but less power-hungry (NAND CAM). These limitations stem from the dynamic precharge operation, which leads to excessive power consumption in NOR CAMs and charge-sharing issues in NAND CAMs. In this work, we propose a precharge-free CAM (PCAM) class for energy-efficient applications. By avoiding the precharge operation, PCAM consumes less energy than a NAND CAM while achieving search speed comparable to a NOR CAM. PCAM was designed in a 65-nm CMOS technology and comprehensively evaluated through extensive Monte Carlo (MC) simulations that account for layout parasitics. When benchmarked against conventional NAND CAM, PCAM demonstrates improved search runtime (reduced by more than 30%) and 15% less search energy. Moreover, PCAM can cut energy consumption by more than 75% compared to conventional NOR CAM. We further extend our analysis to the application level, functionally evaluating the CAM designs as a fully associative cache using a CPU simulator running various benchmark workloads. This analysis confirms that PCAMs represent an optimal energy-performance design choice for associative memories and their broad spectrum of applications.
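As a purely illustrative, hedged sketch of what a CAM search does functionally (a software model only, unrelated to the NOR/NAND/PCAM circuit styles compared above): every stored word is compared against the search key in parallel, and all matching addresses are returned.

```python
# Functional software model of a CAM search (illustrative only, not the PCAM circuit).
# All stored entries are compared against the search key "in parallel"; a mask of 1s
# marks the bit positions that must match (0 = don't care), as in ternary CAMs.
def cam_search(stored_words, key, mask=None, width=8):
    mask = (1 << width) - 1 if mask is None else mask
    return [addr for addr, word in enumerate(stored_words)
            if (word ^ key) & mask == 0]          # match when no masked bit differs

# usage: cam_search([0b1010, 0b1100, 0b1010], 0b1010) -> [0, 2]
```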
{"title":"Designing Precharge-Free Energy-Efficient Content-Addressable Memories","authors":"Ramiro Taco;Esteban Garzón;Robert Hanhan;Adam Teman;Leonid Yavits;Marco Lanuzza","doi":"10.1109/TVLSI.2024.3475036","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3475036","url":null,"abstract":"Content-addressable memory (CAM) is a specialized type of memory that facilitates massively parallel comparison of a search pattern against its entire content. State-of-the-art (SOTA) CAM solutions are either fast but power-hungry (NOR CAM) or slow while consuming less power (nand CAM). These limitations stem from the dynamic precharge operation, leading to excessive power consumption in NOR CAMs and charge-sharing issues in NAND CAMs. In this work, we propose a precharge-free CAM (PCAM) class for energy-efficient applications. By avoiding precharge operation, PCAM consumes less energy than a NAND CAM, while achieving search speed comparable to a NOR CAM. PCAM was designed using a 65-nm CMOS technology and comprehensively evaluated under extensive Monte Carlo (MC) simulations while taking into account layout parasitics. When benchmarked against conventional NAND CAM, PCAM demonstrates improved search run time (reduced by more than 30%) and 15% less search energy. Moreover, PCAM can cut energy consumption by more than 75% when compared to conventional NOR CAM. We further extend our analysis to the application level, functionally evaluating the CAM designs as a fully associative cache using a CPU simulator running various benchmark workloads. This analysis confirms that PCAMs represent an optimal energy-performance design choice for associative memories and their broad spectrum of applications.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 12","pages":"2303-2314"},"PeriodicalIF":2.8,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10723100","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-17 | DOI: 10.1109/TVLSI.2024.3474183
Negar Shabanzadeh;Aarno Pärssinen;Timo Rahkonen
To optimize the linearity of mixers, one needs to recognize the origins and mixing mechanisms of the dominant nonlinearities. This article presents a time-varying (TV) nonlinear analysis, where TV polynomial models are used to yield harmonic mixing functions in four simple mixer structures. Local oscillator (LO) waveform engineering is utilized to tailor the nonlinearity coefficients, which are then fed into a numerical distortion contribution analysis tool to calculate the nonlinearity using a Volterra series-based approach. The spectra of the nonlinear voltages are plotted, followed by breakdowns of the final IM3 distortion into its individual contributions. Finally, the effect of filtering on distortion is illustrated.
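For intuition only, a hedged illustration of the kind of model referred to above (an assumed form, not taken from the article): a memoryless time-varying polynomial whose coefficients are periodic at the LO rate, driven by a two-tone excitation that produces the IM3 products mentioned in the abstract.

```latex
% Illustrative time-varying polynomial mixer model (assumed form, not from the article)
\begin{aligned}
y(t) &= a_1(t)\,x(t) + a_2(t)\,x^2(t) + a_3(t)\,x^3(t), \\
a_k(t) &= \sum_{n} a_{k,n}\, e^{j n \omega_{\mathrm{LO}} t}
  \quad \text{(coefficients periodic at the LO rate)}, \\
x(t) &= A\left[\cos(\omega_1 t) + \cos(\omega_2 t)\right]
  \;\Rightarrow\; \text{IM3 at } 2\omega_1-\omega_2,\; 2\omega_2-\omega_1
  \text{ (and their mixes with } n\omega_{\mathrm{LO}}).
\end{aligned}
```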
{"title":"A Study on Nonlinearity in Mixers Using a Time-Varying Volterra-Based Distortion Contribution Analysis Tool","authors":"Negar Shabanzadeh;Aarno Pärssinen;Timo Rahkonen","doi":"10.1109/TVLSI.2024.3474183","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3474183","url":null,"abstract":"To optimize the linearity of mixers, one needs to recognize the origins and mixing mechanisms of dominant nonlinearities. This article presents a time-varying (TV) nonlinear analysis, where TV polynomial models are used to yield harmonic mixing functions in four simple mixer structures. Local oscillator (LO) waveform engineering has been utilized to tailor the nonlinearity coefficients, which were later fed into a numerical distortion contribution analysis tool to calculate the nonlinearity using a Volterra series-based approach. The spectra of the nonlinear voltages have been drawn, followed by plots breaking the final IM3 distortion down into its contributions. The effect of filtering on distortion has been illustrated in the end.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 12","pages":"2232-2242"},"PeriodicalIF":2.8,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10721238","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ongoing integration of advanced functionalities in contemporary system-on-chips (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified with the advancement of artificial intelligence (AI), necessitating enhanced memory and interconnect bandwidth and latency. This article presents a comprehensive study encompassing architectural modifications of an interconnect-dominated many-core SoC targeting a significant increase of intermediate, on-chip cache memory bandwidth and access latency tuning. The proposed SoC has been implemented in 3-D using A10 nanosheet technology and early thermal analysis has been performed. Our workload simulations reveal, respectively, up to 12- and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC. Such speed-up comes at a 40% increase in die area and a 60% rise in power dissipation when implemented in 2-D. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to leverage the potential of 3-D technology for achieving lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in High Performance Computing (HPC) and mobile applications. Our analysis reveals that, unlike high-power-density HPC cases, the 3-D mobile case increases $T_{max}$ by only $2~^{\circ}$C–$3~^{\circ}$C compared to 2-D, while the HPC scenario requires multiconstrained efficient partitioning for 3-D implementations.
{"title":"Bandwidth-Latency-Thermal Co-Optimization of Interconnect-Dominated Many-Core 3D-IC","authors":"Sudipta Das;Samuel Riedel;Mohamed Naeim;Moritz Brunion;Marco Bertuletti;Luca Benini;Julien Ryckaert;James Myers;Dwaipayan Biswas;Dragomir Milojevic","doi":"10.1109/TVLSI.2024.3467148","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3467148","url":null,"abstract":"The ongoing integration of advanced functionalities in contemporary system-on-chips (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified with the advancement of artificial intelligence (AI), necessitating enhanced memory and interconnect bandwidth and latency. This article presents a comprehensive study encompassing architectural modifications of an interconnect-dominated many-core SoC targeting the significant increase of intermediate, on-chip cache memory bandwidth and access latency tuning. The proposed SoC has been implemented in 3-D using A10 nanosheet technology and early thermal analysis has been performed. Our workload simulations reveal, respectively, up to 12- and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC. Such speed-up comes at 40% increase in die-area and a 60% rise in power dissipation when implemented in 2-D. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to leverage the potential of 3-D technology for achieving lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in High Performance Computing (HPC) and mobile applications. Our analysis reveals that, unlike high-power density HPC cases, 3-D mobile case increases <inline-formula> <tex-math>$T_{max }$ </tex-math></inline-formula> only by <inline-formula> <tex-math>$2~^{circ } $ </tex-math></inline-formula>C–<inline-formula> <tex-math>$3~^{circ } $ </tex-math></inline-formula>C compared to 2-D, while the HPC scenario analysis requires multiconstrained efficient partitioning for 3-D implementations.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"346-357"},"PeriodicalIF":2.8,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PoolFormer is a subset of Transformer neural network with a key difference of replacing computationally demanding token mixer with pooling function. In this work, a memristor-based PoolFormer network modeling and training framework for edge-artificial intelligence (AI) applications is presented. The original PoolFormer structure is further optimized for hardware implementation on RRAM crossbar by replacing the normalization operation with scaling. In addition, the nonidealities of RRAM crossbar from device to array level as well as peripheral readout circuits are analyzed. By integrating these factors into one training framework, the overall neural network performance is evaluated holistically and the impact of nonidealities to the network performance can be effectively mitigated. Implemented in Python and PyTorch, a 16-block PoolFormer network is built with $64times 64$ four-level RRAM crossbar array model extracted from measurement results. The total number of the proposed Edge PoolFormer network parameters is 0.246 M, which is at least one order smaller than the conventional CNN implementation. This network achieved inference accuracy of 88.07% for CIFAR-10 image classification tasks with accuracy degradation of 1.5% compared to the ideal software model with FP32 precision weights.
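A minimal PyTorch sketch of a PoolFormer-style block as described above (assumed structure, not the authors' exact code): the attention token mixer is replaced by average pooling, and the normalization layers are replaced by simple learnable per-channel scaling, as the abstract describes for hardware-friendly mapping.

```python
# Hedged sketch of a PoolFormer-style block: pooling token mixer, scaling instead of norm.
import torch
import torch.nn as nn

class ChannelScale(nn.Module):
    """Learnable per-channel scale used in place of normalization (assumed form)."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):                          # x: (B, C, H, W)
        return x * self.scale.view(1, -1, 1, 1)

class PoolFormerBlock(nn.Module):
    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.scale1 = ChannelScale(dim)
        # pooling token mixer: local average pooling minus identity
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)
        self.scale2 = ChannelScale(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, kernel_size=1),
        )

    def forward(self, x):
        y = self.scale1(x)
        x = x + (self.pool(y) - y)                 # token mixing via pooling
        x = x + self.mlp(self.scale2(x))           # channel MLP
        return x

# usage: out = PoolFormerBlock(dim=64)(torch.randn(1, 64, 8, 8))
```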
{"title":"Edge PoolFormer: Modeling and Training of PoolFormer Network on RRAM Crossbar for Edge-AI Applications","authors":"Tiancheng Cao;Weihao Yu;Yuan Gao;Chen Liu;Tantan Zhang;Shuicheng Yan;Wang Ling Goh","doi":"10.1109/TVLSI.2024.3472270","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3472270","url":null,"abstract":"PoolFormer is a subset of Transformer neural network with a key difference of replacing computationally demanding token mixer with pooling function. In this work, a memristor-based PoolFormer network modeling and training framework for edge-artificial intelligence (AI) applications is presented. The original PoolFormer structure is further optimized for hardware implementation on RRAM crossbar by replacing the normalization operation with scaling. In addition, the nonidealities of RRAM crossbar from device to array level as well as peripheral readout circuits are analyzed. By integrating these factors into one training framework, the overall neural network performance is evaluated holistically and the impact of nonidealities to the network performance can be effectively mitigated. Implemented in Python and PyTorch, a 16-block PoolFormer network is built with <inline-formula> <tex-math>$64times 64$ </tex-math></inline-formula> four-level RRAM crossbar array model extracted from measurement results. The total number of the proposed Edge PoolFormer network parameters is 0.246 M, which is at least one order smaller than the conventional CNN implementation. This network achieved inference accuracy of 88.07% for CIFAR-10 image classification tasks with accuracy degradation of 1.5% compared to the ideal software model with FP32 precision weights.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"384-394"},"PeriodicalIF":2.8,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-15 | DOI: 10.1109/TVLSI.2024.3466474
Chao Ji;Xiaohu You;Chuan Zhang;Christoph Studer
Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences with only slight frame error rate (FER) performance degradation. Different from the logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity and hardware-friendly method called shift LWO (sLWO), whose shift factor can be chosen empirically to trade off FER performance against query complexity. To generate noise sequences with sLWO effectively, we utilize a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we use synthesis results evaluated on polar codes in a 65-nm CMOS technology. While maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gbps average throughput ($1.26\times$ higher), 4.2-Mbps worst case throughput ($8.24\times$ higher), 2.4-Mbps/mm² worst case area efficiency ($12\times$ higher), and $4.66\times 10^{4}$ pJ/bit worst case energy efficiency ($9.96\times$ lower) compared with the synthesized ORBGRAND design with LWO for a (128, 105) polar code, and also provide $8.62\times$ higher average throughput and $9.4\times$ higher average area efficiency but $7.51\times$ worse average energy efficiency than the ORBGRAND chip for a (256, 240) polar code, at a target FER of $10^{-7}$.
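For intuition only, the following hedged Python sketch shows a basic ORBGRAND-style querying loop with logistic-weight-ordered (LWO) noise sequences; the sLWO refinement, its shift factor, and the LUT-aided hardware strategy described above are not reproduced here, and `is_codeword` is a placeholder for the code's membership check.

```python
# Hedged sketch of ORBGRAND-style querying (illustrative only, not the paper's design).
# Noise patterns are applied to the reliability-sorted hard decision in increasing
# logistic weight, i.e., increasing sum of the (1-indexed) flipped positions.
from itertools import combinations

def orbgrand_query(hard_bits, reliability_order, is_codeword, max_weight=3, max_lw=40):
    """hard_bits: list of 0/1; reliability_order: indices sorted least-reliable first;
    is_codeword: placeholder membership test for the linear block code."""
    patterns = []
    for w in range(0, max_weight + 1):                 # Hamming weight of the flip pattern
        for pos in combinations(range(1, len(hard_bits) + 1), w):
            lw = sum(pos)                              # logistic weight of the pattern
            if lw <= max_lw:
                patterns.append((lw, pos))
    patterns.sort()                                    # query in increasing logistic weight
    for _, pos in patterns:
        cand = hard_bits[:]
        for p in pos:                                  # flip the p-th least reliable bit
            i = reliability_order[p - 1]
            cand[i] ^= 1
        if is_codeword(cand):
            return cand                                # first hit is the decoding guess
    return None                                        # abandon after the query budget
```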
{"title":"Efficient ORBGRAND Implementation With Parallel Noise Sequence Generation","authors":"Chao Ji;Xiaohu You;Chuan Zhang;Christoph Studer","doi":"10.1109/TVLSI.2024.3466474","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3466474","url":null,"abstract":"Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences with slight frame error rate (FER) performance degradation. Different from logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity and hardware-friendly method called shift LWO (sLWO), of which the shift factor can be chosen empirically to trade the FER performance and query complexity well. To effectively generate noise sequences with sLWO, we utilize a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we use synthesis results evaluated on polar codes in a 65-nm CMOS technology. While maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gbps average throughput (<inline-formula> <tex-math>$1.26times $ </tex-math></inline-formula> higher), 4.2-Mbps worst case throughput (<inline-formula> <tex-math>$8.24times $ </tex-math></inline-formula> higher), 2.4-Mbps/mm2 worst case area efficiency (<inline-formula> <tex-math>$12times $ </tex-math></inline-formula> higher), and <inline-formula> <tex-math>$4.66times 10 ^{{4}}$ </tex-math></inline-formula> pJ/bit worst case energy efficiency (<inline-formula> <tex-math>$9.96times $ </tex-math></inline-formula> lower) compared with the synthesized ORBGRAND design with LWO for a (128, 105) polar code and also provide <inline-formula> <tex-math>$8.62times $ </tex-math></inline-formula> higher average throughput and <inline-formula> <tex-math>$9.4times $ </tex-math></inline-formula> higher average area efficiency but <inline-formula> <tex-math>$7.51times $ </tex-math></inline-formula> worse average energy efficiency than the ORBGRAND chip for a (256, 240) polar code, at a target FER of <inline-formula> <tex-math>$10^{-7}$ </tex-math></inline-formula>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"435-448"},"PeriodicalIF":2.8,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-14 | DOI: 10.1109/TVLSI.2024.3470834
Huarun Chen;Yijun Liu;Wujian Ye;Jialiang Ye;Yuehai Chen;Shaozhen Chen;Chao Han
Most existing methods for traffic sign recognition exploit deep learning technology such as convolutional neural networks (CNNs) to achieve breakthroughs in detection accuracy; however, due to the large number of CNN parameters, practical applications suffer from high power consumption, heavy computation, and slow speed. Compared with CNNs, a spiking neural network (SNN) can effectively simulate the information processing mechanism of the biological brain, with stronger parallel processing capability, better sparsity, and real-time performance. Thus, we design and realize a novel traffic sign recognition system [called SNN on FPGA-traffic sign recognition system (SFPGA-TSRS)] based on a spiking CNN (SCNN) and an FPGA platform. Specifically, to improve the recognition accuracy, a traffic sign recognition model, spatial attention SCNN (SA-SCNN), is proposed by combining an LIF/IF-neuron-based SCNN with an SA mechanism; and to accelerate model inference, a high-performance neuron module is implemented, and an input coding module is designed as the input layer of the recognition model. The experiments show that, compared with existing systems, the proposed SFPGA-TSRS efficiently supports the deployment of SCNN models, with a higher recognition accuracy of 99.22%, a faster frame rate of 66.38 frames per second (FPS), and lower power consumption of 1.423 W on the GTSRB dataset.
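As a rough, hedged illustration of the leaky integrate-and-fire (LIF) neuron dynamics mentioned above (parameters and shapes are placeholders, not taken from the paper or its FPGA neuron module):

```python
# Hedged sketch of a discrete-time LIF neuron layer update, illustrative only.
import numpy as np

def lif_step(v, spikes_in, weights, leak=0.9, v_thresh=1.0, v_reset=0.0):
    """One time step for a layer of LIF neurons.
    v: membrane potentials (N,), spikes_in: binary input spikes (M,),
    weights: synaptic matrix (N, M)."""
    v = leak * v + weights @ spikes_in            # leak, then integrate weighted spikes
    spikes_out = (v >= v_thresh).astype(float)    # fire when the threshold is crossed
    v = np.where(spikes_out > 0, v_reset, v)      # reset the neurons that fired
    return v, spikes_out

# usage: v, s = lif_step(np.zeros(4), np.array([1., 0., 1.]), np.random.rand(4, 3))
```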
{"title":"Research on Hardware Acceleration of Traffic Sign Recognition Based on Spiking Neural Network and FPGA Platform","authors":"Huarun Chen;Yijun Liu;Wujian Ye;Jialiang Ye;Yuehai Chen;Shaozhen Chen;Chao Han","doi":"10.1109/TVLSI.2024.3470834","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3470834","url":null,"abstract":"Most of the existing methods for traffic sign recognition exploited deep learning technology such as convolutional neural networks (CNNs) to achieve a breakthrough in detection accuracy; however, due to the large number of CNN’s parameters, there are problems in practical applications such as high power consumption, large calculation, and slow speed. Compared with CNN, a spiking neural network (SNN) can effectively simulate the information processing mechanism of biological brain, with stronger parallel processing capability, better sparsity, and real-time performance. Thus, we design and realize a novel traffic sign recognition system [called SNN on FPGA-traffic sign recognition system (SFPGA-TSRS)] based on spiking CNN (SCNN) and FPGA platform. Specifically, to improve the recognition accuracy, a traffic sign recognition model spatial attention SCNN (SA-SCNN) is proposed by combining LIF/IF neurons based SCNN with SA mechanism; and to accelerate the model inference, a neuron module is implemented with high performance, and an input coding module is designed as the input layer of the recognition model. The experiments show that compared with existing systems, the proposed SFPGA-TSRS can efficiently support the deployment of SCNN models, with a higher recognition accuracy of 99.22%, a faster frame rate of 66.38 frames per second (FPS), and lower power consumption of 1.423 W on the GTSRB dataset.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"499-511"},"PeriodicalIF":2.8,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-14 | DOI: 10.1109/TVLSI.2024.3471496
Gauthaman Murali;Min Gyu Park;Sung Kyu Lim
This article presents 3DNN-Xplorer, the first machine learning (ML)-based framework for predicting the performance of heterogeneous 3-D deep neural network (DNN) accelerators. Our ML framework facilitates the design space exploration (DSE) of heterogeneous 3-D accelerators with a two-tier compute-on-memory (CoM) configuration, considering 3-D physical design factors. Our design space encompasses four distinct heterogeneous 3-D integration styles, combining 28- and 16-nm technology nodes for both compute and memory tiers. Using extrapolation techniques with ML models trained on 10-to-256 processing element (PE) accelerator configurations, we estimate the performance of systems featuring 75–16384 PEs, achieving a maximum absolute error of 13.9% (the number of PEs is not continuous and varies based on the accelerator architecture). To ensure balanced tier areas in the design, our framework assumes the same number of PEs or on-chip memory capacity across the four integration styles, accounting for area imbalance resulting from different technology nodes. Our analysis reveals that the heterogeneous 3-D style with 28-nm compute and 16-nm memory is energy-efficient and offers notable energy savings of up to 50% and an 8.8% reduction in runtime compared to other 3-D integration styles with the same number of PEs. Similarly, the heterogeneous 3-D style with 16-nm compute and 28-nm memory is area-efficient and shows up to 8.3% runtime reduction compared to other 3-D styles with the same on-chip memory capacity.
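Purely as a hedged illustration of the extrapolation idea (not the authors' ML models, features, or data), one could fit a simple scaling trend on small-PE configurations and query it at larger PE counts:

```python
# Illustrative only: fit a log-log trend on small accelerator configs and extrapolate.
# The feature/target values below are placeholders, not data from 3DNN-Xplorer.
import numpy as np

pes     = np.array([10, 16, 32, 64, 128, 256])            # trained configuration sizes
runtime = np.array([9.1, 6.0, 3.4, 1.9, 1.1, 0.65])       # placeholder runtimes (a.u.)

coeffs = np.polyfit(np.log(pes), np.log(runtime), deg=1)  # log-log linear fit
predict = lambda n: np.exp(np.polyval(coeffs, np.log(n)))

print(predict(1024))   # extrapolated runtime estimate for a 1024-PE system (a.u.)
```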
{"title":"3DNN-Xplorer: A Machine Learning Framework for Design Space Exploration of Heterogeneous 3-D DNN Accelerators","authors":"Gauthaman Murali;Min Gyu Park;Sung Kyu Lim","doi":"10.1109/TVLSI.2024.3471496","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3471496","url":null,"abstract":"This article presents 3DNN-Xplorer, the first machine learning (ML)-based framework for predicting the performance of heterogeneous 3-D deep neural network (DNN) accelerators. Our ML framework facilitates the design space exploration (DSE) of heterogeneous 3-D accelerators with a two-tier compute-on-memory (CoM) configuration, considering 3-D physical design factors. Our design space encompasses four distinct heterogeneous 3-D integration styles, combining 28- and 16-nm technology nodes for both compute and memory tiers. Using extrapolation techniques with ML models trained on 10-to-256 processing element (PE) accelerator configurations, we estimate the performance of systems featuring 75–16384 PEs, achieving a maximum absolute error of 13.9% (the number of PEs is not continuous and varies based on the accelerator architecture). To ensure balanced tier areas in the design, our framework assumes the same number of PEs or on-chip memory capacity across the four integration styles, accounting for area imbalance resulting from different technology nodes. Our analysis reveals that the heterogeneous 3-D style with 28-nm compute and 16-nm memory is energy-efficient and offers notable energy savings of up to 50% and an 8.8% reduction in runtime compared to other 3-D integration styles with the same number of PEs. Similarly, the heterogeneous 3-D style with 16-nm compute and 28-nm memory is area-efficient and shows up to 8.3% runtime reduction compared to other 3-D styles with the same on-chip memory capacity.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"358-370"},"PeriodicalIF":2.8,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-09 | DOI: 10.1109/TVLSI.2024.3465010
Zihang Wang;Yushu Yang;Jianfei Wang;Jia Hou;Yang Su;Chen Yang
Polynomial multiplication is a significant bottleneck in mainstream postquantum cryptography (PQC) schemes. To speed it up, the number theoretic transform (NTT) is widely used, which decreases the time complexity from $O(n^{2})$ to $O[n\log_{2}(n)]$. However, it is challenging to ensure optimal hardware efficiency in conjunction with scalability. This brief proposes a novel pipelined NTT/inverse-NTT (INTT) architecture on field-programmable gate array (FPGA). A group-based pairwise memory access (GPMA) scheme is proposed, and a scratchpad and reordering unit (SRU) is designed to form an efficient dataflow that simplifies control units and achieves almost $n/2$ processing cycles on average for an $n$-point NTT. Moreover, our architecture can support varying parameters. Compared to the state-of-the-art works, our architecture achieves up to $4.8\times$ latency improvements and up to $4.3\times$ improvements on area time product (ATP).
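For reference, a minimal software sketch of a textbook iterative radix-2 NTT with the $O[n\log_{2}(n)]$ complexity noted above (generic form, unrelated to the GPMA/SRU hardware dataflow proposed in the brief); it assumes $q \equiv 1 \pmod{n}$ and that `root` is a primitive root of $\mathbb{Z}_q$.

```python
# Generic iterative Cooley-Tukey NTT over Z_q (illustrative, not the proposed hardware).
def ntt(a, q, root):
    n = len(a)                                   # n must be a power of two, q = 1 (mod n)
    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(root, (q - 1) // length, q)  # primitive length-th root of unity mod q
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % q
                a[k] = (u + v) % q
                a[k + length // 2] = (u - v) % q
                w = w * w_len % q
        length <<= 1
    return a

# usage: ntt([1, 2, 3, 4, 0, 0, 0, 0], q=17, root=3)   # 3 is a primitive root mod 17
```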
{"title":"A Scalable and Efficient NTT/INTT Architecture Using Group-Based Pairwise Memory Access and Fast Interstage Reordering","authors":"Zihang Wang;Yushu Yang;Jianfei Wang;Jia Hou;Yang Su;Chen Yang","doi":"10.1109/TVLSI.2024.3465010","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3465010","url":null,"abstract":"Polynomial multiplication is a significant bottleneck in mainstream postquantum cryptography (PQC) schemes. To speed it up, number theoretic transform (NTT) is widely used, which decreases the time complexity from <inline-formula> <tex-math>${O}(n^{2})$ </tex-math></inline-formula> to <inline-formula> <tex-math>$O[nlog _{2}(n)]$ </tex-math></inline-formula>. However, it is challenging to ensure optimal hardware efficiency in conjunction with scalability. This brief proposes a novel pipelined NTT/inverse-NTT (INTT) architecture on field-programmable gate array (FPGA). A group-based pairwise memory access (GPMA) scheme is proposed, and a scratchpad and reordering unit (SRU) is designed to form an efficient dataflow that simplifies control units and achieves almost <inline-formula> <tex-math>$n/2$ </tex-math></inline-formula> processing cycles on average for n-point NTT. Moreover, our architecture can support varying parameters. Compared to the state-of-the-art works, our architecture achieves up to <inline-formula> <tex-math>$4.8times $ </tex-math></inline-formula> latency improvements and up to <inline-formula> <tex-math>$4.3times $ </tex-math></inline-formula> improvements on area time product (ATP).","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"588-592"},"PeriodicalIF":2.8,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-09 | DOI: 10.1109/TVLSI.2024.3471528
Ebenezer C. Usih;Naimul Hassan;Alexander J. Edwards;Felipe Garcia-Sanchez;Pedram Khalili Amiri;Joseph S. Friedman
Toggle spin-orbit torque (SOT)-driven magnetoresistive random access memory (MRAM) with perpendicular anisotropy has a simple material stack and is more robust than directional SOT-MRAM. However, a read-before-write operation is required to use toggle SOT-MRAM for directional switching, which threatens to increase the write delay. To resolve these issues, we propose a high-speed memory architecture for toggle SOT-MRAM that includes a minimum-sized bit cell and a custom read-write driver. The proposed driver produces a self-terminating SOT current through an analog feedback mechanism, allowing the toggle SOT-MRAM bit cell to be read and written within a single clock cycle. As the read and write operations complete within 570 ps, this memory architecture provides the first viable solution for nonvolatile L3 cache.
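To make explicit why directional writes on a toggle cell normally need a read first (the step the self-terminating driver above folds into its analog feedback), here is a hedged behavioral sketch; the cell model is a placeholder, not the proposed circuit.

```python
# Behavioral sketch only: conventional read-before-write on a toggle MRAM cell.
# Each toggle pulse inverts the stored bit, so a directional write must first read
# the cell and pulse only if the stored value differs from the desired one.
class ToggleCell:
    def __init__(self, bit=0):
        self.bit = bit
    def read(self):
        return self.bit
    def toggle_pulse(self):           # one SOT pulse flips the stored state
        self.bit ^= 1

def directional_write(cell, desired):
    if cell.read() != desired:        # the extra read adds to the write latency
        cell.toggle_pulse()
    return cell.read()

# usage: directional_write(ToggleCell(0), 1)  -> 1
```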
{"title":"Toggle SOT-MRAM Architecture With Self-Terminating Write Operation","authors":"Ebenezer C. Usih;Naimul Hassan;Alexander J. Edwards;Felipe Garcia-Sanchez;Pedram Khalili Amiri;Joseph S. Friedman","doi":"10.1109/TVLSI.2024.3471528","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3471528","url":null,"abstract":"Toggle spin-orbit torque (SOT)-driven magnetoresistive random access memory (MRAM) with perpendicular anisotropy has a simple material stack and is more robust than directional SOT-MRAM. However, a read-before-write operation is required to use the toggle SOT-MRAM for directional switching, which threatens to increase the write delay. To resolve these issues, we propose a high-speed memory architecture for toggle SOT-MRAM that includes a minimum-sized bit cell and a custom read-write driver. The proposed driver induces an analog self-terminating SOT current that functions via an analog feedback mechanism that can read and write the toggle SOT-MRAM bit cell within a single clock cycle. As the read and write operations are completed within 570 ps, this memory architecture provides the first viable solution for nonvolatile L3 cache.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"337-345"},"PeriodicalIF":2.8,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-09 | DOI: 10.1109/TVLSI.2024.3470342
Anshaj Shrivastava;Gaurab Banerjee
This study presents a compact, on-chip analog probe module (APM) to augment the IEEE 1149.4 (or P1687.2) standard for efficiently probing multiple internal nodes. The complete approach to APM implementation, from conceptualization to practical application, is discussed in detail. The APM aims to utilize a minimum area for a maximum number of probe channels, achieving an optimal size of 4:15. At the transistor level, the design minimizes the impact of glitches in asynchronous operations through a symmetrical layout and a unique arrangement of all probe channels. However, glitches in asynchronous circuits can still exist; hence, a state transition matrix (STM) concept is devised. STMs help visualize hazardous transitions, allowing the identification of a common hazard-free transition sequence suitable for hardware implementation. The verified APM design is integrated with several analog/RF circuits fabricated in a commercially available 0.18-$\mu$m