This brief presents an on-chip, real-time rotation calibration (RRC) technique aimed at alleviating inter-channel offset mismatch in time-interleaved (TI) successive-approximation register (SAR) analog-to-digital converters (ADCs). By leveraging auto-rotation calibration and self-compensation strategies in the analog domain, the proposed technique demonstrates robust performance across PVT variations. Two additional sub-channels are involved in the TI quantization mechanism, where the continuous rotation of the sampling-clock distribution ensures their operation in calibration mode. To validate the effectiveness of the proposed calibration, an $8\times 8$-bit, 8-GS/s TI-SAR ADC is designed and implemented in a 28-nm process; it occupies an active area of 0.273 mm², with each sub-channel SAR ADC covering only $86\times 23~\mu$m. Extensive simulation results validate the efficacy of RRC, demonstrating significant improvements in dynamic performance: at the Nyquist input frequency, SNDR increases from 37.1 to 45.4 dB, while SFDR rises from 57.8 to 60.7 dB.
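The rotation idea can be pictured in a few lines of code. The toy model below (class name, channel counts, and averaging scheme are illustrative assumptions, not the brief's circuit) has spare sub-channels cycle through a calibration slot, where each channel samples a zeroed input so that what it reads is its own offset, which is then stored and subtracted during normal conversions:

```python
import random

class RotatingTIADC:
    """Toy model of real-time rotation calibration (RRC): with 8 active
    sub-ADCs plus 2 spares, the sampling-clock assignment rotates so every
    channel periodically drops into calibration mode, where its input is
    zeroed and its residual offset is measured for later subtraction."""

    def __init__(self, n_active=8, n_spare=2, seed=1):
        rng = random.Random(seed)
        self.n = n_active + n_spare
        self.true_offset = [rng.uniform(-0.05, 0.05) for _ in range(self.n)]
        self.est_offset = [0.0] * self.n
        self.ptr = 0  # rotation pointer: which channel enters calibration next

    def calibrate_step(self):
        ch = self.ptr % self.n
        # the channel samples a zero input; what it reads IS its offset
        self.est_offset[ch] = self.true_offset[ch]
        self.ptr += 1

    def convert(self, ch, vin):
        # offset-corrected conversion for an active channel
        return vin + self.true_offset[ch] - self.est_offset[ch]

adc = RotatingTIADC()
for _ in range(adc.n):          # one full rotation calibrates every channel
    adc.calibrate_step()
residual = max(abs(adc.convert(c, 0.3) - 0.3) for c in range(adc.n))
```

After one full rotation, every channel's offset estimate is populated and the residual inter-channel mismatch collapses to numerical noise.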
"A Real-Time Rotation Calibration for Interchannel Offset Mismatch in Time-Interleaved SAR ADCs," Yixiao Luo, Hongzhi Liang, Zeyu Peng, Yukui Yu, Shubin Liu, Ruixue Ding, and Zhangming Zhu, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 897–901, published 2024-10-16, DOI: 10.1109/TVLSI.2024.3472095.
PoolFormer is a variant of the Transformer neural network whose key difference is replacing the computationally demanding token mixer with a pooling function. In this work, a memristor-based PoolFormer network modeling and training framework for edge artificial intelligence (AI) applications is presented. The original PoolFormer structure is further optimized for hardware implementation on an RRAM crossbar by replacing the normalization operation with scaling. In addition, the nonidealities of the RRAM crossbar, from the device level to the array level, as well as the peripheral readout circuits, are analyzed. By integrating these factors into one training framework, the overall neural network performance is evaluated holistically, and the impact of nonidealities on network performance can be effectively mitigated. Implemented in Python and PyTorch, a 16-block PoolFormer network is built with a $64\times 64$ four-level RRAM crossbar array model extracted from measurement results. The total number of parameters in the proposed Edge PoolFormer network is 0.246 M, at least one order of magnitude smaller than conventional CNN implementations. This network achieves an inference accuracy of 88.07% on the CIFAR-10 image classification task, an accuracy degradation of 1.5% compared with the ideal software model with FP32-precision weights.
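The pooling token mixer is simple enough to sketch directly. The following illustrative Python operates on a 1-D sequence of scalar tokens (real models pool over 2-D feature maps, and the window size here is an assumption): average pooling with the input subtracted, a common formulation so that the surrounding residual connection reduces to pure pooling:

```python
def pool_token_mixer(tokens, pool_size=3):
    """PoolFormer-style token mixer: replace attention with average pooling
    over neighboring tokens, minus the identity (the residual branch adds
    the input back afterward). 1-D scalar-token sketch for illustration."""
    n, half = len(tokens), pool_size // 2
    mixed = []
    for i in range(n):
        # 'same'-style padding by truncating the window at the edges
        window = tokens[max(0, i - half): i + half + 1]
        mixed.append(sum(window) / len(window) - tokens[i])
    return mixed

out = pool_token_mixer([1.0, 2.0, 3.0, 4.0])  # → [0.5, 0.0, 0.0, -0.5]
```

Note that the mixer is parameter-free, which is precisely why the overall network is so much smaller than a CNN or attention-based model of similar depth.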
"Edge PoolFormer: Modeling and Training of PoolFormer Network on RRAM Crossbar for Edge-AI Applications," Tiancheng Cao, Weihao Yu, Yuan Gao, Chen Liu, Tantan Zhang, Shuicheng Yan, and Wang Ling Goh, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 384–394, published 2024-10-15, DOI: 10.1109/TVLSI.2024.3472270.
Pub Date: 2024-10-15. DOI: 10.1109/TVLSI.2024.3472073
Jonghyun Oh;Kwanseo Park;Young-Ha Hwang
This brief presents an energy-efficient transceiver supporting a 10-Gb/s/lane display link interface between the application processor (AP) IC and the timing controller (TCON)-embedded source driver IC for mobile applications. An embedded clocking scheme is adopted to save clock-distribution power, which also reduces the required number of off-chip I/O channels. A transmitter (TX) sends 20-Gb/s aggregate data through two differential data lanes, and a receiver recovers a 5-GHz half-rate clock. The TX employs a latch-less serializer using divided clocks in staggered phases, achieving an energy efficiency of 0.43 pJ/b/lane. In the RX, a hybrid clock and data recovery (CDR) circuit tracks the half data rate with a digital loop filter (DLF) and subsequently locks the frequency and phase with an analog loop filter (ALF). By deactivating the DLF and the edge deserializer once a coarse frequency lock is acquired, the RX achieves an energy efficiency of 0.53 pJ/b/lane. The prototype transceiver, fabricated in a 28-nm CMOS technology, occupies an active area of 0.196 mm² and achieves an energy efficiency of 1.23 pJ/b/lane, including a charge-pump phase-locked loop (CP-PLL) with clock distribution.
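The phase tracking that a digital CDR loop filter performs can be sketched abstractly. The toy loop below (step size, phase units, and proportional-only control are assumptions for illustration; the paper's hybrid hand-off from DLF to ALF after frequency lock is not modeled) shows bang-bang early/late decisions walking the sampling phase onto the data and then dithering around it:

```python
def bang_bang_track(target_phase, steps=200, kp=0.01):
    """Toy bang-bang CDR phase loop: each cycle, the sampler reports only
    whether the recovered clock is early or late, and the loop moves the
    phase one fixed step toward the data. After acquisition, the phase
    limit-cycles within one step of the target."""
    phase = 0.0
    for _ in range(steps):
        err = 1 if phase < target_phase else -1  # early/late decision only
        phase += kp * err                        # proportional (bang-bang) update
    return phase

final = bang_bang_track(0.3)
```

The residual dither of roughly one step size is why practical designs, like the one above, pair the coarse digital loop with a finer analog loop for the final lock.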
"A 10-Gb/s/lane, Energy-Efficient Transceiver With Reference-Less Hybrid CDR for Mobile Display Link Interfaces," Jonghyun Oh, Kwanseo Park, and Young-Ha Hwang, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 887–891, published 2024-10-15, DOI: 10.1109/TVLSI.2024.3472073.
Pub Date: 2024-10-15. DOI: 10.1109/TVLSI.2024.3466474
Chao Ji;Xiaohu You;Chuan Zhang;Christoph Studer
Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences, with only slight frame error rate (FER) degradation. Unlike the logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity, hardware-friendly method called shift LWO (sLWO), whose shift factor can be chosen empirically to trade off FER performance against query complexity. To generate noise sequences with sLWO efficiently, we utilize a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we present synthesis results for polar codes in a 65-nm CMOS technology. While maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gb/s average throughput ($1.26\times$ higher), 4.2-Mb/s worst-case throughput ($8.24\times$ higher), 2.4-Mb/s/mm² worst-case area efficiency ($12\times$ higher), and $4.66\times 10^{4}$-pJ/bit worst-case energy efficiency ($9.96\times$ lower) compared with the synthesized ORBGRAND design with LWO for a (128, 105) polar code, and also provide $8.62\times$ higher average throughput and $9.4\times$ higher average area efficiency, but $7.51\times$ worse average energy efficiency, than the ORBGRAND chip for a (256, 240) polar code, at a target FER of $10^{-7}$.
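The query order behind ORBGRAND is concrete enough to sketch. With bit positions sorted least-reliable-first, logistic weight order tests error patterns in increasing order of the sum of their (1-indexed) flipped positions. The minimal recursive enumerator below is for illustration only; the paper's sLWO rescaling and LUT-aided parallel hardware generation are not modeled:

```python
def lwo_patterns(n, max_weight):
    """Enumerate ORBGRAND test error patterns in logistic weight order (LWO):
    a pattern is a set of distinct flipped positions in {1..n}, and its
    logistic weight is the sum of those positions. Patterns are yielded
    with nondecreasing weight."""
    def subsets_with_sum(target, start):
        if target == 0:
            yield []
            return
        for p in range(start, min(target, n) + 1):
            for rest in subsets_with_sum(target - p, p + 1):
                yield [p] + rest

    for w in range(1, max_weight + 1):
        for pattern in subsets_with_sum(w, 1):
            yield pattern

pats = list(lwo_patterns(8, 4))  # → [[1], [2], [1, 2], [3], [1, 3], [4]]
```

Each yielded pattern is one "query": the decoder flips those positions in the hard-decision word and checks whether the result is a codeword, stopping at the first success.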
"Efficient ORBGRAND Implementation With Parallel Noise Sequence Generation," Chao Ji, Xiaohu You, Chuan Zhang, and Christoph Studer, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 435–448, published 2024-10-15, DOI: 10.1109/TVLSI.2024.3466474.
Electromagnetic interference (EMI) is an inevitable issue in power electronics applications. Although many solutions have been presented to attenuate EMI noise, there is still little research on EMI suppression schemes for multimode primary-side regulation (PSR) flyback converters. Targeting EMI regulation in multimode PSR flyback converters, a combo EMI suppression scheme comprising frequency modulation and a dual-slope gate driver is adopted to meet stringent EMI requirements, simplifying the peripheral components and the design of the EMI filter. The proposed scheme is implemented in a 0.18-$\mu$m 5-V/40-V BCD process and occupies a die size (with pads) of $1.05\times 0.8$ mm². The experimental results show that the conducted EMI waveforms with line/neutral polarity easily comply with regulations. The deviations of the output voltage are within ±1.3% under different inputs and loads, while a peak efficiency of 90% is achieved.
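Frequency modulation for EMI suppression spreads the switching energy over a band instead of concentrating it at harmonics of one fixed frequency. The sketch below generates such a dithered switching-period sequence; the center frequency, spread ratio, and triangular profile are illustrative assumptions, not the paper's values:

```python
def dithered_periods(f0=65e3, spread=0.08, steps=32):
    """Spread-spectrum frequency modulation for a switching converter:
    sweep the switching frequency with a triangular profile within
    +/- spread around f0, so harmonic energy that would pile up at
    multiples of a fixed f0 is smeared across a band. Returns one
    modulation cycle of switching periods."""
    periods = []
    for k in range(steps):
        tri = 2 * abs(k / steps - 0.5) - 0.5   # triangle wave in [-0.5, 0.5]
        f = f0 * (1 + 2 * spread * tri)        # instantaneous frequency
        periods.append(1.0 / f)
    return periods

ps = dithered_periods()
f_min, f_max = 1 / max(ps), 1 / min(ps)        # swept band edges
```

A measuring receiver with a fixed resolution bandwidth then sees a lower peak level at each harmonic, which is what relaxes the EMI filter design.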
"A Combo EMI Suppression Scheme for Multimode PSR Flyback Converter," Yongyuan Li, Zhuliang Li, Wei Guo, Qiang Wu, Yongbo Zhang, Yong You, and Zhangming Zhu, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 892–896, published 2024-10-14, DOI: 10.1109/TVLSI.2024.3470837.
Pub Date: 2024-10-14. DOI: 10.1109/TVLSI.2024.3470834
Huarun Chen;Yijun Liu;Wujian Ye;Jialiang Ye;Yuehai Chen;Shaozhen Chen;Chao Han
Most existing methods for traffic sign recognition exploit deep learning techniques such as convolutional neural networks (CNNs) to achieve breakthroughs in detection accuracy; however, owing to the large number of CNN parameters, practical applications suffer from high power consumption, heavy computation, and slow speed. Compared with CNNs, a spiking neural network (SNN) can effectively simulate the information-processing mechanism of the biological brain, with stronger parallel-processing capability, better sparsity, and better real-time performance. Thus, we design and realize a novel traffic sign recognition system, the SNN-on-FPGA traffic sign recognition system (SFPGA-TSRS), based on a spiking CNN (SCNN) and an FPGA platform. Specifically, to improve recognition accuracy, a traffic sign recognition model, spatial-attention SCNN (SA-SCNN), is proposed by combining an LIF/IF-neuron-based SCNN with a spatial attention (SA) mechanism; and to accelerate model inference, a high-performance neuron module is implemented, and an input coding module is designed as the input layer of the recognition model. The experiments show that, compared with existing systems, the proposed SFPGA-TSRS efficiently supports the deployment of SCNN models, with a higher recognition accuracy of 99.22%, a faster frame rate of 66.38 frames per second (FPS), and a lower power consumption of 1.423 W on the GTSRB dataset.
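The LIF neuron mentioned above is the basic computational unit of an SCNN. A minimal discrete-time sketch follows; the leak factor, threshold, and hard-reset behavior are chosen for illustration and will differ from the paper's hardware neuron module:

```python
def lif_neuron(inputs, leak=0.9, threshold=1.0):
    """Minimal leaky integrate-and-fire (LIF) neuron: each time step the
    membrane potential leaks, integrates the input current, and emits a
    spike (with a hard reset to 0) when it crosses the threshold."""
    v, spikes = 0.0, []
    for i in inputs:
        v = leak * v + i          # leak, then integrate
        if v >= threshold:
            spikes.append(1)
            v = 0.0               # hard reset after spiking
        else:
            spikes.append(0)
    return spikes

out = lif_neuron([0.6, 0.6, 0.6, 0.0, 1.2])  # → [0, 1, 0, 0, 1]
```

Because activity is carried by sparse binary spikes rather than dense multiply-accumulates, this style of computation maps naturally onto low-power FPGA logic.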
"Research on Hardware Acceleration of Traffic Sign Recognition Based on Spiking Neural Network and FPGA Platform," Huarun Chen, Yijun Liu, Wujian Ye, Jialiang Ye, Yuehai Chen, Shaozhen Chen, and Chao Han, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 499–511, published 2024-10-14, DOI: 10.1109/TVLSI.2024.3470834.
Pub Date: 2024-10-14. DOI: 10.1109/TVLSI.2024.3471496
Gauthaman Murali;Min Gyu Park;Sung Kyu Lim
This article presents 3DNN-Xplorer, the first machine learning (ML)-based framework for predicting the performance of heterogeneous 3-D deep neural network (DNN) accelerators. Our ML framework facilitates the design space exploration (DSE) of heterogeneous 3-D accelerators with a two-tier compute-on-memory (CoM) configuration, considering 3-D physical design factors. Our design space encompasses four distinct heterogeneous 3-D integration styles, combining 28- and 16-nm technology nodes for both compute and memory tiers. Using extrapolation techniques with ML models trained on 10-to-256 processing element (PE) accelerator configurations, we estimate the performance of systems featuring 75–16384 PEs, achieving a maximum absolute error of 13.9% (the number of PEs is not continuous and varies based on the accelerator architecture). To ensure balanced tier areas in the design, our framework assumes the same number of PEs or on-chip memory capacity across the four integration styles, accounting for area imbalance resulting from different technology nodes. Our analysis reveals that the heterogeneous 3-D style with 28-nm compute and 16-nm memory is energy-efficient and offers notable energy savings of up to 50% and an 8.8% reduction in runtime compared to other 3-D integration styles with the same number of PEs. Similarly, the heterogeneous 3-D style with 16-nm compute and 28-nm memory is area-efficient and shows up to 8.3% runtime reduction compared to other 3-D styles with the same on-chip memory capacity.
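The extrapolation idea, training a performance model on small configurations and querying it at much larger PE counts, can be illustrated with the simplest possible model. The closed-form least-squares line below uses entirely hypothetical numbers; the real framework uses richer ML models and 3-D physical-design features:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit y ~ a + b*x. Stands in for training a
    performance model on small accelerator configs so it can be queried
    (extrapolated) at PE counts never simulated directly."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# toy 'measured' runtime vs. PE count for small configs (hypothetical data)
pes = [10, 32, 64, 128, 256]
runtime = [2 + 0.01 * p for p in pes]
a, b = fit_line(pes, runtime)
pred_16384 = a + b * 16384       # query the model far outside the training range
```

The framework's reported 13.9% maximum error at up to 16384 PEs reflects exactly this regime: predictions far beyond the 10-to-256-PE training configurations.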
"3DNN-Xplorer: A Machine Learning Framework for Design Space Exploration of Heterogeneous 3-D DNN Accelerators," Gauthaman Murali, Min Gyu Park, and Sung Kyu Lim, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 358–370, published 2024-10-14, DOI: 10.1109/TVLSI.2024.3471496.
Pub Date: 2024-10-09. DOI: 10.1109/TVLSI.2024.3465010
Zihang Wang;Yushu Yang;Jianfei Wang;Jia Hou;Yang Su;Chen Yang
Polynomial multiplication is a significant bottleneck in mainstream postquantum cryptography (PQC) schemes. To speed it up, the number theoretic transform (NTT) is widely used, which decreases the time complexity from $O(n^{2})$ to $O(n\log_{2} n)$. However, it is challenging to ensure optimal hardware efficiency together with scalability. This brief proposes a novel pipelined NTT/inverse-NTT (INTT) architecture on a field-programmable gate array (FPGA). A group-based pairwise memory access (GPMA) scheme is proposed, and a scratchpad and reordering unit (SRU) is designed to form an efficient dataflow that simplifies the control units and achieves almost $n/2$ processing cycles on average for an n-point NTT. Moreover, our architecture supports varying parameters. Compared to state-of-the-art works, our architecture achieves up to $4.8\times$ latency improvements and up to $4.3\times$ improvements in area-time product (ATP).
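The $O(n\log_{2} n)$ transform that such architectures accelerate is the standard radix-2 NTT. A compact software reference follows; the parameters $n=8$, $p=17$, $\omega=9$ are chosen purely for illustration, and the GPMA memory scheme and SRU reordering themselves are not modeled:

```python
def ntt(a, p, w):
    """Iterative radix-2 Cooley-Tukey NTT over Z_p, O(n log n) versus
    O(n^2) for direct evaluation. Requires len(a) a power of two and w a
    primitive n-th root of unity mod p. Returns the transform in natural
    order (input is bit-reverse permuted first)."""
    n = len(a)
    a = a[:]
    j = 0
    for i in range(1, n):                      # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                         # log2(n) butterfly stages
        w_len = pow(w, n // length, p)         # twiddle base for this stage
        for start in range(0, n, length):
            tw = 1
            for k in range(length // 2):       # one butterfly per pair
                u = a[start + k]
                v = a[start + k + length // 2] * tw % p
                a[start + k] = (u + v) % p
                a[start + k + length // 2] = (u - v) % p
                tw = tw * w_len % p
        length <<= 1
    return a

# n = 8, p = 17, w = 9 (9 is a primitive 8th root of unity mod 17)
out = ntt([1, 0, 0, 0, 0, 0, 0, 0], 17, 9)    # delta transforms to all ones
```

The $n/2$ butterflies per stage in the inner loop are exactly the operations the pipelined hardware schedules, which is why memory-access ordering between stages dominates the design.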
"A Scalable and Efficient NTT/INTT Architecture Using Group-Based Pairwise Memory Access and Fast Interstage Reordering," Zihang Wang, Yushu Yang, Jianfei Wang, Jia Hou, Yang Su, and Chen Yang, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 588–592, published 2024-10-09, DOI: 10.1109/TVLSI.2024.3465010.
Pub Date: 2024-10-09. DOI: 10.1109/TVLSI.2024.3471528
Ebenezer C. Usih;Naimul Hassan;Alexander J. Edwards;Felipe Garcia-Sanchez;Pedram Khalili Amiri;Joseph S. Friedman
Toggle spin-orbit torque (SOT)-driven magnetoresistive random access memory (MRAM) with perpendicular anisotropy has a simple material stack and is more robust than directional SOT-MRAM. However, a read-before-write operation is required to use the toggle SOT-MRAM for directional switching, which threatens to increase the write delay. To resolve these issues, we propose a high-speed memory architecture for toggle SOT-MRAM that includes a minimum-sized bit cell and a custom read-write driver. The proposed driver induces an analog self-terminating SOT current that functions via an analog feedback mechanism that can read and write the toggle SOT-MRAM bit cell within a single clock cycle. As the read and write operations are completed within 570 ps, this memory architecture provides the first viable solution for nonvolatile L3 cache.
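The read-before-write requirement is easy to state in code: a toggle-only device must be read to decide whether to pulse at all. The toy sketch below shows only that logic; the paper's contribution replaces the explicit digital read step with analog self-termination of the SOT current, which a software model does not capture:

```python
def directional_write(cell, target):
    """Directional write on a toggle-only memory cell: toggle SOT-MRAM can
    only flip its state, so writing an absolute value requires first
    reading the state and pulsing only on a mismatch."""
    if cell["state"] != target:   # the 'read-before-write' step
        cell["state"] ^= 1        # a single toggle pulse flips the bit
    return cell["state"]

cell = {"state": 0}
a = directional_write(cell, 1)    # mismatch: one toggle pulse
b = directional_write(cell, 1)    # already correct: no pulse issued
```

Serializing the read and the conditional pulse is what threatens write latency, and collapsing both into one self-terminating analog operation is what lets the architecture complete a write within a single clock cycle.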
"Toggle SOT-MRAM Architecture With Self-Terminating Write Operation," Ebenezer C. Usih, Naimul Hassan, Alexander J. Edwards, Felipe Garcia-Sanchez, Pedram Khalili Amiri, and Joseph S. Friedman, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 337–345, published 2024-10-09, DOI: 10.1109/TVLSI.2024.3471528.
Pub Date: 2024-10-09. DOI: 10.1109/TVLSI.2024.3470342
Anshaj Shrivastava;Gaurab Banerjee
This study presents a compact, on-chip analog probe module (APM) to augment the IEEE 1149.4 (or P1687.2) standard for efficiently probing multiple internal nodes. The complete approach to APM implementation, from conceptualization to practical application, is discussed in detail. The APM aims to utilize a minimum area for a maximum number of probe channels, achieving an optimal size of 4:15. At the transistor level, the design minimizes the impact of glitches in asynchronous operations through a symmetrical layout and a unique arrangement of all probe channels. However, glitches in asynchronous circuits can still exist; hence, a state transition matrix (STM) concept is devised. STMs help visualize hazardous transitions, allowing the identification of a common hazard-free transition sequence suitable for hardware implementation. The verified APM design is integrated with several analog/RF circuits fabricated in a commercially available 0.18-$\mu$