This brief presents an on-chip, real-time rotation calibration (RRC) technique aimed at alleviating inter-channel offset mismatch in time-interleaved (TI) successive-approximation register (SAR) analog-to-digital converters (ADCs). By leveraging auto-rotation calibration and self-compensation strategies in the analog domain, the proposed technique demonstrates robust performance across PVT variations. Two additional sub-channels are involved in the TI quantization mechanism, where the continuous rotation of the sampling-clock distribution ensures their operation in calibration mode. To validate the effectiveness of the proposed calibration, an $8\times 8$-bit, 8-GS/s TI-SAR ADC is designed and implemented in a 28-nm process; it occupies an active area of 0.273 mm², with each sub-channel SAR ADC covering only $86\times 23~\mu$m. Extensive simulation results validate the efficacy of RRC, demonstrating significant improvements in dynamic performance: at the Nyquist input frequency, SNDR increases from 37.1 to 45.4 dB, while SFDR rises from 57.8 to 60.7 dB.
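The rotation idea can be pictured in a few lines of code. The toy model below (class name, channel counts, and averaging scheme are illustrative assumptions, not the brief's circuit) has spare sub-channels cycle through a calibration slot, where each channel samples a zeroed input so that what it reads is its own offset, which is then stored and subtracted during normal conversions:

```python
import random

class RotatingTIADC:
    """Toy model of real-time rotation calibration (RRC): with 8 active
    sub-ADCs plus 2 spares, the sampling-clock assignment rotates so every
    channel periodically drops into calibration mode, where its input is
    zeroed and its residual offset is measured for later subtraction."""

    def __init__(self, n_active=8, n_spare=2, seed=1):
        rng = random.Random(seed)
        self.n = n_active + n_spare
        self.true_offset = [rng.uniform(-0.05, 0.05) for _ in range(self.n)]
        self.est_offset = [0.0] * self.n
        self.ptr = 0  # rotation pointer: which channel enters calibration next

    def calibrate_step(self):
        ch = self.ptr % self.n
        # the channel samples a zero input; what it reads IS its offset
        self.est_offset[ch] = self.true_offset[ch]
        self.ptr += 1

    def convert(self, ch, vin):
        # offset-corrected conversion for an active channel
        return vin + self.true_offset[ch] - self.est_offset[ch]

adc = RotatingTIADC()
for _ in range(adc.n):          # one full rotation calibrates every channel
    adc.calibrate_step()
residual = max(abs(adc.convert(c, 0.3) - 0.3) for c in range(adc.n))
```

After one full rotation, every channel's offset estimate is populated and the residual inter-channel mismatch collapses to numerical noise.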
"A Real-Time Rotation Calibration for Interchannel Offset Mismatch in Time-Interleaved SAR ADCs," Yixiao Luo, Hongzhi Liang, Zeyu Peng, Yukui Yu, Shubin Liu, Ruixue Ding, and Zhangming Zhu, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 897–901, published 2024-10-16, DOI: 10.1109/TVLSI.2024.3472095.
PoolFormer is a variant of the Transformer neural network whose key difference is replacing the computationally demanding token mixer with a pooling function. In this work, a memristor-based PoolFormer network modeling and training framework for edge artificial intelligence (AI) applications is presented. The original PoolFormer structure is further optimized for hardware implementation on an RRAM crossbar by replacing the normalization operation with scaling. In addition, the nonidealities of the RRAM crossbar, from the device level to the array level, as well as the peripheral readout circuits, are analyzed. By integrating these factors into one training framework, the overall neural network performance is evaluated holistically, and the impact of nonidealities on network performance can be effectively mitigated. Implemented in Python and PyTorch, a 16-block PoolFormer network is built with a $64\times 64$ four-level RRAM crossbar array model extracted from measurement results. The total number of parameters in the proposed Edge PoolFormer network is 0.246 M, at least one order of magnitude smaller than conventional CNN implementations. This network achieves an inference accuracy of 88.07% on the CIFAR-10 image classification task, an accuracy degradation of 1.5% compared with the ideal software model with FP32-precision weights.
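The pooling token mixer is simple enough to sketch directly. The following illustrative Python operates on a 1-D sequence of scalar tokens (real models pool over 2-D feature maps, and the window size here is an assumption): average pooling with the input subtracted, a common formulation so that the surrounding residual connection reduces to pure pooling:

```python
def pool_token_mixer(tokens, pool_size=3):
    """PoolFormer-style token mixer: replace attention with average pooling
    over neighboring tokens, minus the identity (the residual branch adds
    the input back afterward). 1-D scalar-token sketch for illustration."""
    n, half = len(tokens), pool_size // 2
    mixed = []
    for i in range(n):
        # 'same'-style padding by truncating the window at the edges
        window = tokens[max(0, i - half): i + half + 1]
        mixed.append(sum(window) / len(window) - tokens[i])
    return mixed

out = pool_token_mixer([1.0, 2.0, 3.0, 4.0])  # → [0.5, 0.0, 0.0, -0.5]
```

Note that the mixer is parameter-free, which is precisely why the overall network is so much smaller than a CNN or attention-based model of similar depth.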
"Edge PoolFormer: Modeling and Training of PoolFormer Network on RRAM Crossbar for Edge-AI Applications," Tiancheng Cao, Weihao Yu, Yuan Gao, Chen Liu, Tantan Zhang, Shuicheng Yan, and Wang Ling Goh, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 384–394, published 2024-10-15, DOI: 10.1109/TVLSI.2024.3472270.
Pub Date: 2024-10-15. DOI: 10.1109/TVLSI.2024.3472073
Jonghyun Oh;Kwanseo Park;Young-Ha Hwang
This brief presents an energy-efficient transceiver supporting a 10-Gb/s/lane display link interface between the application processor (AP) IC and the timing controller (TCON)-embedded source driver IC for mobile applications. An embedded clocking scheme is adopted to save clock-distribution power, which also reduces the required number of off-chip I/O channels. A transmitter (TX) sends 20-Gb/s aggregate data through two differential data lanes, and a receiver recovers a 5-GHz half-rate clock. The TX employs a latch-less serializer using divided clocks in staggered phases, achieving an energy efficiency of 0.43 pJ/b/lane. In the RX, a hybrid clock and data recovery (CDR) circuit tracks the half data rate with a digital loop filter (DLF) and subsequently locks the frequency and phase with an analog loop filter (ALF). By deactivating the DLF and the edge deserializer once a coarse frequency lock is acquired, the RX achieves an energy efficiency of 0.53 pJ/b/lane. The prototype transceiver, fabricated in a 28-nm CMOS technology, occupies an active area of 0.196 mm² and achieves an energy efficiency of 1.23 pJ/b/lane, including a charge-pump phase-locked loop (CP-PLL) with clock distribution.
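The phase tracking that a digital CDR loop filter performs can be sketched abstractly. The toy loop below (step size, phase units, and proportional-only control are assumptions for illustration; the paper's hybrid hand-off from DLF to ALF after frequency lock is not modeled) shows bang-bang early/late decisions walking the sampling phase onto the data and then dithering around it:

```python
def bang_bang_track(target_phase, steps=200, kp=0.01):
    """Toy bang-bang CDR phase loop: each cycle, the sampler reports only
    whether the recovered clock is early or late, and the loop moves the
    phase one fixed step toward the data. After acquisition, the phase
    limit-cycles within one step of the target."""
    phase = 0.0
    for _ in range(steps):
        err = 1 if phase < target_phase else -1  # early/late decision only
        phase += kp * err                        # proportional (bang-bang) update
    return phase

final = bang_bang_track(0.3)
```

The residual dither of roughly one step size is why practical designs, like the one above, pair the coarse digital loop with a finer analog loop for the final lock.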
"A 10-Gb/s/lane, Energy-Efficient Transceiver With Reference-Less Hybrid CDR for Mobile Display Link Interfaces," Jonghyun Oh, Kwanseo Park, and Young-Ha Hwang, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 887–891, published 2024-10-15, DOI: 10.1109/TVLSI.2024.3472073.
Pub Date: 2024-10-15. DOI: 10.1109/TVLSI.2024.3466474
Chao Ji;Xiaohu You;Chuan Zhang;Christoph Studer
Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences, with only slight frame error rate (FER) degradation. Unlike the logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity, hardware-friendly method called shift LWO (sLWO), whose shift factor can be chosen empirically to trade off FER performance against query complexity. To generate noise sequences with sLWO efficiently, we utilize a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we present synthesis results for polar codes in a 65-nm CMOS technology. While maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gb/s average throughput ($1.26\times$ higher), 4.2-Mb/s worst-case throughput ($8.24\times$ higher), 2.4-Mb/s/mm² worst-case area efficiency ($12\times$ higher), and $4.66\times 10^{4}$-pJ/bit worst-case energy efficiency ($9.96\times$ lower) compared with the synthesized ORBGRAND design with LWO for a (128, 105) polar code, and also provide $8.62\times$ higher average throughput and $9.4\times$ higher average area efficiency, but $7.51\times$ worse average energy efficiency, than the ORBGRAND chip for a (256, 240) polar code, at a target FER of $10^{-7}$.
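The query order behind ORBGRAND is concrete enough to sketch. With bit positions sorted least-reliable-first, logistic weight order tests error patterns in increasing order of the sum of their (1-indexed) flipped positions. The minimal recursive enumerator below is for illustration only; the paper's sLWO rescaling and LUT-aided parallel hardware generation are not modeled:

```python
def lwo_patterns(n, max_weight):
    """Enumerate ORBGRAND test error patterns in logistic weight order (LWO):
    a pattern is a set of distinct flipped positions in {1..n}, and its
    logistic weight is the sum of those positions. Patterns are yielded
    with nondecreasing weight."""
    def subsets_with_sum(target, start):
        if target == 0:
            yield []
            return
        for p in range(start, min(target, n) + 1):
            for rest in subsets_with_sum(target - p, p + 1):
                yield [p] + rest

    for w in range(1, max_weight + 1):
        for pattern in subsets_with_sum(w, 1):
            yield pattern

pats = list(lwo_patterns(8, 4))  # → [[1], [2], [1, 2], [3], [1, 3], [4]]
```

Each yielded pattern is one "query": the decoder flips those positions in the hard-decision word and checks whether the result is a codeword, stopping at the first success.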
"Efficient ORBGRAND Implementation With Parallel Noise Sequence Generation," Chao Ji, Xiaohu You, Chuan Zhang, and Christoph Studer, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 435–448, published 2024-10-15, DOI: 10.1109/TVLSI.2024.3466474.
Electromagnetic interference (EMI) is an inevitable issue in power electronics applications. Although many solutions have been presented to attenuate EMI noise, there is still little research on EMI suppression schemes for multimode primary-side regulation (PSR) flyback converters. Targeting EMI regulation in multimode PSR flyback converters, a combo EMI suppression scheme comprising frequency modulation and a dual-slope gate driver is adopted to meet stringent EMI requirements, simplifying the peripheral components and the design of the EMI filter. The proposed scheme is implemented in a 0.18-$\mu$m 5-V/40-V BCD process and occupies a die size (with pads) of $1.05\times 0.8$ mm². The experimental results show that the conducted EMI waveforms with line/neutral polarity easily comply with regulations. The deviations of the output voltage are within ±1.3% under different inputs and loads, while a peak efficiency of 90% is achieved.
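Frequency modulation for EMI suppression spreads the switching energy over a band instead of concentrating it at harmonics of one fixed frequency. The sketch below generates such a dithered switching-period sequence; the center frequency, spread ratio, and triangular profile are illustrative assumptions, not the paper's values:

```python
def dithered_periods(f0=65e3, spread=0.08, steps=32):
    """Spread-spectrum frequency modulation for a switching converter:
    sweep the switching frequency with a triangular profile within
    +/- spread around f0, so harmonic energy that would pile up at
    multiples of a fixed f0 is smeared across a band. Returns one
    modulation cycle of switching periods."""
    periods = []
    for k in range(steps):
        tri = 2 * abs(k / steps - 0.5) - 0.5   # triangle wave in [-0.5, 0.5]
        f = f0 * (1 + 2 * spread * tri)        # instantaneous frequency
        periods.append(1.0 / f)
    return periods

ps = dithered_periods()
f_min, f_max = 1 / max(ps), 1 / min(ps)        # swept band edges
```

A measuring receiver with a fixed resolution bandwidth then sees a lower peak level at each harmonic, which is what relaxes the EMI filter design.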
"A Combo EMI Suppression Scheme for Multimode PSR Flyback Converter," Yongyuan Li, Zhuliang Li, Wei Guo, Qiang Wu, Yongbo Zhang, Yong You, and Zhangming Zhu, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 3, pp. 892–896, published 2024-10-14, DOI: 10.1109/TVLSI.2024.3470837.
Pub Date: 2024-10-14. DOI: 10.1109/TVLSI.2024.3470834
Huarun Chen;Yijun Liu;Wujian Ye;Jialiang Ye;Yuehai Chen;Shaozhen Chen;Chao Han
Most existing methods for traffic sign recognition exploit deep learning techniques such as convolutional neural networks (CNNs) to achieve breakthroughs in detection accuracy; however, owing to the large number of CNN parameters, practical applications suffer from high power consumption, heavy computation, and slow speed. Compared with CNNs, a spiking neural network (SNN) can effectively simulate the information-processing mechanism of the biological brain, with stronger parallel-processing capability, better sparsity, and better real-time performance. Thus, we design and realize a novel traffic sign recognition system, the SNN-on-FPGA traffic sign recognition system (SFPGA-TSRS), based on a spiking CNN (SCNN) and an FPGA platform. Specifically, to improve recognition accuracy, a traffic sign recognition model, spatial-attention SCNN (SA-SCNN), is proposed by combining an LIF/IF-neuron-based SCNN with a spatial attention (SA) mechanism; and to accelerate model inference, a high-performance neuron module is implemented, and an input coding module is designed as the input layer of the recognition model. The experiments show that, compared with existing systems, the proposed SFPGA-TSRS efficiently supports the deployment of SCNN models, with a higher recognition accuracy of 99.22%, a faster frame rate of 66.38 frames per second (FPS), and a lower power consumption of 1.423 W on the GTSRB dataset.
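The LIF neuron mentioned above is the basic computational unit of an SCNN. A minimal discrete-time sketch follows; the leak factor, threshold, and hard-reset behavior are chosen for illustration and will differ from the paper's hardware neuron module:

```python
def lif_neuron(inputs, leak=0.9, threshold=1.0):
    """Minimal leaky integrate-and-fire (LIF) neuron: each time step the
    membrane potential leaks, integrates the input current, and emits a
    spike (with a hard reset to 0) when it crosses the threshold."""
    v, spikes = 0.0, []
    for i in inputs:
        v = leak * v + i          # leak, then integrate
        if v >= threshold:
            spikes.append(1)
            v = 0.0               # hard reset after spiking
        else:
            spikes.append(0)
    return spikes

out = lif_neuron([0.6, 0.6, 0.6, 0.0, 1.2])  # → [0, 1, 0, 0, 1]
```

Because activity is carried by sparse binary spikes rather than dense multiply-accumulates, this style of computation maps naturally onto low-power FPGA logic.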
"Research on Hardware Acceleration of Traffic Sign Recognition Based on Spiking Neural Network and FPGA Platform," Huarun Chen, Yijun Liu, Wujian Ye, Jialiang Ye, Yuehai Chen, Shaozhen Chen, and Chao Han, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 499–511, published 2024-10-14, DOI: 10.1109/TVLSI.2024.3470834.
Pub Date: 2024-10-14. DOI: 10.1109/TVLSI.2024.3471496
Gauthaman Murali;Min Gyu Park;Sung Kyu Lim
This article presents 3DNN-Xplorer, the first machine learning (ML)-based framework for predicting the performance of heterogeneous 3-D deep neural network (DNN) accelerators. Our ML framework facilitates the design space exploration (DSE) of heterogeneous 3-D accelerators with a two-tier compute-on-memory (CoM) configuration, considering 3-D physical design factors. Our design space encompasses four distinct heterogeneous 3-D integration styles, combining 28- and 16-nm technology nodes for both compute and memory tiers. Using extrapolation techniques with ML models trained on 10-to-256 processing element (PE) accelerator configurations, we estimate the performance of systems featuring 75–16384 PEs, achieving a maximum absolute error of 13.9% (the number of PEs is not continuous and varies based on the accelerator architecture). To ensure balanced tier areas in the design, our framework assumes the same number of PEs or on-chip memory capacity across the four integration styles, accounting for area imbalance resulting from different technology nodes. Our analysis reveals that the heterogeneous 3-D style with 28-nm compute and 16-nm memory is energy-efficient and offers notable energy savings of up to 50% and an 8.8% reduction in runtime compared to other 3-D integration styles with the same number of PEs. Similarly, the heterogeneous 3-D style with 16-nm compute and 28-nm memory is area-efficient and shows up to 8.3% runtime reduction compared to other 3-D styles with the same on-chip memory capacity.
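The extrapolation idea, training a performance model on small configurations and querying it at much larger PE counts, can be illustrated with the simplest possible model. The closed-form least-squares line below uses entirely hypothetical numbers; the real framework uses richer ML models and 3-D physical-design features:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit y ~ a + b*x. Stands in for training a
    performance model on small accelerator configs so it can be queried
    (extrapolated) at PE counts never simulated directly."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# toy 'measured' runtime vs. PE count for small configs (hypothetical data)
pes = [10, 32, 64, 128, 256]
runtime = [2 + 0.01 * p for p in pes]
a, b = fit_line(pes, runtime)
pred_16384 = a + b * 16384       # query the model far outside the training range
```

The framework's reported 13.9% maximum error at up to 16384 PEs reflects exactly this regime: predictions far beyond the 10-to-256-PE training configurations.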
"3DNN-Xplorer: A Machine Learning Framework for Design Space Exploration of Heterogeneous 3-D DNN Accelerators," Gauthaman Murali, Min Gyu Park, and Sung Kyu Lim, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 358–370, published 2024-10-14, DOI: 10.1109/TVLSI.2024.3471496.
Pub Date: 2024-10-09. DOI: 10.1109/TVLSI.2024.3465010
Zihang Wang;Yushu Yang;Jianfei Wang;Jia Hou;Yang Su;Chen Yang
Polynomial multiplication is a significant bottleneck in mainstream postquantum cryptography (PQC) schemes. To speed it up, the number theoretic transform (NTT) is widely used, which decreases the time complexity from $O(n^{2})$ to $O(n\log_{2} n)$. However, it is challenging to ensure optimal hardware efficiency together with scalability. This brief proposes a novel pipelined NTT/inverse-NTT (INTT) architecture on a field-programmable gate array (FPGA). A group-based pairwise memory access (GPMA) scheme is proposed, and a scratchpad and reordering unit (SRU) is designed to form an efficient dataflow that simplifies the control units and achieves almost $n/2$ processing cycles on average for an n-point NTT. Moreover, our architecture supports varying parameters. Compared to state-of-the-art works, our architecture achieves up to $4.8\times$ latency improvements and up to $4.3\times$ improvements in area-time product (ATP).
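The $O(n\log_{2} n)$ transform that such architectures accelerate is the standard radix-2 NTT. A compact software reference follows; the parameters $n=8$, $p=17$, $\omega=9$ are chosen purely for illustration, and the GPMA memory scheme and SRU reordering themselves are not modeled:

```python
def ntt(a, p, w):
    """Iterative radix-2 Cooley-Tukey NTT over Z_p, O(n log n) versus
    O(n^2) for direct evaluation. Requires len(a) a power of two and w a
    primitive n-th root of unity mod p. Returns the transform in natural
    order (input is bit-reverse permuted first)."""
    n = len(a)
    a = a[:]
    j = 0
    for i in range(1, n):                      # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:                         # log2(n) butterfly stages
        w_len = pow(w, n // length, p)         # twiddle base for this stage
        for start in range(0, n, length):
            tw = 1
            for k in range(length // 2):       # one butterfly per pair
                u = a[start + k]
                v = a[start + k + length // 2] * tw % p
                a[start + k] = (u + v) % p
                a[start + k + length // 2] = (u - v) % p
                tw = tw * w_len % p
        length <<= 1
    return a

# n = 8, p = 17, w = 9 (9 is a primitive 8th root of unity mod 17)
out = ntt([1, 0, 0, 0, 0, 0, 0, 0], 17, 9)    # delta transforms to all ones
```

The $n/2$ butterflies per stage in the inner loop are exactly the operations the pipelined hardware schedules, which is why memory-access ordering between stages dominates the design.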
"A Scalable and Efficient NTT/INTT Architecture Using Group-Based Pairwise Memory Access and Fast Interstage Reordering," Zihang Wang, Yushu Yang, Jianfei Wang, Jia Hou, Yang Su, and Chen Yang, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 588–592, published 2024-10-09, DOI: 10.1109/TVLSI.2024.3465010.
Pub Date: 2024-10-09. DOI: 10.1109/TVLSI.2024.3471528
Ebenezer C. Usih;Naimul Hassan;Alexander J. Edwards;Felipe Garcia-Sanchez;Pedram Khalili Amiri;Joseph S. Friedman
Toggle spin-orbit torque (SOT)-driven magnetoresistive random access memory (MRAM) with perpendicular anisotropy has a simple material stack and is more robust than directional SOT-MRAM. However, a read-before-write operation is required to use the toggle SOT-MRAM for directional switching, which threatens to increase the write delay. To resolve these issues, we propose a high-speed memory architecture for toggle SOT-MRAM that includes a minimum-sized bit cell and a custom read-write driver. The proposed driver induces an analog self-terminating SOT current that functions via an analog feedback mechanism that can read and write the toggle SOT-MRAM bit cell within a single clock cycle. As the read and write operations are completed within 570 ps, this memory architecture provides the first viable solution for nonvolatile L3 cache.
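The read-before-write requirement is easy to state in code: a toggle-only device must be read to decide whether to pulse at all. The toy sketch below shows only that logic; the paper's contribution replaces the explicit digital read step with analog self-termination of the SOT current, which a software model does not capture:

```python
def directional_write(cell, target):
    """Directional write on a toggle-only memory cell: toggle SOT-MRAM can
    only flip its state, so writing an absolute value requires first
    reading the state and pulsing only on a mismatch."""
    if cell["state"] != target:   # the 'read-before-write' step
        cell["state"] ^= 1        # a single toggle pulse flips the bit
    return cell["state"]

cell = {"state": 0}
a = directional_write(cell, 1)    # mismatch: one toggle pulse
b = directional_write(cell, 1)    # already correct: no pulse issued
```

Serializing the read and the conditional pulse is what threatens write latency, and collapsing both into one self-terminating analog operation is what lets the architecture complete a write within a single clock cycle.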
"Toggle SOT-MRAM Architecture With Self-Terminating Write Operation," Ebenezer C. Usih, Naimul Hassan, Alexander J. Edwards, Felipe Garcia-Sanchez, Pedram Khalili Amiri, and Joseph S. Friedman, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 2, pp. 337–345, published 2024-10-09, DOI: 10.1109/TVLSI.2024.3471528.
Pub Date: 2024-10-09. DOI: 10.1109/TVLSI.2024.3470342
Anshaj Shrivastava;Gaurab Banerjee
This study presents a compact, on-chip analog probe module (APM) to augment the IEEE 1149.4 (or P1687.2) standard for efficiently probing multiple internal nodes. The complete approach to APM implementation, from conceptualization to practical application, is discussed in detail. The APM aims to utilize a minimum area for a maximum number of probe channels, achieving an optimal size of 4:15. At the transistor level, the design minimizes the impact of glitches in asynchronous operations through a symmetrical layout and a unique arrangement of all probe channels. However, glitches in asynchronous circuits can still exist; hence, a state transition matrix (STM) concept is devised. STMs help visualize hazardous transitions, allowing the identification of a common hazard-free transition sequence suitable for hardware implementation. The verified APM design is integrated with several analog/RF circuits fabricated in a commercially available 0.18-$\mu$