Content-addressable memory (CAM) is a specialized type of memory that facilitates massively parallel comparison of a search pattern against its entire content. State-of-the-art (SOTA) CAM solutions are either fast but power-hungry (NOR CAM) or slower but less power-hungry (NAND CAM). These limitations stem from the dynamic precharge operation, which leads to excessive power consumption in NOR CAMs and charge-sharing issues in NAND CAMs. In this work, we propose a precharge-free CAM (PCAM) class for energy-efficient applications. By avoiding the precharge operation, PCAM consumes less energy than a NAND CAM while achieving search speed comparable to a NOR CAM. PCAM was designed in a 65-nm CMOS technology and comprehensively evaluated through extensive Monte Carlo (MC) simulations that account for layout parasitics. When benchmarked against conventional NAND CAM, PCAM demonstrates improved search runtime (reduced by more than 30%) and 15% less search energy. Moreover, PCAM can cut energy consumption by more than 75% compared to conventional NOR CAM. We further extend our analysis to the application level, functionally evaluating the CAM designs as a fully associative cache using a CPU simulator running various benchmark workloads. This analysis confirms that PCAMs represent an optimal energy-performance design choice for associative memories and their broad spectrum of applications.
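As a purely illustrative, hedged sketch of what a CAM search does functionally (a software model only, unrelated to the NOR/NAND/PCAM circuit styles compared above): every stored word is compared against the search key in parallel, and all matching addresses are returned.

```python
# Functional software model of a CAM search (illustrative only, not the PCAM circuit).
# All stored entries are compared against the search key "in parallel"; a mask of 1s
# marks the bit positions that must match (0 = don't care), as in ternary CAMs.
def cam_search(stored_words, key, mask=None, width=8):
    mask = (1 << width) - 1 if mask is None else mask
    return [addr for addr, word in enumerate(stored_words)
            if (word ^ key) & mask == 0]          # match when no masked bit differs

# usage: cam_search([0b1010, 0b1100, 0b1010], 0b1010) -> [0, 2]
```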
{"title":"Designing Precharge-Free Energy-Efficient Content-Addressable Memories","authors":"Ramiro Taco;Esteban Garzón;Robert Hanhan;Adam Teman;Leonid Yavits;Marco Lanuzza","doi":"10.1109/TVLSI.2024.3475036","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3475036","url":null,"abstract":"Content-addressable memory (CAM) is a specialized type of memory that facilitates massively parallel comparison of a search pattern against its entire content. State-of-the-art (SOTA) CAM solutions are either fast but power-hungry (NOR CAM) or slow while consuming less power (nand CAM). These limitations stem from the dynamic precharge operation, leading to excessive power consumption in NOR CAMs and charge-sharing issues in NAND CAMs. In this work, we propose a precharge-free CAM (PCAM) class for energy-efficient applications. By avoiding precharge operation, PCAM consumes less energy than a NAND CAM, while achieving search speed comparable to a NOR CAM. PCAM was designed using a 65-nm CMOS technology and comprehensively evaluated under extensive Monte Carlo (MC) simulations while taking into account layout parasitics. When benchmarked against conventional NAND CAM, PCAM demonstrates improved search run time (reduced by more than 30%) and 15% less search energy. Moreover, PCAM can cut energy consumption by more than 75% when compared to conventional NOR CAM. We further extend our analysis to the application level, functionally evaluating the CAM designs as a fully associative cache using a CPU simulator running various benchmark workloads. This analysis confirms that PCAMs represent an optimal energy-performance design choice for associative memories and their broad spectrum of applications.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 12","pages":"2303-2314"},"PeriodicalIF":2.8,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10723100","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-17 | DOI: 10.1109/TVLSI.2024.3474183
Negar Shabanzadeh;Aarno Pärssinen;Timo Rahkonen
To optimize the linearity of mixers, one needs to recognize the origins and mixing mechanisms of the dominant nonlinearities. This article presents a time-varying (TV) nonlinear analysis, where TV polynomial models are used to yield harmonic mixing functions in four simple mixer structures. Local oscillator (LO) waveform engineering is utilized to tailor the nonlinearity coefficients, which are then fed into a numerical distortion contribution analysis tool to calculate the nonlinearity using a Volterra series-based approach. The spectra of the nonlinear voltages are plotted, followed by breakdowns of the final IM3 distortion into its individual contributions. Finally, the effect of filtering on distortion is illustrated.
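For intuition only, a hedged illustration of the kind of model referred to above (an assumed form, not taken from the article): a memoryless time-varying polynomial whose coefficients are periodic at the LO rate, driven by a two-tone excitation that produces the IM3 products mentioned in the abstract.

```latex
% Illustrative time-varying polynomial mixer model (assumed form, not from the article)
\begin{aligned}
y(t) &= a_1(t)\,x(t) + a_2(t)\,x^2(t) + a_3(t)\,x^3(t), \\
a_k(t) &= \sum_{n} a_{k,n}\, e^{j n \omega_{\mathrm{LO}} t}
  \quad \text{(coefficients periodic at the LO rate)}, \\
x(t) &= A\left[\cos(\omega_1 t) + \cos(\omega_2 t)\right]
  \;\Rightarrow\; \text{IM3 at } 2\omega_1-\omega_2,\; 2\omega_2-\omega_1
  \text{ (and their mixes with } n\omega_{\mathrm{LO}}).
\end{aligned}
```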
{"title":"A Study on Nonlinearity in Mixers Using a Time-Varying Volterra-Based Distortion Contribution Analysis Tool","authors":"Negar Shabanzadeh;Aarno Pärssinen;Timo Rahkonen","doi":"10.1109/TVLSI.2024.3474183","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3474183","url":null,"abstract":"To optimize the linearity of mixers, one needs to recognize the origins and mixing mechanisms of dominant nonlinearities. This article presents a time-varying (TV) nonlinear analysis, where TV polynomial models are used to yield harmonic mixing functions in four simple mixer structures. Local oscillator (LO) waveform engineering has been utilized to tailor the nonlinearity coefficients, which were later fed into a numerical distortion contribution analysis tool to calculate the nonlinearity using a Volterra series-based approach. The spectra of the nonlinear voltages have been drawn, followed by plots breaking the final IM3 distortion down into its contributions. The effect of filtering on distortion has been illustrated in the end.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"32 12","pages":"2232-2242"},"PeriodicalIF":2.8,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10721238","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The ongoing integration of advanced functionalities in contemporary system-on-chips (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified with the advancement of artificial intelligence (AI), necessitating enhanced memory and interconnect bandwidth and latency. This article presents a comprehensive study encompassing architectural modifications of an interconnect-dominated many-core SoC targeting a significant increase of intermediate, on-chip cache memory bandwidth and access latency tuning. The proposed SoC has been implemented in 3-D using A10 nanosheet technology and early thermal analysis has been performed. Our workload simulations reveal, respectively, up to 12- and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC. Such speed-up comes at a 40% increase in die area and a 60% rise in power dissipation when implemented in 2-D. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to leverage the potential of 3-D technology for achieving lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in High Performance Computing (HPC) and mobile applications. Our analysis reveals that, unlike high-power-density HPC cases, the 3-D mobile case increases $T_{max}$ by only $2~^{\circ}$C–$3~^{\circ}$C compared to 2-D, while the HPC scenario requires multiconstrained efficient partitioning for 3-D implementations.
{"title":"Bandwidth-Latency-Thermal Co-Optimization of Interconnect-Dominated Many-Core 3D-IC","authors":"Sudipta Das;Samuel Riedel;Mohamed Naeim;Moritz Brunion;Marco Bertuletti;Luca Benini;Julien Ryckaert;James Myers;Dwaipayan Biswas;Dragomir Milojevic","doi":"10.1109/TVLSI.2024.3467148","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3467148","url":null,"abstract":"The ongoing integration of advanced functionalities in contemporary system-on-chips (SoCs) poses significant challenges related to memory bandwidth, capacity, and thermal stability. These challenges are further amplified with the advancement of artificial intelligence (AI), necessitating enhanced memory and interconnect bandwidth and latency. This article presents a comprehensive study encompassing architectural modifications of an interconnect-dominated many-core SoC targeting the significant increase of intermediate, on-chip cache memory bandwidth and access latency tuning. The proposed SoC has been implemented in 3-D using A10 nanosheet technology and early thermal analysis has been performed. Our workload simulations reveal, respectively, up to 12- and 2.5-fold acceleration in the 64-core and 16-core versions of the SoC. Such speed-up comes at 40% increase in die-area and a 60% rise in power dissipation when implemented in 2-D. In contrast, the 3-D counterpart not only minimizes the footprint but also yields 20% power savings, attributable to a 40% reduction in wirelength. The article further highlights the importance of pipeline restructuring to leverage the potential of 3-D technology for achieving lower latency and more efficient memory access. Finally, we discuss the thermal implications of various 3-D partitioning schemes in High Performance Computing (HPC) and mobile applications. Our analysis reveals that, unlike high-power density HPC cases, 3-D mobile case increases <inline-formula> <tex-math>$T_{max }$ </tex-math></inline-formula> only by <inline-formula> <tex-math>$2~^{circ } $ </tex-math></inline-formula>C–<inline-formula> <tex-math>$3~^{circ } $ </tex-math></inline-formula>C compared to 2-D, while the HPC scenario analysis requires multiconstrained efficient partitioning for 3-D implementations.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"346-357"},"PeriodicalIF":2.8,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PoolFormer is a subset of Transformer neural network with a key difference of replacing computationally demanding token mixer with pooling function. In this work, a memristor-based PoolFormer network modeling and training framework for edge-artificial intelligence (AI) applications is presented. The original PoolFormer structure is further optimized for hardware implementation on RRAM crossbar by replacing the normalization operation with scaling. In addition, the nonidealities of RRAM crossbar from device to array level as well as peripheral readout circuits are analyzed. By integrating these factors into one training framework, the overall neural network performance is evaluated holistically and the impact of nonidealities to the network performance can be effectively mitigated. Implemented in Python and PyTorch, a 16-block PoolFormer network is built with $64times 64$ four-level RRAM crossbar array model extracted from measurement results. The total number of the proposed Edge PoolFormer network parameters is 0.246 M, which is at least one order smaller than the conventional CNN implementation. This network achieved inference accuracy of 88.07% for CIFAR-10 image classification tasks with accuracy degradation of 1.5% compared to the ideal software model with FP32 precision weights.
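A minimal PyTorch sketch of a PoolFormer-style block as described above (assumed structure, not the authors' exact code): the attention token mixer is replaced by average pooling, and the normalization layers are replaced by simple learnable per-channel scaling, as the abstract describes for hardware-friendly mapping.

```python
# Hedged sketch of a PoolFormer-style block: pooling token mixer, scaling instead of norm.
import torch
import torch.nn as nn

class ChannelScale(nn.Module):
    """Learnable per-channel scale used in place of normalization (assumed form)."""
    def __init__(self, dim):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):                          # x: (B, C, H, W)
        return x * self.scale.view(1, -1, 1, 1)

class PoolFormerBlock(nn.Module):
    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.scale1 = ChannelScale(dim)
        # pooling token mixer: local average pooling minus identity
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)
        self.scale2 = ChannelScale(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, kernel_size=1),
        )

    def forward(self, x):
        y = self.scale1(x)
        x = x + (self.pool(y) - y)                 # token mixing via pooling
        x = x + self.mlp(self.scale2(x))           # channel MLP
        return x

# usage: out = PoolFormerBlock(dim=64)(torch.randn(1, 64, 8, 8))
```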
{"title":"Edge PoolFormer: Modeling and Training of PoolFormer Network on RRAM Crossbar for Edge-AI Applications","authors":"Tiancheng Cao;Weihao Yu;Yuan Gao;Chen Liu;Tantan Zhang;Shuicheng Yan;Wang Ling Goh","doi":"10.1109/TVLSI.2024.3472270","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3472270","url":null,"abstract":"PoolFormer is a subset of Transformer neural network with a key difference of replacing computationally demanding token mixer with pooling function. In this work, a memristor-based PoolFormer network modeling and training framework for edge-artificial intelligence (AI) applications is presented. The original PoolFormer structure is further optimized for hardware implementation on RRAM crossbar by replacing the normalization operation with scaling. In addition, the nonidealities of RRAM crossbar from device to array level as well as peripheral readout circuits are analyzed. By integrating these factors into one training framework, the overall neural network performance is evaluated holistically and the impact of nonidealities to the network performance can be effectively mitigated. Implemented in Python and PyTorch, a 16-block PoolFormer network is built with <inline-formula> <tex-math>$64times 64$ </tex-math></inline-formula> four-level RRAM crossbar array model extracted from measurement results. The total number of the proposed Edge PoolFormer network parameters is 0.246 M, which is at least one order smaller than the conventional CNN implementation. This network achieved inference accuracy of 88.07% for CIFAR-10 image classification tasks with accuracy degradation of 1.5% compared to the ideal software model with FP32 precision weights.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"384-394"},"PeriodicalIF":2.8,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-15 | DOI: 10.1109/TVLSI.2024.3466474
Chao Ji;Xiaohu You;Chuan Zhang;Christoph Studer
Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences with only slight frame error rate (FER) performance degradation. Different from the logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity and hardware-friendly method called shift LWO (sLWO), whose shift factor can be chosen empirically to trade off FER performance against query complexity. To generate noise sequences with sLWO effectively, we utilize a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we use synthesis results evaluated on polar codes in a 65-nm CMOS technology. While maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gbps average throughput ($1.26\times$ higher), 4.2-Mbps worst case throughput ($8.24\times$ higher), 2.4-Mbps/mm² worst case area efficiency ($12\times$ higher), and $4.66\times 10^{4}$ pJ/bit worst case energy efficiency ($9.96\times$ lower) compared with the synthesized ORBGRAND design with LWO for a (128, 105) polar code, and also provide $8.62\times$ higher average throughput and $9.4\times$ higher average area efficiency but $7.51\times$ worse average energy efficiency than the ORBGRAND chip for a (256, 240) polar code, at a target FER of $10^{-7}$.
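For intuition only, the following hedged Python sketch shows a basic ORBGRAND-style querying loop with logistic-weight-ordered (LWO) noise sequences; the sLWO refinement, its shift factor, and the LUT-aided hardware strategy described above are not reproduced here, and `is_codeword` is a placeholder for the code's membership check.

```python
# Hedged sketch of ORBGRAND-style querying (illustrative only, not the paper's design).
# Noise patterns are applied to the reliability-sorted hard decision in increasing
# logistic weight, i.e., increasing sum of the (1-indexed) flipped positions.
from itertools import combinations

def orbgrand_query(hard_bits, reliability_order, is_codeword, max_weight=3, max_lw=40):
    """hard_bits: list of 0/1; reliability_order: indices sorted least-reliable first;
    is_codeword: placeholder membership test for the linear block code."""
    patterns = []
    for w in range(0, max_weight + 1):                 # Hamming weight of the flip pattern
        for pos in combinations(range(1, len(hard_bits) + 1), w):
            lw = sum(pos)                              # logistic weight of the pattern
            if lw <= max_lw:
                patterns.append((lw, pos))
    patterns.sort()                                    # query in increasing logistic weight
    for _, pos in patterns:
        cand = hard_bits[:]
        for p in pos:                                  # flip the p-th least reliable bit
            i = reliability_order[p - 1]
            cand[i] ^= 1
        if is_codeword(cand):
            return cand                                # first hit is the decoding guess
    return None                                        # abandon after the query budget
```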
{"title":"Efficient ORBGRAND Implementation With Parallel Noise Sequence Generation","authors":"Chao Ji;Xiaohu You;Chuan Zhang;Christoph Studer","doi":"10.1109/TVLSI.2024.3466474","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3466474","url":null,"abstract":"Guessing random additive noise decoding (GRAND) is establishing itself as a universal method for decoding linear block codes, and ordered reliability bits GRAND (ORBGRAND) is a hardware-friendly variant that processes soft-input information. In this work, we propose an efficient hardware implementation of ORBGRAND that significantly reduces the cost of querying noise sequences with slight frame error rate (FER) performance degradation. Different from logistic weight order (LWO) and improved LWO (iLWO) typically used to generate noise sequences, we introduce a reduced-complexity and hardware-friendly method called shift LWO (sLWO), of which the shift factor can be chosen empirically to trade the FER performance and query complexity well. To effectively generate noise sequences with sLWO, we utilize a hardware-friendly lookup-table (LUT)-aided strategy, which improves throughput as well as area and energy efficiency. To demonstrate the efficacy of our solution, we use synthesis results evaluated on polar codes in a 65-nm CMOS technology. While maintaining similar FER performance, our ORBGRAND implementations achieve 53.6-Gbps average throughput (<inline-formula> <tex-math>$1.26times $ </tex-math></inline-formula> higher), 4.2-Mbps worst case throughput (<inline-formula> <tex-math>$8.24times $ </tex-math></inline-formula> higher), 2.4-Mbps/mm2 worst case area efficiency (<inline-formula> <tex-math>$12times $ </tex-math></inline-formula> higher), and <inline-formula> <tex-math>$4.66times 10 ^{{4}}$ </tex-math></inline-formula> pJ/bit worst case energy efficiency (<inline-formula> <tex-math>$9.96times $ </tex-math></inline-formula> lower) compared with the synthesized ORBGRAND design with LWO for a (128, 105) polar code and also provide <inline-formula> <tex-math>$8.62times $ </tex-math></inline-formula> higher average throughput and <inline-formula> <tex-math>$9.4times $ </tex-math></inline-formula> higher average area efficiency but <inline-formula> <tex-math>$7.51times $ </tex-math></inline-formula> worse average energy efficiency than the ORBGRAND chip for a (256, 240) polar code, at a target FER of <inline-formula> <tex-math>$10^{-7}$ </tex-math></inline-formula>.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"435-448"},"PeriodicalIF":2.8,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-14 | DOI: 10.1109/TVLSI.2024.3470834
Huarun Chen;Yijun Liu;Wujian Ye;Jialiang Ye;Yuehai Chen;Shaozhen Chen;Chao Han
Most existing methods for traffic sign recognition exploit deep learning technology such as convolutional neural networks (CNNs) to achieve breakthroughs in detection accuracy; however, due to the large number of CNN parameters, practical applications suffer from high power consumption, heavy computation, and slow speed. Compared with CNNs, a spiking neural network (SNN) can effectively simulate the information processing mechanism of the biological brain, with stronger parallel processing capability, better sparsity, and real-time performance. Thus, we design and realize a novel traffic sign recognition system [called SNN on FPGA-traffic sign recognition system (SFPGA-TSRS)] based on a spiking CNN (SCNN) and an FPGA platform. Specifically, to improve the recognition accuracy, a traffic sign recognition model, spatial attention SCNN (SA-SCNN), is proposed by combining an LIF/IF-neuron-based SCNN with an SA mechanism; and to accelerate model inference, a high-performance neuron module is implemented, and an input coding module is designed as the input layer of the recognition model. The experiments show that, compared with existing systems, the proposed SFPGA-TSRS efficiently supports the deployment of SCNN models, with a higher recognition accuracy of 99.22%, a faster frame rate of 66.38 frames per second (FPS), and lower power consumption of 1.423 W on the GTSRB dataset.
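As a rough, hedged illustration of the leaky integrate-and-fire (LIF) neuron dynamics mentioned above (parameters and shapes are placeholders, not taken from the paper or its FPGA neuron module):

```python
# Hedged sketch of a discrete-time LIF neuron layer update, illustrative only.
import numpy as np

def lif_step(v, spikes_in, weights, leak=0.9, v_thresh=1.0, v_reset=0.0):
    """One time step for a layer of LIF neurons.
    v: membrane potentials (N,), spikes_in: binary input spikes (M,),
    weights: synaptic matrix (N, M)."""
    v = leak * v + weights @ spikes_in            # leak, then integrate weighted spikes
    spikes_out = (v >= v_thresh).astype(float)    # fire when the threshold is crossed
    v = np.where(spikes_out > 0, v_reset, v)      # reset the neurons that fired
    return v, spikes_out

# usage: v, s = lif_step(np.zeros(4), np.array([1., 0., 1.]), np.random.rand(4, 3))
```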
{"title":"Research on Hardware Acceleration of Traffic Sign Recognition Based on Spiking Neural Network and FPGA Platform","authors":"Huarun Chen;Yijun Liu;Wujian Ye;Jialiang Ye;Yuehai Chen;Shaozhen Chen;Chao Han","doi":"10.1109/TVLSI.2024.3470834","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3470834","url":null,"abstract":"Most of the existing methods for traffic sign recognition exploited deep learning technology such as convolutional neural networks (CNNs) to achieve a breakthrough in detection accuracy; however, due to the large number of CNN’s parameters, there are problems in practical applications such as high power consumption, large calculation, and slow speed. Compared with CNN, a spiking neural network (SNN) can effectively simulate the information processing mechanism of biological brain, with stronger parallel processing capability, better sparsity, and real-time performance. Thus, we design and realize a novel traffic sign recognition system [called SNN on FPGA-traffic sign recognition system (SFPGA-TSRS)] based on spiking CNN (SCNN) and FPGA platform. Specifically, to improve the recognition accuracy, a traffic sign recognition model spatial attention SCNN (SA-SCNN) is proposed by combining LIF/IF neurons based SCNN with SA mechanism; and to accelerate the model inference, a neuron module is implemented with high performance, and an input coding module is designed as the input layer of the recognition model. The experiments show that compared with existing systems, the proposed SFPGA-TSRS can efficiently support the deployment of SCNN models, with a higher recognition accuracy of 99.22%, a faster frame rate of 66.38 frames per second (FPS), and lower power consumption of 1.423 W on the GTSRB dataset.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"499-511"},"PeriodicalIF":2.8,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-14 | DOI: 10.1109/TVLSI.2024.3471496
Gauthaman Murali;Min Gyu Park;Sung Kyu Lim
This article presents 3DNN-Xplorer, the first machine learning (ML)-based framework for predicting the performance of heterogeneous 3-D deep neural network (DNN) accelerators. Our ML framework facilitates the design space exploration (DSE) of heterogeneous 3-D accelerators with a two-tier compute-on-memory (CoM) configuration, considering 3-D physical design factors. Our design space encompasses four distinct heterogeneous 3-D integration styles, combining 28- and 16-nm technology nodes for both compute and memory tiers. Using extrapolation techniques with ML models trained on 10-to-256 processing element (PE) accelerator configurations, we estimate the performance of systems featuring 75–16384 PEs, achieving a maximum absolute error of 13.9% (the number of PEs is not continuous and varies based on the accelerator architecture). To ensure balanced tier areas in the design, our framework assumes the same number of PEs or on-chip memory capacity across the four integration styles, accounting for area imbalance resulting from different technology nodes. Our analysis reveals that the heterogeneous 3-D style with 28-nm compute and 16-nm memory is energy-efficient and offers notable energy savings of up to 50% and an 8.8% reduction in runtime compared to other 3-D integration styles with the same number of PEs. Similarly, the heterogeneous 3-D style with 16-nm compute and 28-nm memory is area-efficient and shows up to 8.3% runtime reduction compared to other 3-D styles with the same on-chip memory capacity.
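Purely as a hedged illustration of the extrapolation idea (not the authors' ML models, features, or data), one could fit a simple scaling trend on small-PE configurations and query it at larger PE counts:

```python
# Illustrative only: fit a log-log trend on small accelerator configs and extrapolate.
# The feature/target values below are placeholders, not data from 3DNN-Xplorer.
import numpy as np

pes     = np.array([10, 16, 32, 64, 128, 256])            # trained configuration sizes
runtime = np.array([9.1, 6.0, 3.4, 1.9, 1.1, 0.65])       # placeholder runtimes (a.u.)

coeffs = np.polyfit(np.log(pes), np.log(runtime), deg=1)  # log-log linear fit
predict = lambda n: np.exp(np.polyval(coeffs, np.log(n)))

print(predict(1024))   # extrapolated runtime estimate for a 1024-PE system (a.u.)
```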
{"title":"3DNN-Xplorer: A Machine Learning Framework for Design Space Exploration of Heterogeneous 3-D DNN Accelerators","authors":"Gauthaman Murali;Min Gyu Park;Sung Kyu Lim","doi":"10.1109/TVLSI.2024.3471496","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3471496","url":null,"abstract":"This article presents 3DNN-Xplorer, the first machine learning (ML)-based framework for predicting the performance of heterogeneous 3-D deep neural network (DNN) accelerators. Our ML framework facilitates the design space exploration (DSE) of heterogeneous 3-D accelerators with a two-tier compute-on-memory (CoM) configuration, considering 3-D physical design factors. Our design space encompasses four distinct heterogeneous 3-D integration styles, combining 28- and 16-nm technology nodes for both compute and memory tiers. Using extrapolation techniques with ML models trained on 10-to-256 processing element (PE) accelerator configurations, we estimate the performance of systems featuring 75–16384 PEs, achieving a maximum absolute error of 13.9% (the number of PEs is not continuous and varies based on the accelerator architecture). To ensure balanced tier areas in the design, our framework assumes the same number of PEs or on-chip memory capacity across the four integration styles, accounting for area imbalance resulting from different technology nodes. Our analysis reveals that the heterogeneous 3-D style with 28-nm compute and 16-nm memory is energy-efficient and offers notable energy savings of up to 50% and an 8.8% reduction in runtime compared to other 3-D integration styles with the same number of PEs. Similarly, the heterogeneous 3-D style with 16-nm compute and 28-nm memory is area-efficient and shows up to 8.3% runtime reduction compared to other 3-D styles with the same on-chip memory capacity.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"358-370"},"PeriodicalIF":2.8,"publicationDate":"2024-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-09 | DOI: 10.1109/TVLSI.2024.3465010
Zihang Wang;Yushu Yang;Jianfei Wang;Jia Hou;Yang Su;Chen Yang
Polynomial multiplication is a significant bottleneck in mainstream postquantum cryptography (PQC) schemes. To speed it up, the number theoretic transform (NTT) is widely used, which decreases the time complexity from $O(n^{2})$ to $O[n\log_{2}(n)]$. However, it is challenging to ensure optimal hardware efficiency in conjunction with scalability. This brief proposes a novel pipelined NTT/inverse-NTT (INTT) architecture on field-programmable gate array (FPGA). A group-based pairwise memory access (GPMA) scheme is proposed, and a scratchpad and reordering unit (SRU) is designed to form an efficient dataflow that simplifies control units and achieves almost $n/2$ processing cycles on average for an $n$-point NTT. Moreover, our architecture can support varying parameters. Compared to the state-of-the-art works, our architecture achieves up to $4.8\times$ latency improvements and up to $4.3\times$ improvements on area time product (ATP).
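For reference, a minimal software sketch of a textbook iterative radix-2 NTT with the $O[n\log_{2}(n)]$ complexity noted above (generic form, unrelated to the GPMA/SRU hardware dataflow proposed in the brief); it assumes $q \equiv 1 \pmod{n}$ and that `root` is a primitive root of $\mathbb{Z}_q$.

```python
# Generic iterative Cooley-Tukey NTT over Z_q (illustrative, not the proposed hardware).
def ntt(a, q, root):
    n = len(a)                                   # n must be a power of two, q = 1 (mod n)
    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(root, (q - 1) // length, q)  # primitive length-th root of unity mod q
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % q
                a[k] = (u + v) % q
                a[k + length // 2] = (u - v) % q
                w = w * w_len % q
        length <<= 1
    return a

# usage: ntt([1, 2, 3, 4, 0, 0, 0, 0], q=17, root=3)   # 3 is a primitive root mod 17
```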
{"title":"A Scalable and Efficient NTT/INTT Architecture Using Group-Based Pairwise Memory Access and Fast Interstage Reordering","authors":"Zihang Wang;Yushu Yang;Jianfei Wang;Jia Hou;Yang Su;Chen Yang","doi":"10.1109/TVLSI.2024.3465010","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3465010","url":null,"abstract":"Polynomial multiplication is a significant bottleneck in mainstream postquantum cryptography (PQC) schemes. To speed it up, number theoretic transform (NTT) is widely used, which decreases the time complexity from <inline-formula> <tex-math>${O}(n^{2})$ </tex-math></inline-formula> to <inline-formula> <tex-math>$O[nlog _{2}(n)]$ </tex-math></inline-formula>. However, it is challenging to ensure optimal hardware efficiency in conjunction with scalability. This brief proposes a novel pipelined NTT/inverse-NTT (INTT) architecture on field-programmable gate array (FPGA). A group-based pairwise memory access (GPMA) scheme is proposed, and a scratchpad and reordering unit (SRU) is designed to form an efficient dataflow that simplifies control units and achieves almost <inline-formula> <tex-math>$n/2$ </tex-math></inline-formula> processing cycles on average for n-point NTT. Moreover, our architecture can support varying parameters. Compared to the state-of-the-art works, our architecture achieves up to <inline-formula> <tex-math>$4.8times $ </tex-math></inline-formula> latency improvements and up to <inline-formula> <tex-math>$4.3times $ </tex-math></inline-formula> improvements on area time product (ATP).","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"588-592"},"PeriodicalIF":2.8,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-09 | DOI: 10.1109/TVLSI.2024.3471528
Ebenezer C. Usih;Naimul Hassan;Alexander J. Edwards;Felipe Garcia-Sanchez;Pedram Khalili Amiri;Joseph S. Friedman
Toggle spin-orbit torque (SOT)-driven magnetoresistive random access memory (MRAM) with perpendicular anisotropy has a simple material stack and is more robust than directional SOT-MRAM. However, a read-before-write operation is required to use toggle SOT-MRAM for directional switching, which threatens to increase the write delay. To resolve these issues, we propose a high-speed memory architecture for toggle SOT-MRAM that includes a minimum-sized bit cell and a custom read-write driver. The proposed driver produces a self-terminating SOT current through an analog feedback mechanism, allowing the toggle SOT-MRAM bit cell to be read and written within a single clock cycle. As the read and write operations complete within 570 ps, this memory architecture provides the first viable solution for nonvolatile L3 cache.
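To make explicit why directional writes on a toggle cell normally need a read first (the step the self-terminating driver above folds into its analog feedback), here is a hedged behavioral sketch; the cell model is a placeholder, not the proposed circuit.

```python
# Behavioral sketch only: conventional read-before-write on a toggle MRAM cell.
# Each toggle pulse inverts the stored bit, so a directional write must first read
# the cell and pulse only if the stored value differs from the desired one.
class ToggleCell:
    def __init__(self, bit=0):
        self.bit = bit
    def read(self):
        return self.bit
    def toggle_pulse(self):           # one SOT pulse flips the stored state
        self.bit ^= 1

def directional_write(cell, desired):
    if cell.read() != desired:        # the extra read adds to the write latency
        cell.toggle_pulse()
    return cell.read()

# usage: directional_write(ToggleCell(0), 1)  -> 1
```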
{"title":"Toggle SOT-MRAM Architecture With Self-Terminating Write Operation","authors":"Ebenezer C. Usih;Naimul Hassan;Alexander J. Edwards;Felipe Garcia-Sanchez;Pedram Khalili Amiri;Joseph S. Friedman","doi":"10.1109/TVLSI.2024.3471528","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3471528","url":null,"abstract":"Toggle spin-orbit torque (SOT)-driven magnetoresistive random access memory (MRAM) with perpendicular anisotropy has a simple material stack and is more robust than directional SOT-MRAM. However, a read-before-write operation is required to use the toggle SOT-MRAM for directional switching, which threatens to increase the write delay. To resolve these issues, we propose a high-speed memory architecture for toggle SOT-MRAM that includes a minimum-sized bit cell and a custom read-write driver. The proposed driver induces an analog self-terminating SOT current that functions via an analog feedback mechanism that can read and write the toggle SOT-MRAM bit cell within a single clock cycle. As the read and write operations are completed within 570 ps, this memory architecture provides the first viable solution for nonvolatile L3 cache.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"337-345"},"PeriodicalIF":2.8,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-09 | DOI: 10.1109/TVLSI.2024.3470342
Anshaj Shrivastava;Gaurab Banerjee
This study presents a compact, on-chip analog probe module (APM) to augment the IEEE 1149.4 (or P1687.2) standard for efficiently probing multiple internal nodes. The complete approach to APM implementation, from conceptualization to practical application, is discussed in detail. The APM aims to utilize a minimum area for a maximum number of probe channels, achieving an optimal size of 4:15. At the transistor level, the design minimizes the impact of glitches in asynchronous operations through a symmetrical layout and a unique arrangement of all probe channels. However, glitches in asynchronous circuits can still exist; hence, a state transition matrix (STM) concept is devised. STMs help visualize hazardous transitions, allowing the identification of a common hazard-free transition sequence suitable for hardware implementation. The verified APM design is integrated with several analog/RF circuits fabricated in a commercially available 0.18-$\mu$m