A 578-TOPS/W RRAM-Based Binary Convolutional Neural Network Macro for Tiny AI Edge Devices
Pub Date: 2024-10-08 | DOI: 10.1109/TVLSI.2024.3469217
Lixun Wang;Yuejun Zhang;Pengjun Wang;Jianguo Yang;Huihong Zhang;Gang Li;Qikang Li
Nonvolatile computing-in-memory (nvCIM) technology enables data to be stored and processed in situ, providing a feasible solution for the widespread deployment of machine learning algorithms in edge AI devices. However, current nvCIM approaches based on weighted current summation face challenges such as device nonidealities and substantial time, storage, and energy overheads when handling high-precision analog signals. To address these issues, we propose a resistive random access memory (RRAM)-based binary convolution macro for constructing a complete binary convolutional neural network (BCNN) hardware circuit, accelerating edge AI applications with low weight precision. The macro performs error compensation at the circuit level and provides stable rail-to-rail output, eliminating the need for analog-to-digital converters (ADCs) or a processor to perform auxiliary computations. Experimental results demonstrate that the proposed full-hardware BCNN computing system achieves on-chip recognition accuracy of 90.7% (98.64%) on the CIFAR-10 (MNIST) dataset, a decrease of 0.98% (0.59%) relative to software recognition accuracy. In addition, the binary convolution macro achieves a maximum throughput of 320 GOPS and a peak energy efficiency of 578 TOPS/W at 136 MHz.
{"title":"A 578-TOPS/W RRAM-Based Binary Convolutional Neural Network Macro for Tiny AI Edge Devices","authors":"Lixun Wang;Yuejun Zhang;Pengjun Wang;Jianguo Yang;Huihong Zhang;Gang Li;Qikang Li","doi":"10.1109/TVLSI.2024.3469217","DOIUrl":"https://doi.org/10.1109/TVLSI.2024.3469217","url":null,"abstract":"The novel nonvolatile computing-in-memory (nvCIM) technology enables data to be stored and processed in situ, providing a feasible solution for the widespread deployment of machine learning algorithms in edge AI devices. However, current nvCIM approaches based on weighted current summation face challenges such as device nonidealities and substantial time, storage, and energy overheads when handling high-precision analog signals. To address these issues, we propose a resistive random access memory (RRAM)-based binary convolution macro for constructing a complete binary convolutional neural network (BCNN) hardware circuit, accelerating edge AI applications with low-weight precision. This macro performs error compensation at the circuit level and provides stable rail-to-rail output, eliminating the need for any ADCs or processor to perform auxiliary computations. Experimental results demonstrate that the proposed BCNN full-hardware computing system achieves on-chip recognition accuracy of 90.7% (98.64%) on the CIFAR10 (MNIST) dataset, which represents a decrease of 0.98% (0.59%) compared to software recognition accuracy. In addition, this binary convolution macro achieves a maximum throughput of 320 GOPS and a peak energy efficiency of 578 TOPS/W at 136 MHz.","PeriodicalId":13425,"journal":{"name":"IEEE Transactions on Very Large Scale Integration (VLSI) Systems","volume":"33 2","pages":"371-383"},"PeriodicalIF":2.8,"publicationDate":"2024-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142992934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-04 | DOI: 10.1109/TVLSI.2024.3461330
Casper Cromjongh;Yongding Tian;H. Peter Hofstee;Zaid Al-Ars
As dedicated hardware becomes more prevalent in accelerating complex applications, methods are needed to enable easy integration of multiple hardware components into a single accelerator system. This vision of composable hardware is hindered, however, by the lack of interface standards that allow such components to communicate. To address this challenge, the Tydi standard was proposed to facilitate the representation of streaming data in digital circuits, notably providing interface specifications for composite and variable-length data structures. At the same time, Chisel, a hardware construction language embedded in Scala, provides a suitable environment for deploying Tydi-centric components thanks to its abstraction level and customizability. This article introduces Tydi-Chisel, a library that integrates the Tydi standard into Chisel, along with a toolchain and methodology for designing data-streaming accelerators. The toolchain reduces the effort needed to design streaming hardware accelerators by raising the abstraction level of streams and module interfaces, thereby avoiding boilerplate code, and allows accelerator components from different designers to be integrated easily. This is demonstrated through an example project covering various scenarios, in which interface-related declaration code is reduced by a factor of 6–14. The Tydi-Chisel project repository is available at https://github.com/abs-tudelft/Tydi-Chisel.
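To give a sense of the boilerplate being abstracted, here is a minimal plain-Chisel sketch of a hand-written streaming interface using the standard ready/valid (Decoupled) convention. It deliberately does not use the Tydi-Chisel API itself, and the bundle and module names are hypothetical.

```scala
// Plain-Chisel sketch: a hand-written ready/valid streaming interface.
// Not the Tydi-Chisel API; names are illustrative only.
import chisel3._
import chisel3.util._

// One element of a variable-length sequence, e.g. a character of a string,
// with a "last" flag marking the end of the sequence: the kind of composite,
// variable-length structure the Tydi standard specifies interfaces for.
class CharToken extends Bundle {
  val char = UInt(8.W)
  val last = Bool()
}

// Forwards the stream unchanged. Even this trivial component needs explicit
// ready/valid plumbing per port; a Tydi-aware library can generate such
// interfaces from a higher-level stream description instead.
class PassThrough extends Module {
  val io = IO(new Bundle {
    val in  = Flipped(Decoupled(new CharToken))
    val out = Decoupled(new CharToken)
  })
  io.out <> io.in
}
```

Each additional stream or nested data structure multiplies this declaration code, which is consistent with the 6–14x reduction the article reports when such interfaces are generated from Tydi specifications instead.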