{"title":"VISTA: A Memory-Efficient CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration","authors":"Kai-Ping Lin;Jia-Han Liu;Hong-Chuan Liao;Jyun-Yi Wu;Tong Wu;Chao-Tsung Huang","doi":"10.1109/JSSC.2025.3547982","DOIUrl":null,"url":null,"abstract":"Video convolutional neural networks (V-CNNs) take multiple frames as input and leverage temporal information to enhance quality and temporal consistency, making them promising solutions for high-resolution imaging tasks, such as video super-resolution (VSR) and video frame interpolation (VFI). Previous works have proposed CNN accelerators for single-image high-resolution imaging tasks, using layer-fusion (LF) workflows to reduce the need for external memory access (EMA) of intermediate feature maps (FMs). However, V-CNNs demand more EMA and computational complexity, posing implementation challenges for edge devices. Additionally, using deformable convolution (DC) to break through the fixed shape of the kernel receptive field can improve image quality and temporal consistency but requires additional storage and computational logic. In this article, we present a memory-efficient V-CNN processor, VISTA. We introduce a cuboid-based LF (CBLF) workflow for V-CNNs to reuse temporal information from overlapped FMs at different time points, reducing EMA and computational complexity. Moreover, the VISTA adopts a heterogeneous reuse-recomputing approach to handle overlaps between region-of-influence (ROI) pyramids and uses reference-frame-first scheduling (RFFS) to reduce the need for extensive memory usage during cross-frame alignment computations. Furthermore, we apply a hardware-model co-design to devise tile-based offset-confined DC (TODC), which reduces computational logic and saves line buffer usage for the search window with 0.06–0.18 dB of peak signal-to-noise ratio (PSNR) drop in image quality. 
The 12.6-mm2 VISTA is fabricated using 40-nm CMOS technology and achieves peak throughput of 4K-UHD 60 and 50 frames/s for supporting VSR and VFI applications, respectively. It reduces 33%–53% of input EMA, 19% of activation static random-access memory (SRAM), and 19%–42% of computational complexity.","PeriodicalId":13129,"journal":{"name":"IEEE Journal of Solid-state Circuits","volume":"60 9","pages":"3416-3427"},"PeriodicalIF":5.6000,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Solid-state Circuits","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10924713/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Video convolutional neural networks (V-CNNs) take multiple frames as input and leverage temporal information to enhance quality and temporal consistency, making them promising solutions for high-resolution imaging tasks such as video super-resolution (VSR) and video frame interpolation (VFI). Previous works have proposed CNN accelerators for single-image high-resolution imaging tasks, using layer-fusion (LF) workflows to reduce external memory access (EMA) for intermediate feature maps (FMs). However, V-CNNs demand more EMA and greater computational complexity, posing implementation challenges for edge devices. Additionally, using deformable convolution (DC) to break through the fixed shape of the kernel receptive field can improve image quality and temporal consistency, but it requires additional storage and computational logic. In this article, we present a memory-efficient V-CNN processor, VISTA. We introduce a cuboid-based LF (CBLF) workflow for V-CNNs that reuses temporal information from overlapped FMs at different time points, reducing EMA and computational complexity. Moreover, VISTA adopts a heterogeneous reuse-recomputing approach to handle overlaps between region-of-influence (ROI) pyramids and uses reference-frame-first scheduling (RFFS) to reduce memory usage during cross-frame alignment computations. Furthermore, we apply hardware-model co-design to devise tile-based offset-confined DC (TODC), which reduces computational logic and saves line-buffer usage for the search window at a cost of 0.06-0.18 dB of peak signal-to-noise ratio (PSNR) in image quality. The 12.6-mm² VISTA is fabricated in 40-nm CMOS technology and achieves peak throughputs of 4K-UHD at 60 and 50 frames/s for VSR and VFI applications, respectively. It reduces input EMA by 33%-53%, activation static random-access memory (SRAM) usage by 19%, and computational complexity by 19%-42%.
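The offset-confinement idea behind TODC can be illustrated in software: in deformable convolution, each kernel tap samples the input at a learned fractional offset, so an unconstrained offset range forces hardware to buffer a large search window. Clamping offsets to a small range bounds that window. The sketch below is a minimal, illustrative NumPy version of single-channel deformable convolution with clamped offsets (the function names, the `max_off` parameter, and the offset-tensor layout are our own assumptions, not the paper's exact tile-based TODC scheme).

```python
import numpy as np

def bilinear(x, y, xx):
    """Bilinearly sample x at fractional position (y, xx); out-of-bounds -> 0."""
    H, W = x.shape
    y0, x0 = int(np.floor(y)), int(np.floor(xx))
    wy, wx = y - y0, xx - x0
    val = 0.0
    for yy, w_y in ((y0, 1 - wy), (y0 + 1, wy)):
        for xc, w_x in ((x0, 1 - wx), (x0 + 1, wx)):
            if 0 <= yy < H and 0 <= xc < W:
                val += w_y * w_x * x[yy, xc]
    return val

def deformable_conv2d_confined(x, weight, offsets, max_off=2.0):
    """Single-channel deformable convolution with confined offsets.

    Clamping each learned offset to [-max_off, +max_off] keeps every
    sample within a fixed window around its kernel tap, so a hardware
    line buffer only needs roughly (K + 2*max_off) rows rather than an
    unbounded search range. Illustrative sketch only.
    """
    H, W = x.shape
    K = weight.shape[0]          # square K x K kernel
    pad = K // 2
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for ki in range(K):
                for kj in range(K):
                    # Confined offsets: clamp keeps the sample near its tap.
                    dy = np.clip(offsets[i, j, ki, kj, 0], -max_off, max_off)
                    dx = np.clip(offsets[i, j, ki, kj, 1], -max_off, max_off)
                    acc += weight[ki, kj] * bilinear(
                        x, i + ki - pad + dy, j + kj - pad + dx)
            out[i, j] = acc
    return out
```

With all offsets at zero this reduces to an ordinary zero-padded convolution, which is a convenient sanity check; the clamp is what turns an arbitrary sampling pattern into one with a hardware-friendly, bounded footprint.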
Journal Description:
The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits, with particular emphasis on transistor-level design of integrated circuits. It also covers topics such as circuit modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.