{"title":"VISTA: A Memory-Efficient CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration","authors":"Kai-Ping Lin;Jia-Han Liu;Hong-Chuan Liao;Jyun-Yi Wu;Tong Wu;Chao-Tsung Huang","doi":"10.1109/JSSC.2025.3547982","DOIUrl":null,"url":null,"abstract":"Video convolutional neural networks (V-CNNs) take multiple frames as input and leverage temporal information to enhance quality and temporal consistency, making them promising solutions for high-resolution imaging tasks, such as video super-resolution (VSR) and video frame interpolation (VFI). Previous works have proposed CNN accelerators for single-image high-resolution imaging tasks, using layer-fusion (LF) workflows to reduce the need for external memory access (EMA) of intermediate feature maps (FMs). However, V-CNNs demand more EMA and computational complexity, posing implementation challenges for edge devices. Additionally, using deformable convolution (DC) to break through the fixed shape of the kernel receptive field can improve image quality and temporal consistency but requires additional storage and computational logic. In this article, we present a memory-efficient V-CNN processor, VISTA. We introduce a cuboid-based LF (CBLF) workflow for V-CNNs to reuse temporal information from overlapped FMs at different time points, reducing EMA and computational complexity. Moreover, the VISTA adopts a heterogeneous reuse-recomputing approach to handle overlaps between region-of-influence (ROI) pyramids and uses reference-frame-first scheduling (RFFS) to reduce the need for extensive memory usage during cross-frame alignment computations. Furthermore, we apply a hardware-model co-design to devise tile-based offset-confined DC (TODC), which reduces computational logic and saves line buffer usage for the search window with 0.06–0.18 dB of peak signal-to-noise ratio (PSNR) drop in image quality. 
The 12.6-mm2 VISTA is fabricated using 40-nm CMOS technology and achieves peak throughput of 4K-UHD 60 and 50 frames/s for supporting VSR and VFI applications, respectively. It reduces 33%–53% of input EMA, 19% of activation static random-access memory (SRAM), and 19%–42% of computational complexity.","PeriodicalId":13129,"journal":{"name":"IEEE Journal of Solid-state Circuits","volume":"60 9","pages":"3416-3427"},"PeriodicalIF":5.6000,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Solid-state Circuits","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10924713/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Video convolutional neural networks (V-CNNs) take multiple frames as input and leverage temporal information to enhance quality and temporal consistency, making them promising solutions for high-resolution imaging tasks such as video super-resolution (VSR) and video frame interpolation (VFI). Previous works have proposed CNN accelerators for single-image high-resolution imaging tasks, using layer-fusion (LF) workflows to reduce external memory access (EMA) for intermediate feature maps (FMs). However, V-CNNs demand more EMA and greater computational complexity, posing implementation challenges for edge devices. Additionally, using deformable convolution (DC) to break through the fixed shape of the kernel receptive field can improve image quality and temporal consistency, but it requires additional storage and computational logic. In this article, we present a memory-efficient V-CNN processor, VISTA. We introduce a cuboid-based LF (CBLF) workflow for V-CNNs that reuses temporal information from overlapped FMs at different time points, reducing EMA and computational complexity. Moreover, VISTA adopts a heterogeneous reuse-recomputing approach to handle overlaps between region-of-influence (ROI) pyramids and uses reference-frame-first scheduling (RFFS) to reduce memory usage during cross-frame alignment computations. Furthermore, we apply hardware-model co-design to devise tile-based offset-confined DC (TODC), which reduces computational logic and saves line-buffer usage for the search window at a cost of 0.06-0.18 dB of peak signal-to-noise ratio (PSNR) in image quality. The 12.6-mm² VISTA is fabricated in 40-nm CMOS technology and achieves peak throughputs of 4K-UHD at 60 and 50 frames/s for VSR and VFI applications, respectively. It reduces input EMA by 33%-53%, activation static random-access memory (SRAM) usage by 19%, and computational complexity by 19%-42%.
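The offset-confinement idea behind TODC can be illustrated in software: in deformable convolution, each kernel tap samples the input at a learned fractional offset, so an unconstrained offset range forces hardware to buffer a large search window. Clamping offsets to a small range bounds that window. The sketch below is a minimal, illustrative NumPy version of single-channel deformable convolution with clamped offsets (the function names, the `max_off` parameter, and the offset-tensor layout are our own assumptions, not the paper's exact tile-based TODC scheme).

```python
import numpy as np

def bilinear(x, y, xx):
    """Bilinearly sample x at fractional position (y, xx); out-of-bounds -> 0."""
    H, W = x.shape
    y0, x0 = int(np.floor(y)), int(np.floor(xx))
    wy, wx = y - y0, xx - x0
    val = 0.0
    for yy, w_y in ((y0, 1 - wy), (y0 + 1, wy)):
        for xc, w_x in ((x0, 1 - wx), (x0 + 1, wx)):
            if 0 <= yy < H and 0 <= xc < W:
                val += w_y * w_x * x[yy, xc]
    return val

def deformable_conv2d_confined(x, weight, offsets, max_off=2.0):
    """Single-channel deformable convolution with confined offsets.

    Clamping each learned offset to [-max_off, +max_off] keeps every
    sample within a fixed window around its kernel tap, so a hardware
    line buffer only needs roughly (K + 2*max_off) rows rather than an
    unbounded search range. Illustrative sketch only.
    """
    H, W = x.shape
    K = weight.shape[0]          # square K x K kernel
    pad = K // 2
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for ki in range(K):
                for kj in range(K):
                    # Confined offsets: clamp keeps the sample near its tap.
                    dy = np.clip(offsets[i, j, ki, kj, 0], -max_off, max_off)
                    dx = np.clip(offsets[i, j, ki, kj, 1], -max_off, max_off)
                    acc += weight[ki, kj] * bilinear(
                        x, i + ki - pad + dy, j + kj - pad + dx)
            out[i, j] = acc
    return out
```

With all offsets at zero this reduces to an ordinary zero-padded convolution, which is a convenient sanity check; the clamp is what turns an arbitrary sampling pattern into one with a hardware-friendly, bounded footprint.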
Journal Description:
The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits, with particular emphasis on transistor-level design of integrated circuits. It also covers topics such as circuit modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.