VISTA: A Memory-Efficient CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration

IF 5.6 | CAS Zone 1, Engineering & Technology | JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC | IEEE Journal of Solid-State Circuits | Pub Date: 2025-03-12 | DOI: 10.1109/JSSC.2025.3547982
Kai-Ping Lin, Jia-Han Liu, Hong-Chuan Liao, Jyun-Yi Wu, Tong Wu, Chao-Tsung Huang
IEEE Journal of Solid-State Circuits, vol. 60, no. 9, pp. 3416–3427 (IEEE Xplore: https://ieeexplore.ieee.org/document/10924713/)
Citations: 0

Abstract

Video convolutional neural networks (V-CNNs) take multiple frames as input and leverage temporal information to enhance quality and temporal consistency, making them promising solutions for high-resolution imaging tasks such as video super-resolution (VSR) and video frame interpolation (VFI). Previous works have proposed CNN accelerators for single-image high-resolution imaging tasks, using layer-fusion (LF) workflows to reduce the external memory access (EMA) required for intermediate feature maps (FMs). However, V-CNNs demand more EMA and higher computational complexity, posing implementation challenges for edge devices. Additionally, deformable convolution (DC), which breaks through the fixed shape of the kernel receptive field, can improve image quality and temporal consistency but requires additional storage and computational logic. In this article, we present VISTA, a memory-efficient V-CNN processor. We introduce a cuboid-based LF (CBLF) workflow for V-CNNs that reuses temporal information from overlapped FMs at different time points, reducing EMA and computational complexity. Moreover, VISTA adopts a heterogeneous reuse-recomputing approach to handle overlaps between region-of-influence (ROI) pyramids and uses reference-frame-first scheduling (RFFS) to avoid extensive memory usage during cross-frame alignment computations. Furthermore, we apply hardware-model co-design to devise tile-based offset-confined DC (TODC), which reduces computational logic and saves line buffer usage for the search window at the cost of a 0.06–0.18 dB drop in peak signal-to-noise ratio (PSNR). The 12.6-mm² VISTA is fabricated in 40-nm CMOS technology and achieves peak throughputs of 4K-UHD 60 and 50 frames/s for VSR and VFI applications, respectively. It reduces input EMA by 33%–53%, activation static random-access memory (SRAM) by 19%, and computational complexity by 19%–42%.
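
To make the abstract's layer-fusion terms concrete, the following Python sketch is a hypothetical toy cost model, not the VISTA architecture. It assumes stride-1 k×k convolutions with same padding, square interior tiles, equal channel counts across layers, and a V-CNN that consumes a sliding window of input frames; the function names `roi_sizes`, `recompute_overhead`, and `temporal_feature_cost` are illustrative only. It shows how an ROI pyramid grows toward the input, how much extra work recomputing the overlaps between neighbouring pyramids costs, and how reusing features of frames shared by consecutive windows cuts per-frame work.

```python
"""Toy cost model for tile-based layer fusion (LF) on video.

Hypothetical assumptions, not taken from the VISTA paper:
  * every fused layer is a stride-1 k x k convolution with 'same' padding;
  * tiles are square and interior (no image-border effects);
  * all layers have the same channel count, so output pixels computed
    per layer are proportional to MACs;
  * the V-CNN consumes a sliding window of consecutive input frames.
"""


def roi_sizes(tile: int, num_layers: int, k: int = 3) -> list[int]:
    """Side lengths of the region-of-influence (ROI) pyramid for one output tile.

    Entry 0 is the input region; entry num_layers is the output tile itself.
    Moving one layer toward the input adds (k - 1) to the side length.
    """
    if tile < 1 or num_layers < 1 or k < 1:
        raise ValueError("tile, num_layers and k must be positive")
    return [tile + (num_layers - layer) * (k - 1) for layer in range(num_layers + 1)]


def recompute_overhead(tile: int, num_layers: int, k: int = 3) -> float:
    """Relative extra work when overlaps between neighbouring ROI pyramids are recomputed.

    Each fused layer produces its full ROI for the tile (entries 1..num_layers),
    compared with the ideal cost of computing each layer's output exactly once.
    """
    sizes = roi_sizes(tile, num_layers, k)
    computed = sum(side * side for side in sizes[1:])
    ideal = num_layers * tile * tile
    return computed / ideal - 1.0


def temporal_feature_cost(window: int, num_outputs: int, reuse: bool) -> int:
    """Number of per-frame feature extractions for a sliding temporal window.

    Without reuse, each output frame re-extracts features for every frame in
    its window. With reuse, features of frames shared by consecutive windows
    are kept, so after the first window only one new frame is processed.
    """
    if window < 1 or num_outputs < 1:
        raise ValueError("window and num_outputs must be positive")
    if reuse:
        return window + (num_outputs - 1)
    return window * num_outputs


if __name__ == "__main__":
    for tile in (16, 32, 64):
        print(f"tile {tile}: ROI {roi_sizes(tile, 4)}, "
              f"recompute overhead {recompute_overhead(tile, 4):.1%}")
    print("frame extractions, no reuse:", temporal_feature_cost(3, 10, reuse=False))
    print("frame extractions, reuse:   ", temporal_feature_cost(3, 10, reuse=True))
```

With four fused 3×3 layers, this toy model gives about 20% extra work for recomputing overlaps with 32×32 tiles and under 10% with 64×64 tiles, while larger tiles need larger on-chip buffers; that tension is one reason a design might mix reuse and recomputation. These figures come from the toy model only and are not VISTA's measured results.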

Source journal: IEEE Journal of Solid-State Circuits (Engineering & Technology: Engineering, Electrical & Electronic)
CiteScore: 11.00
Self-citation rate: 20.40%
Articles published: 351
Review time: 3–6 weeks
About the journal: The IEEE Journal of Solid-State Circuits publishes papers each month in the broad area of solid-state circuits with particular emphasis on transistor-level design of integrated circuits. It also provides coverage of topics such as circuits modeling, technology, systems design, layout, and testing that relate directly to IC design. Integrated circuits and VLSI are of principal interest; material related to discrete circuit design is seldom published. Experimental verification is strongly encouraged.