{"title":"RVFIT: Real-time Video Frame Interpolation Transformer","authors":"Linlin Ou, Yuanping Chen","doi":"10.1117/12.2669055","DOIUrl":null,"url":null,"abstract":"Video frame interpolation (VFI), which aims to synthesize predictive frames from bidirectional historical references, has made remarkable progress with the development of deep convolutional neural networks (CNNs) over the past years. Existing CNNs generally face challenges in handing large motions due to the locality of convolution operations, resulting in a slow inference structure. We introduce a Real-time video frame interpolation transformer (RVFIT), a novel framework to overcome this limitation. Unlike traditional methods based on CNNs, this paper does not process video frames separately with different network modules in the spatial domain but batches adjacent frames through a single UNet-style structure end-to-end Transformer network architecture. Moreover, this paper creatively sets up two-stage interpolation sampling before and after the end-to-end network to maximize the performance of the traditional CV algorithm. The experimental results show that compared with SOTA TMNet, RVFIT has only 50% of the network size (6.2M vs 12.3M, parameters) while ensuring comparable performance, and the speed is increased by 80% (26.1 fps vs 14.3 fps, frame size is 720*576).","PeriodicalId":236099,"journal":{"name":"International Workshop on Frontiers of Graphics and Image Processing","volume":"211 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Workshop on Frontiers of Graphics and Image Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1117/12.2669055","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Video frame interpolation (VFI), which aims to synthesize intermediate frames from bidirectional reference frames, has made remarkable progress with the development of deep convolutional neural networks (CNNs) over the past years. Existing CNN-based methods generally struggle with large motions because of the locality of convolution operations, and their multi-module pipelines lead to slow inference. We introduce the Real-time Video Frame Interpolation Transformer (RVFIT), a novel framework that overcomes this limitation. Unlike traditional CNN-based methods, which process video frames separately with different network modules in the spatial domain, RVFIT batches adjacent frames through a single end-to-end, UNet-style Transformer architecture. Moreover, we introduce two-stage interpolation sampling before and after the end-to-end network to exploit the strengths of classical computer-vision resampling. Experimental results show that, compared with the state-of-the-art TMNet, RVFIT achieves comparable quality with only about 50% of the network size (6.2M vs. 12.3M parameters) and roughly 80% faster inference (26.1 fps vs. 14.3 fps at a frame size of 720×576).
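
To make the two core ideas of the abstract concrete, the sketch below (PyTorch) shows adjacent frames batched along the channel axis and passed through a single UNet-style encoder-decoder with a Transformer bottleneck, wrapped by classical resampling before and after the network. All module names, channel sizes, and the bilinear pre-/post-sampling are illustrative assumptions for exposition, not the authors' actual RVFIT implementation.

# Minimal sketch, assuming a channel-wise batching of two input frames,
# a small UNet-style Transformer backbone, and bilinear resampling as the
# "two-stage interpolation sampling" around the network. Hypothetical names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerBlock(nn.Module):
    """Self-attention over flattened spatial tokens (assumed block design)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):                                  # x: (B, N, C)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global spatial attention
        return x + self.mlp(self.norm2(x))

class UNetStyleTransformer(nn.Module):
    """Encoder-decoder ('UNet-style') backbone with a Transformer bottleneck."""
    def __init__(self, in_ch=6, base=32):
        super().__init__()
        self.enc1 = nn.Conv2d(in_ch, base, 3, stride=2, padding=1)       # 1/2 resolution
        self.enc2 = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)    # 1/4 resolution
        self.bottleneck = TransformerBlock(base * 2)
        self.dec2 = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)
        self.dec1 = nn.ConvTranspose2d(base * 2, 3, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = F.gelu(self.enc1(x))
        e2 = F.gelu(self.enc2(e1))
        b, c, h, w = e2.shape
        tokens = e2.flatten(2).transpose(1, 2)             # (B, H*W, C) token sequence
        tokens = self.bottleneck(tokens)
        e2 = tokens.transpose(1, 2).reshape(b, c, h, w)
        d2 = F.gelu(self.dec2(e2))
        return self.dec1(torch.cat([d2, e1], dim=1))       # skip connection, RGB output

def interpolate_middle_frame(frame0, frame1, model, work_size=(288, 360)):
    """Pre-sample -> end-to-end network -> post-sample back to the source size."""
    _, _, H, W = frame0.shape
    # Stage 1: classical down-sampling before the network (illustrative choice).
    x = torch.cat([frame0, frame1], dim=1)                 # batch adjacent frames on channels
    x = F.interpolate(x, size=work_size, mode="bilinear", align_corners=False)
    mid = model(x)
    # Stage 2: classical up-sampling back to the original resolution.
    return F.interpolate(mid, size=(H, W), mode="bilinear", align_corners=False)

if __name__ == "__main__":
    model = UNetStyleTransformer()
    f0 = torch.rand(1, 3, 576, 720)                        # previous frame (720x576)
    f1 = torch.rand(1, 3, 576, 720)                        # next frame
    print(interpolate_middle_frame(f0, f1, model).shape)   # torch.Size([1, 3, 576, 720])

Because the two frames enter the network together as a single six-channel tensor, one forward pass produces the intermediate frame, which is the property the abstract credits for the real-time throughput.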