MSEConv：用于视频帧插值的统一经编框架

IF 1.8 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2024-02-14 DOI:10.1145/3648364

Xiangling Ding, Pu Huang, Dengyong Zhang, Wei Liang, Feng Li, Gaobo Yang, Xin Liao, Yue Li

{"title":"MSEConv：用于视频帧插值的统一经编框架","authors":"Xiangling Ding, Pu Huang, Dengyong Zhang, Wei Liang, Feng Li, Gaobo Yang, Xin Liao, Yue Li","doi":"10.1145/3648364","DOIUrl":null,"url":null,"abstract":"<p>Within the context of video frame interpolation, complex motion modeling is the task of capturing, in a video sequence, where the moving objects are located in the interpolated frame, and how to maintain the temporal consistency of motion. Existing video frame interpolation methods typically assign either a fixed size of the motion kernel or a refined optical flow to model complex motions. However, they have the limitation of data redundancy and inaccuracy representation of motion. This paper introduces a unified warping framework, named multi-scale expandable deformable convolution (MSEConv), for simultaneously performing complex motion modeling and frame interpolation. In the proposed framework, a deep fully convolutional neural network with global attention is proposed to estimate multiple small-scale kernel weights with different expansion degrees and adaptive weight allocation for each pixel synthesis. Moreover, most of the kernel-based interpolation methods can be treated as the special case of the proposed MSEConv, thus, MSEConv can be easily transferred to other kernel-based frame interpolation methods for performance improvement. To further improve the robustness of motion occlusions, an operation of mask occlusion is introduced. As a consequence, our proposed MSEConv shows strong performance on par or even better than the state-of-the-art kernel-based frame interpolation works on public datasets. Our source code and visual comparable results are available at https://github.com/Pumpkin123709/MSEConv.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MSEConv: A Unified Warping Framework for Video Frame Interpolation\",\"authors\":\"Xiangling Ding, Pu Huang, Dengyong Zhang, Wei Liang, Feng Li, Gaobo Yang, Xin Liao, Yue Li\",\"doi\":\"10.1145/3648364\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Within the context of video frame interpolation, complex motion modeling is the task of capturing, in a video sequence, where the moving objects are located in the interpolated frame, and how to maintain the temporal consistency of motion. Existing video frame interpolation methods typically assign either a fixed size of the motion kernel or a refined optical flow to model complex motions. However, they have the limitation of data redundancy and inaccuracy representation of motion. This paper introduces a unified warping framework, named multi-scale expandable deformable convolution (MSEConv), for simultaneously performing complex motion modeling and frame interpolation. In the proposed framework, a deep fully convolutional neural network with global attention is proposed to estimate multiple small-scale kernel weights with different expansion degrees and adaptive weight allocation for each pixel synthesis. Moreover, most of the kernel-based interpolation methods can be treated as the special case of the proposed MSEConv, thus, MSEConv can be easily transferred to other kernel-based frame interpolation methods for performance improvement. To further improve the robustness of motion occlusions, an operation of mask occlusion is introduced. As a consequence, our proposed MSEConv shows strong performance on par or even better than the state-of-the-art kernel-based frame interpolation works on public datasets. Our source code and visual comparable results are available at https://github.com/Pumpkin123709/MSEConv.</p>\",\"PeriodicalId\":54312,\"journal\":{\"name\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-02-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3648364\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3648364","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

在视频帧插值中，复杂运动建模的任务是在视频序列中捕捉运动物体在插值帧中的位置，以及如何保持运动的时间一致性。现有的视频帧插值方法通常采用固定大小的运动核或精细光流来建立复杂运动模型。然而，这些方法都存在数据冗余和运动表示不准确的局限性。本文介绍了一种统一的扭曲框架，名为多尺度可扩展变形卷积（MSEConv），可同时执行复杂运动建模和帧插值。在该框架中，提出了一种具有全局注意力的深度全卷积神经网络，用于估计具有不同扩展度的多个小尺度内核权重，并为每个像素合成进行自适应权重分配。此外，大多数基于内核的插值方法都可以被视为 MSEConv 的特例，因此 MSEConv 可以很容易地移植到其他基于内核的帧插值方法中以提高性能。为了进一步提高运动遮挡的鲁棒性，我们引入了遮挡操作。因此，我们提出的 MSEConv 在公共数据集上显示出与最先进的基于内核的帧插值方法相当甚至更好的性能。我们的源代码和可视化比较结果可在 https://github.com/Pumpkin123709/MSEConv 上获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

MSEConv: A Unified Warping Framework for Video Frame Interpolation

Within the context of video frame interpolation, complex motion modeling is the task of capturing, in a video sequence, where the moving objects are located in the interpolated frame, and how to maintain the temporal consistency of motion. Existing video frame interpolation methods typically assign either a fixed size of the motion kernel or a refined optical flow to model complex motions. However, they have the limitation of data redundancy and inaccuracy representation of motion. This paper introduces a unified warping framework, named multi-scale expandable deformable convolution (MSEConv), for simultaneously performing complex motion modeling and frame interpolation. In the proposed framework, a deep fully convolutional neural network with global attention is proposed to estimate multiple small-scale kernel weights with different expansion degrees and adaptive weight allocation for each pixel synthesis. Moreover, most of the kernel-based interpolation methods can be treated as the special case of the proposed MSEConv, thus, MSEConv can be easily transferred to other kernel-based frame interpolation methods for performance improvement. To further improve the robustness of motion occlusions, an operation of mask occlusion is introduced. As a consequence, our proposed MSEConv shows strong performance on par or even better than the state-of-the-art kernel-based frame interpolation works on public datasets. Our source code and visual comparable results are available at https://github.com/Pumpkin123709/MSEConv.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ACM Transactions on Asian and Low-Resource Language Information Processing Computer Science-General Computer Science

CiteScore

3.60

自引率

15.00%

发文量

241

期刊介绍： The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to: -Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc. -Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc. -Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition. -Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc. -Machine Translation involving Asian or low-resource languages. -Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc. -Information Extraction and Filtering: including automatic abstraction, user profiling, etc. -Speech processing: including text-to-speech synthesis and automatic speech recognition. -Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc. -Cross-lingual information processing involving Asian or low-resource languages. -Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.