Bidirectional Error-Aware Fusion Network for Video Inpainting
Jiacheng Hou; Zhong Ji; Jinyu Yang; Feng Zheng
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 577-588
DOI: 10.1109/TCSVT.2024.3454641
Published: 2024-09-05 | Journal Article | JCR: Q1 (Engineering, Electrical & Electronic) | Impact Factor: 11.1
URL: https://ieeexplore.ieee.org/document/10666730/
Citations: 0
Abstract
Existing video inpainting approaches tend to adopt vision transformers with few customized designs, which poses two limitations. First, the conventional self-attention mechanism treats tokens from invalid and valid regions equally and mingles them, which may incur blurriness. Second, these approaches employ only forward frames as references, ignoring past inpainted frames, which are also valuable for enhancing temporal consistency and providing additional information. In this paper, we propose a new video inpainting network, called Bidirectional Error-Aware Fusion Network (BEAF-Net). Concretely, on one hand, we propose a tailored Error-Aware Transformer (EAT) that discerns different tokens by assigning dynamic weights to curb the use of erroneous tokens. Meanwhile, each EAT is equipped with a Spatial Feature Enhancement (SFE) layer to synthesize multi-scale features. On the other hand, we apply a pair of EATs to utilize forward reference frames and past inpainted frames simultaneously, and a proposed Bidirectional Fusion (BiF) layer adaptively blends the two aggregation results. By coupling these novel designs, our proposed BEAF-Net fully leverages location priors, multi-scale perception, and past predictions to produce more faithful and consistent inpainting results. We corroborate BEAF-Net on two commonly used video inpainting datasets, DAVIS and YouTube-VOS, where the experimental results demonstrate that BEAF-Net compares favorably with state-of-the-art solutions. Video examples can be found at https://github.com/JCATCV/BEAF-Net.
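The abstract describes two mechanisms: attention that down-weights tokens from erroneous (hole) regions, and a fusion layer that blends the forward-reference aggregation with the past-inpainted one. The sketch below illustrates both ideas in NumPy under loose assumptions; the `validity` weights and the fusion `gate` are hypothetical stand-ins for quantities the actual EAT and BiF layers would learn, and the real architecture is not specified in the abstract.

```python
import numpy as np

def error_aware_attention(q, k, v, validity):
    """Sketch of error-aware attention: per-key weights in [0, 1]
    suppress keys from invalid regions. `validity` is a hypothetical
    stand-in for the paper's learned dynamic weights."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                # (Nq, Nk) scaled dot-product
    scores = scores + np.log(validity + 1e-8)    # near-zero validity -> large negative bias
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)        # softmax over keys
    return w @ v                                 # weighted sum of values

def bidirectional_fusion(fwd, bwd, gate):
    """Sketch of a BiF-style blend: an element-wise gate mixes the
    forward-reference aggregation with the past-inpainted one."""
    return gate * fwd + (1.0 - gate) * bwd
```

With `validity` near zero for a key, its attention weight collapses toward zero, so the output is dominated by valid-region tokens; the gate in the fusion step plays the analogous role between the two temporal directions.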
Journal Introduction
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.