Cut-and-Paste: Subject-driven video editing with attention control

IF 6.0 | CAS Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Neural Networks | Pub Date: 2024-10-19 | DOI: 10.1016/j.neunet.2024.106818
Zhichao Zuo, Zhao Zhang, Yan Luo, Yang Zhao, Haijun Zhang, Yi Yang, Meng Wang
{"title":"Cut-and-Paste: Subject-driven video editing with attention control","authors":"Zhichao Zuo ,&nbsp;Zhao Zhang ,&nbsp;Yan Luo ,&nbsp;Yang Zhao ,&nbsp;Haijun Zhang ,&nbsp;Yi Yang ,&nbsp;Meng Wang","doi":"10.1016/j.neunet.2024.106818","DOIUrl":null,"url":null,"abstract":"<div><div>This paper presents a novel framework termed Cut-and-Paste for real-word semantic video editing under the guidance of text prompt and additional reference image. While the text-driven video editing has demonstrated remarkable ability to generate highly diverse videos following given text prompts, the fine-grained semantic edits are hard to control by plain textual prompt only in terms of object details and edited region, and cumbersome long text descriptions are usually needed for the task. We therefore investigate subject-driven video editing for more precise control of both edited regions and background preservation, and fine-grained semantic generation. We achieve this goal by introducing an reference image as supplementary input to the text-driven video editing, which avoids racking your brain to come up with a cumbersome text prompt describing the detailed appearance of the object. To limit the editing area, we refer to a method of cross attention control in image editing and successfully extend it to video editing by fusing the attention map of adjacent frames, which strikes a balance between maintaining video background and spatio-temporal consistency. Compared with current methods, the whole process of our method is like “cut” the source object to be edited and then “paste” the target object provided by reference image. We demonstrate that our method performs favorably over prior arts for video editing under the guidance of text prompt and extra reference image, as measured by both quantitative and subjective evaluations.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"181 ","pages":"Article 106818"},"PeriodicalIF":6.0000,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608024007421","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

This paper presents a novel framework termed Cut-and-Paste for real-world semantic video editing under the guidance of a text prompt and an additional reference image. While text-driven video editing has demonstrated a remarkable ability to generate highly diverse videos following given text prompts, fine-grained semantic edits are hard to control with a plain textual prompt alone in terms of object details and the edited region, and cumbersome long text descriptions are usually needed for the task. We therefore investigate subject-driven video editing for more precise control of both the edited regions and background preservation, as well as fine-grained semantic generation. We achieve this goal by introducing a reference image as a supplementary input to text-driven video editing, which avoids the need to devise a cumbersome text prompt describing the detailed appearance of the object. To limit the editing area, we draw on a cross-attention control method from image editing and extend it to video editing by fusing the attention maps of adjacent frames, which strikes a balance between preserving the video background and maintaining spatio-temporal consistency. Compared with current methods, the whole process of our method is akin to "cutting" the source object to be edited and then "pasting" the target object provided by the reference image. We demonstrate that our method performs favorably over prior art for video editing under the guidance of a text prompt and an extra reference image, as measured by both quantitative and subjective evaluations.
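To make the attention-fusion idea concrete, below is a minimal sketch (in PyTorch) of how cross-attention maps for the edited object's token might be fused across adjacent frames and then used to gate which latent regions are overwritten during denoising. This is not the authors' implementation: the fusion rule (a temporal average), the binarization threshold, and all function and tensor names (`fuse_adjacent_attention`, `blend_latents`, `window`, `threshold`) are illustrative assumptions.

```python
# A minimal sketch (not the paper's code) of temporal attention-map fusion:
# cross-attention maps for the edited object's token are smoothed across
# adjacent frames, then thresholded into a mask that "pastes" edited latents
# inside the object region while preserving source latents (the background).
import torch


def fuse_adjacent_attention(attn_maps: torch.Tensor, window: int = 1) -> torch.Tensor:
    """Fuse each frame's attention map with its temporal neighbours.

    attn_maps: (T, H, W) cross-attention weights for the object token,
               one map per video frame, values in [0, 1].
    window:    number of neighbouring frames fused on each side.
    Returns a (T, H, W) tensor of temporally smoothed maps.
    """
    T = attn_maps.shape[0]
    fused = torch.empty_like(attn_maps)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        # Simple average over the temporal window; the paper may use a
        # different fusion rule -- this choice is an assumption.
        fused[t] = attn_maps[lo:hi].mean(dim=0)
    return fused


def blend_latents(src_latents: torch.Tensor,
                  edit_latents: torch.Tensor,
                  attn_maps: torch.Tensor,
                  threshold: float = 0.3) -> torch.Tensor:
    """Keep source latents outside the fused attention mask (background
    preservation) and take edited latents inside it (the "paste" region)."""
    mask = (fuse_adjacent_attention(attn_maps) > threshold).float()
    mask = mask.unsqueeze(1)  # (T, 1, H, W), broadcasts over latent channels
    return mask * edit_latents + (1 - mask) * src_latents
```

For example, with `attn_maps` of shape (16, 64, 64) from a 16-frame clip and latents of shape (16, 4, 64, 64), `blend_latents` returns latents that retain the source background wherever the fused attention falls below the threshold, while the temporal averaging keeps the edit region from flickering between frames.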
Source journal: Neural Networks (Engineering & Technology — Computer Science: Artificial Intelligence)

CiteScore: 13.90
Self-citation rate: 7.70%
Articles published per year: 425
Review time: 67 days
About the journal: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.
Latest articles in this journal:
Multi-source Selective Graph Domain Adaptation Network for cross-subject EEG emotion recognition.
Spectral integrated neural networks (SINNs) for solving forward and inverse dynamic problems.
Corrigendum to "Multi-view Graph Pooling with Coarsened Graph Disentanglement" [Neural Networks 174 (2024) 1-10/106221].
Multi-compartment neuron and population encoding powered spiking neural network for deep distributional reinforcement learning.
Multiscroll Hopfield neural network with extreme multistability and its application in video encryption for IIoT.