Cut-and-Paste: Subject-driven video editing with attention control

IF 6.0 | CAS Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Neural Networks | Pub Date: 2024-10-19 | DOI: 10.1016/j.neunet.2024.106818
Zhichao Zuo, Zhao Zhang, Yan Luo, Yang Zhao, Haijun Zhang, Yi Yang, Meng Wang
{"title":"Cut-and-Paste: Subject-driven video editing with attention control","authors":"Zhichao Zuo ,&nbsp;Zhao Zhang ,&nbsp;Yan Luo ,&nbsp;Yang Zhao ,&nbsp;Haijun Zhang ,&nbsp;Yi Yang ,&nbsp;Meng Wang","doi":"10.1016/j.neunet.2024.106818","DOIUrl":null,"url":null,"abstract":"<div><div>This paper presents a novel framework termed Cut-and-Paste for real-word semantic video editing under the guidance of text prompt and additional reference image. While the text-driven video editing has demonstrated remarkable ability to generate highly diverse videos following given text prompts, the fine-grained semantic edits are hard to control by plain textual prompt only in terms of object details and edited region, and cumbersome long text descriptions are usually needed for the task. We therefore investigate subject-driven video editing for more precise control of both edited regions and background preservation, and fine-grained semantic generation. We achieve this goal by introducing an reference image as supplementary input to the text-driven video editing, which avoids racking your brain to come up with a cumbersome text prompt describing the detailed appearance of the object. To limit the editing area, we refer to a method of cross attention control in image editing and successfully extend it to video editing by fusing the attention map of adjacent frames, which strikes a balance between maintaining video background and spatio-temporal consistency. Compared with current methods, the whole process of our method is like “cut” the source object to be edited and then “paste” the target object provided by reference image. We demonstrate that our method performs favorably over prior arts for video editing under the guidance of text prompt and extra reference image, as measured by both quantitative and subjective evaluations.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"181 ","pages":"Article 106818"},"PeriodicalIF":6.0000,"publicationDate":"2024-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608024007421","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

This paper presents a novel framework termed Cut-and-Paste for real-world semantic video editing under the guidance of a text prompt and an additional reference image. While text-driven video editing has demonstrated a remarkable ability to generate highly diverse videos following given text prompts, fine-grained semantic edits are hard to control with a plain textual prompt alone in terms of object details and the edited region, and cumbersome long text descriptions are usually needed for the task. We therefore investigate subject-driven video editing for more precise control of both the edited regions and background preservation, as well as fine-grained semantic generation. We achieve this goal by introducing a reference image as a supplementary input to text-driven video editing, which avoids the need to devise a cumbersome text prompt describing the detailed appearance of the object. To limit the editing area, we draw on a cross-attention control method from image editing and extend it to video editing by fusing the attention maps of adjacent frames, which strikes a balance between preserving the video background and maintaining spatio-temporal consistency. Compared with current methods, the whole process of our method is akin to "cutting" the source object to be edited and then "pasting" the target object provided by the reference image. We demonstrate that our method performs favorably over prior art for video editing under the guidance of a text prompt and an extra reference image, as measured by both quantitative and subjective evaluations.
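To make the attention-fusion idea concrete, below is a minimal sketch (in PyTorch) of how cross-attention maps for the edited object's token might be fused across adjacent frames and then used to gate which latent regions are overwritten during denoising. This is not the authors' implementation: the fusion rule (a temporal average), the binarization threshold, and all function and tensor names (`fuse_adjacent_attention`, `blend_latents`, `window`, `threshold`) are illustrative assumptions.

```python
# A minimal sketch (not the paper's code) of temporal attention-map fusion:
# cross-attention maps for the edited object's token are smoothed across
# adjacent frames, then thresholded into a mask that "pastes" edited latents
# inside the object region while preserving source latents (the background).
import torch


def fuse_adjacent_attention(attn_maps: torch.Tensor, window: int = 1) -> torch.Tensor:
    """Fuse each frame's attention map with its temporal neighbours.

    attn_maps: (T, H, W) cross-attention weights for the object token,
               one map per video frame, values in [0, 1].
    window:    number of neighbouring frames fused on each side.
    Returns a (T, H, W) tensor of temporally smoothed maps.
    """
    T = attn_maps.shape[0]
    fused = torch.empty_like(attn_maps)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        # Simple average over the temporal window; the paper may use a
        # different fusion rule -- this choice is an assumption.
        fused[t] = attn_maps[lo:hi].mean(dim=0)
    return fused


def blend_latents(src_latents: torch.Tensor,
                  edit_latents: torch.Tensor,
                  attn_maps: torch.Tensor,
                  threshold: float = 0.3) -> torch.Tensor:
    """Keep source latents outside the fused attention mask (background
    preservation) and take edited latents inside it (the "paste" region)."""
    mask = (fuse_adjacent_attention(attn_maps) > threshold).float()
    mask = mask.unsqueeze(1)  # (T, 1, H, W), broadcasts over latent channels
    return mask * edit_latents + (1 - mask) * src_latents
```

For example, with `attn_maps` of shape (16, 64, 64) from a 16-frame clip and latents of shape (16, 4, 64, 64), `blend_latents` returns latents that retain the source background wherever the fused attention falls below the threshold, while the temporal averaging keeps the edit region from flickering between frames.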
Source journal: Neural Networks (Engineering & Technology — Computer Science: Artificial Intelligence)

CiteScore: 13.90
Self-citation rate: 7.70%
Articles published per year: 425
Review time: 67 days
About the journal: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.
Latest articles in this journal:
Multi-source Selective Graph Domain Adaptation Network for cross-subject EEG emotion recognition.
Spectral integrated neural networks (SINNs) for solving forward and inverse dynamic problems.
Corrigendum to "Multi-view Graph Pooling with Coarsened Graph Disentanglement" [Neural Networks 174 (2024) 1-10/106221].
Multi-compartment neuron and population encoding powered spiking neural network for deep distributional reinforcement learning.
Multiscroll Hopfield neural network with extreme multistability and its application in video encryption for IIoT.