Object-Centric Diffusion for Efficient Video Editing

arXiv preprint, published 2024-01-11. DOI: 10.48550/arXiv.2401.05735
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, F. Porikli, Yuki M. Asano, A. Habibian

Abstract

Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, whether through diffusion inversion, cross-frame attention, or both. In this paper, we analyze these inefficiencies and suggest simple yet effective modifications that allow significant speed-ups while maintaining quality. Moreover, we introduce Object-Centric Diffusion (OCD) to further reduce latency by allocating computation toward foreground edited regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient regions from those spent on the background, allocating most of the model capacity to the former, and ii) Object-Centric 3D Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x for comparable synthesis quality.
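The abstract does not give implementation details, but the core idea behind Object-Centric 3D Token Merging — fuse redundant tokens only in the background while leaving foreground (edited-object) tokens untouched — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the greedy pairwise merging by cosine similarity, the `merge_ratio` parameter, and the function name `object_centric_merge` are all assumptions for the example.

```python
import numpy as np

def object_centric_merge(tokens, fg_mask, merge_ratio=0.5):
    """Illustrative sketch: fuse redundant *background* tokens while
    keeping foreground (edited-object) tokens untouched.

    tokens:      (N, D) array of token features.
    fg_mask:     (N,) boolean array, True for foreground tokens.
    merge_ratio: fraction of background tokens to remove by merging.
    Returns a reduced (N', D) token array with N' <= N.
    """
    fg = tokens[fg_mask]
    bg = tokens[~fg_mask]
    n_merge = int(len(bg) * merge_ratio)
    if n_merge == 0 or len(bg) < 2:
        return np.concatenate([fg, bg], axis=0)

    # Pair each background token with its most similar other background
    # token (cosine similarity), then fuse the most redundant pairs.
    norm = bg / (np.linalg.norm(bg, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)
    partner = sim.argmax(axis=1)
    order = np.argsort(-sim.max(axis=1))  # most redundant pairs first

    merged = []
    used = np.zeros(len(bg), dtype=bool)
    removed = 0
    for i in order:
        if removed >= n_merge:
            break
        j = partner[i]
        if used[i] or used[j]:
            continue
        merged.append((bg[i] + bg[j]) / 2)  # fuse the pair by averaging
        used[i] = used[j] = True
        removed += 1

    parts = [fg, bg[~used]] + ([np.stack(merged)] if merged else [])
    return np.concatenate(parts, axis=0)
```

Under this sketch, cross-frame attention would then operate on the shorter token sequence, which is where the cost saving comes from; foreground tokens pass through unchanged, so the edited object keeps its full representation.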