Object-Centric Diffusion for Efficient Video Editing

arXiv preprint, published 2024-01-11. DOI: 10.48550/arXiv.2401.05735
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, F. Porikli, Yuki M. Asano, A. Habibian

Abstract

Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, whether through diffusion inversion, cross-frame attention, or both. In this paper, we analyze these inefficiencies and suggest simple yet effective modifications that allow significant speed-ups while maintaining quality. Moreover, we introduce Object-Centric Diffusion (OCD) to further reduce latency by allocating computation toward foreground edited regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient regions from those spent on the background, allocating most of the model capacity to the former, and ii) Object-Centric 3D Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions. Both techniques are readily applicable to a given video editing model without retraining, and can drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines, and show a latency reduction of up to 10x for comparable synthesis quality.
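The abstract does not give implementation details, but the core idea behind Object-Centric 3D Token Merging — fuse redundant tokens only in the background while leaving foreground (edited-object) tokens untouched — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the greedy pairwise merging by cosine similarity, the `merge_ratio` parameter, and the function name `object_centric_merge` are all assumptions for the example.

```python
import numpy as np

def object_centric_merge(tokens, fg_mask, merge_ratio=0.5):
    """Illustrative sketch: fuse redundant *background* tokens while
    keeping foreground (edited-object) tokens untouched.

    tokens:      (N, D) array of token features.
    fg_mask:     (N,) boolean array, True for foreground tokens.
    merge_ratio: fraction of background tokens to remove by merging.
    Returns a reduced (N', D) token array with N' <= N.
    """
    fg = tokens[fg_mask]
    bg = tokens[~fg_mask]
    n_merge = int(len(bg) * merge_ratio)
    if n_merge == 0 or len(bg) < 2:
        return np.concatenate([fg, bg], axis=0)

    # Pair each background token with its most similar other background
    # token (cosine similarity), then fuse the most redundant pairs.
    norm = bg / (np.linalg.norm(bg, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T
    np.fill_diagonal(sim, -np.inf)
    partner = sim.argmax(axis=1)
    order = np.argsort(-sim.max(axis=1))  # most redundant pairs first

    merged = []
    used = np.zeros(len(bg), dtype=bool)
    removed = 0
    for i in order:
        if removed >= n_merge:
            break
        j = partner[i]
        if used[i] or used[j]:
            continue
        merged.append((bg[i] + bg[j]) / 2)  # fuse the pair by averaging
        used[i] = used[j] = True
        removed += 1

    parts = [fg, bg[~used]] + ([np.stack(merged)] if merged else [])
    return np.concatenate(parts, axis=0)
```

Under this sketch, cross-frame attention would then operate on the shorter token sequence, which is where the cost saving comes from; foreground tokens pass through unchanged, so the edited object keeps its full representation.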