Object-Centric Diffusion for Efficient Video Editing
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian
arXiv:2401.05735 (arXiv - CS - Computer Vision and Pattern Recognition), January 11, 2024
Abstract
Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, whether through diffusion inversion, cross-frame attention, or both. In this paper, we analyze these inefficiencies and suggest simple yet effective modifications that allow significant speed-ups while maintaining quality. Moreover, we introduce Object-Centric Diffusion (OCD), which further reduces latency by allocating more computation to the edited foreground regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient regions from those spent on the background, allocating most of the model capacity to the former (see the first sketch below), and ii) Object-Centric 3D Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions (see the second sketch below). Both techniques can be applied to a given video editing model without retraining and drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines and show latency reductions of up to 10x at comparable synthesis quality.
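
The first proposal, Object-Centric Sampling, can be pictured as running the denoising loop on two different schedules, fine for the foreground and coarse for the background, then recomposing the result. The sketch below is a hypothetical illustration, not the paper's implementation: denoise_step, the step counts, and the linear schedule are all assumptions standing in for one step of an off-the-shelf diffusion editing model.

import torch

def object_centric_sampling(latent, fg_mask, denoise_step,
                            num_fg_steps=50, num_bg_steps=10):
    # latent:  (C, H, W) noisy latent of one frame
    # fg_mask: (1, H, W) binary mask, 1 = edited foreground region
    # denoise_step(latent, t) -> less-noisy latent; hypothetical stand-in
    # for a single step of the underlying editing model's sampler
    bg_mask = 1.0 - fg_mask

    # Background: few denoising steps on a coarse schedule.
    bg = latent.clone()
    for t in torch.linspace(1.0, 0.0, num_bg_steps):
        bg = denoise_step(bg, t)

    # Foreground: many denoising steps, where quality matters most.
    fg = latent.clone()
    for t in torch.linspace(1.0, 0.0, num_fg_steps):
        fg = denoise_step(fg, t)

    # Recompose: finely denoised foreground over a cheap background.
    return fg * fg_mask + bg * bg_mask

Most of the compute is thus spent inside the mask, which is where the edit actually happens.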
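Similarly, Object-Centric 3D Token Merging can be pictured as shrinking the background token set before cross-frame attention. The sketch below is again an assumption-laden illustration rather than the paper's algorithm: the arbitrary kept/dropped split, the cosine-similarity assignment, and keep_ratio are made-up details. It merges each dropped background token into its most similar kept one by averaging, while foreground tokens pass through untouched.

import torch
import torch.nn.functional as F

def merge_background_tokens(tokens, fg_token_mask, keep_ratio=0.25):
    # tokens:        (N, D) flattened spatio-temporal tokens
    # fg_token_mask: (N,) bool, True = token lies in an edited foreground region
    fg = tokens[fg_token_mask]                 # foreground tokens kept as-is
    bg = tokens[~fg_token_mask]
    n_keep = max(1, int(keep_ratio * bg.shape[0]))

    kept, dropped = bg[:n_keep], bg[n_keep:]   # arbitrary split, for the sketch
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
    dst = sim.argmax(dim=-1)                   # most similar kept token per dropped token

    merged = kept.clone()
    counts = torch.ones(n_keep, 1)
    merged.index_add_(0, dst, dropped)         # sum dropped tokens into their targets
    counts.index_add_(0, dst, torch.ones(dropped.shape[0], 1))
    merged = merged / counts                   # turn sums into averages

    return torch.cat([fg, merged], dim=0)      # fewer tokens enter attention

Since attention cost grows quadratically with token count, discarding most of the redundant background tokens before the cross-frame attention layers is where the memory and latency savings come from.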