Object-Centric Diffusion for Efficient Video Editing
Kumara Kahatapitiya, Adil Karjauv, Davide Abati, Fatih Porikli, Yuki M. Asano, Amirhossein Habibian
arXiv:2401.05735 (arXiv - CS - Computer Vision and Pattern Recognition), January 11, 2024
Abstract
Diffusion-based video editing has reached impressive quality and can transform the global style, local structure, or attributes of given video inputs, following textual edit prompts. However, such solutions typically incur heavy memory and computational costs to generate temporally coherent frames, whether through diffusion inversion, cross-frame attention, or both. In this paper, we analyze these inefficiencies and suggest simple yet effective modifications that allow significant speed-ups while maintaining quality. Moreover, we introduce Object-Centric Diffusion (OCD), which further reduces latency by allocating more computation to the edited foreground regions, which are arguably more important for perceptual quality. We achieve this with two novel proposals: i) Object-Centric Sampling, which decouples the diffusion steps spent on salient regions from those spent on the background, allocating most of the model capacity to the former (see the first sketch below), and ii) Object-Centric 3D Token Merging, which reduces the cost of cross-frame attention by fusing redundant tokens in unimportant background regions (see the second sketch below). Both techniques can be applied to a given video editing model without retraining and drastically reduce its memory and computational cost. We evaluate our proposals on inversion-based and control-signal-based editing pipelines and show latency reductions of up to 10x at comparable synthesis quality.
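
The first proposal, Object-Centric Sampling, can be pictured as running the denoising loop on two different schedules, fine for the foreground and coarse for the background, then recomposing the result. The sketch below is a hypothetical illustration, not the paper's implementation: denoise_step, the step counts, and the linear schedule are all assumptions standing in for one step of an off-the-shelf diffusion editing model.

import torch

def object_centric_sampling(latent, fg_mask, denoise_step,
                            num_fg_steps=50, num_bg_steps=10):
    # latent:  (C, H, W) noisy latent of one frame
    # fg_mask: (1, H, W) binary mask, 1 = edited foreground region
    # denoise_step(latent, t) -> less-noisy latent; hypothetical stand-in
    # for a single step of the underlying editing model's sampler
    bg_mask = 1.0 - fg_mask

    # Background: few denoising steps on a coarse schedule.
    bg = latent.clone()
    for t in torch.linspace(1.0, 0.0, num_bg_steps):
        bg = denoise_step(bg, t)

    # Foreground: many denoising steps, where quality matters most.
    fg = latent.clone()
    for t in torch.linspace(1.0, 0.0, num_fg_steps):
        fg = denoise_step(fg, t)

    # Recompose: finely denoised foreground over a cheap background.
    return fg * fg_mask + bg * bg_mask

Most of the compute is thus spent inside the mask, which is where the edit actually happens.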
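Similarly, Object-Centric 3D Token Merging can be pictured as shrinking the background token set before cross-frame attention. The sketch below is again an assumption-laden illustration rather than the paper's algorithm: the arbitrary kept/dropped split, the cosine-similarity assignment, and keep_ratio are made-up details. It merges each dropped background token into its most similar kept one by averaging, while foreground tokens pass through untouched.

import torch
import torch.nn.functional as F

def merge_background_tokens(tokens, fg_token_mask, keep_ratio=0.25):
    # tokens:        (N, D) flattened spatio-temporal tokens
    # fg_token_mask: (N,) bool, True = token lies in an edited foreground region
    fg = tokens[fg_token_mask]                 # foreground tokens kept as-is
    bg = tokens[~fg_token_mask]
    n_keep = max(1, int(keep_ratio * bg.shape[0]))

    kept, dropped = bg[:n_keep], bg[n_keep:]   # arbitrary split, for the sketch
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
    dst = sim.argmax(dim=-1)                   # most similar kept token per dropped token

    merged = kept.clone()
    counts = torch.ones(n_keep, 1)
    merged.index_add_(0, dst, dropped)         # sum dropped tokens into their targets
    counts.index_add_(0, dst, torch.ones(dropped.shape[0], 1))
    merged = merged / counts                   # turn sums into averages

    return torch.cat([fg, merged], dim=0)      # fewer tokens enter attention

Since attention cost grows quadratically with token count, discarding most of the redundant background tokens before the cross-frame attention layers is where the memory and latency savings come from.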