Still-Moving: Customized Video Generation without Customized Video Data

IF 7.8 1区 计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING ACM Transactions on Graphics Pub Date : 2024-11-19 DOI:10.1145/3687945
Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri
{"title":"Still-Moving: Customized Video Generation without Customized Video Data","authors":"Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri","doi":"10.1145/3687945","DOIUrl":null,"url":null,"abstract":"Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a T2I model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight <jats:italic>Spatial Adapters</jats:italic> that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on <jats:italic>\"frozen videos\"</jats:italic> (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel <jats:italic>Motion Adapter</jats:italic> module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"19 1","pages":""},"PeriodicalIF":7.8000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Graphics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3687945","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0

Abstract

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a T2I model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
静止不动:无需定制视频数据即可生成定制视频
定制文本到图像(T2I)模型最近取得了巨大进展,尤其是在个性化、风格化和条件生成等领域。然而,将这一进展扩展到视频生成仍处于起步阶段,这主要是由于缺乏定制的视频数据。在这项工作中,我们引入了一个新颖的通用框架--Still-Moving,用于定制文本到视频(T2V)模型,而不需要任何定制的视频数据。该框架适用于著名的 T2V 设计,其中视频模型建立在 T2I 模型之上(例如,通过膨胀)。我们假定可以访问定制版的 T2I 模型,该模型只在静态图像数据上进行训练(例如,使用 DreamBooth)。天真地将定制 T2I 模型的权重插入 T2V 模型往往会导致明显的伪影或与定制数据的一致性不足。为了克服这一问题,我们训练了轻量级空间适配器,以调整注入的 T2I 层产生的特征。重要的是,我们的适配器是在 "冻结视频"(即重复图像)上进行训练的,而 "冻结视频 "是由定制的 T2I 模型生成的图像样本构建的。新颖的运动适配器模块为这种训练提供了便利,它允许我们在这种静态视频上进行训练,同时保留视频模型的运动先验。测试时,我们会移除运动适配器模块,只保留经过训练的空间适配器。这就恢复了 T2V 模型的运动先验,同时遵循了定制 T2I 模型的空间先验。我们在个性化、风格化和条件生成等不同任务中展示了我们方法的有效性。在所有评估场景中,我们的方法将定制 T2I 模型的空间先验与 T2V 模型提供的运动先验进行了无缝整合。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
ACM Transactions on Graphics
ACM Transactions on Graphics 工程技术-计算机:软件工程
CiteScore
14.30
自引率
25.80%
发文量
193
审稿时长
12 months
期刊介绍: ACM Transactions on Graphics (TOG) is a peer-reviewed scientific journal that aims to disseminate the latest findings of note in the field of computer graphics. It has been published since 1982 by the Association for Computing Machinery. Starting in 2003, all papers accepted for presentation at the annual SIGGRAPH conference are printed in a special summer issue of the journal.
期刊最新文献
Appearance-Preserving Scene Aggregation for Level-of-Detail Rendering Unified Pressure, Surface Tension and Friction for SPH Fluids Direct Manipulation of Procedural Implicit Surfaces 3DGSR: Implicit Surface Reconstruction with 3D Gaussian Splatting Quark: Real-time, High-resolution, and General Neural View Synthesis
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1