Still-Moving: Customized Video Generation without Customized Video Data

IF 7.8 1区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING ACM Transactions on Graphics Pub Date : 2024-11-19 DOI:10.1145/3687945

Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri

{"title":"Still-Moving: Customized Video Generation without Customized Video Data","authors":"Hila Chefer, Shiran Zada, Roni Paiss, Ariel Ephrat, Omer Tov, Michael Rubinstein, Lior Wolf, Tali Dekel, Tomer Michaeli, Inbar Mosseri","doi":"10.1145/3687945","DOIUrl":null,"url":null,"abstract":"Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a T2I model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on \"frozen videos\" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.","PeriodicalId":50913,"journal":{"name":"ACM Transactions on Graphics","volume":"19 1","pages":""},"PeriodicalIF":7.8000,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Graphics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3687945","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel generic framework for customizing a text-to-video (T2V) model, without requiring any customized video data. The framework applies to the prominent T2V design where the video model is built over a T2I model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth). Naively plugging in the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on "frozen videos" (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel Motion Adapter module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and leave in only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with a motion prior supplied by the T2V model.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

静止不动：无需定制视频数据即可生成定制视频

定制文本到图像（T2I）模型最近取得了巨大进展，尤其是在个性化、风格化和条件生成等领域。然而，将这一进展扩展到视频生成仍处于起步阶段，这主要是由于缺乏定制的视频数据。在这项工作中，我们引入了一个新颖的通用框架--Still-Moving，用于定制文本到视频（T2V）模型，而不需要任何定制的视频数据。该框架适用于著名的 T2V 设计，其中视频模型建立在 T2I 模型之上（例如，通过膨胀）。我们假定可以访问定制版的 T2I 模型，该模型只在静态图像数据上进行训练（例如，使用 DreamBooth）。天真地将定制 T2I 模型的权重插入 T2V 模型往往会导致明显的伪影或与定制数据的一致性不足。为了克服这一问题，我们训练了轻量级空间适配器，以调整注入的 T2I 层产生的特征。重要的是，我们的适配器是在 "冻结视频"（即重复图像）上进行训练的，而 "冻结视频 "是由定制的 T2I 模型生成的图像样本构建的。新颖的运动适配器模块为这种训练提供了便利，它允许我们在这种静态视频上进行训练，同时保留视频模型的运动先验。测试时，我们会移除运动适配器模块，只保留经过训练的空间适配器。这就恢复了 T2V 模型的运动先验，同时遵循了定制 T2I 模型的空间先验。我们在个性化、风格化和条件生成等不同任务中展示了我们方法的有效性。在所有评估场景中，我们的方法将定制 T2I 模型的空间先验与 T2V 模型提供的运动先验进行了无缝整合。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

ACM Transactions on Graphics 工程技术-计算机：软件工程

CiteScore

14.30

自引率

25.80%

发文量

193

审稿时长

12 months

期刊介绍： ACM Transactions on Graphics (TOG) is a peer-reviewed scientific journal that aims to disseminate the latest findings of note in the field of computer graphics. It has been published since 1982 by the Association for Computing Machinery. Starting in 2003, all papers accepted for presentation at the annual SIGGRAPH conference are printed in a special summer issue of the journal.

期刊最新文献

Encoded Marker Clusters for Auto-Labeling in Optical Motion Capture Direct Rendering of Intrinsic Triangulations Texture Size Reduction Through Symmetric Overlap and Texture Carving Don't Splat your Gaussians: Volumetric Ray-Traced Primitives for Modeling and Rendering Scattering and Emissive Media Implicit Bonded Discrete Element Method with Manifold Optimization