Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber
{"title":"重现一切:利用运动文本反转实现语义视频运动转移","authors":"Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber","doi":"arxiv-2408.00458","DOIUrl":null,"url":null,"abstract":"Recent years have seen a tremendous improvement in the quality of video\ngeneration and editing approaches. While several techniques focus on editing\nappearance, few address motion. Current approaches using text, trajectories, or\nbounding boxes are limited to simple motions, so we specify motions with a\nsingle motion reference video instead. We further propose to use a pre-trained\nimage-to-video model rather than a text-to-video model. This approach allows us\nto preserve the exact appearance and position of a target object or scene and\nhelps disentangle appearance from motion. Our method, called motion-textual\ninversion, leverages our observation that image-to-video models extract\nappearance mainly from the (latent) image input, while the text/image embedding\ninjected via cross-attention predominantly controls motion. We thus represent\nmotion using text/image embedding tokens. By operating on an inflated\nmotion-text embedding containing multiple text/image embedding tokens per\nframe, we achieve a high temporal motion granularity. Once optimized on the\nmotion reference video, this embedding can be applied to various target images\nto generate videos with semantically similar motions. Our approach does not\nrequire spatial alignment between the motion reference video and target image,\ngeneralizes across various domains, and can be applied to various tasks such as\nfull-body and face reenactment, as well as controlling the motion of inanimate\nobjects and the camera. We empirically demonstrate the effectiveness of our\nmethod in the semantic video motion transfer task, significantly outperforming\nexisting methods in this context.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"36 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion\",\"authors\":\"Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber\",\"doi\":\"arxiv-2408.00458\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent years have seen a tremendous improvement in the quality of video\\ngeneration and editing approaches. While several techniques focus on editing\\nappearance, few address motion. Current approaches using text, trajectories, or\\nbounding boxes are limited to simple motions, so we specify motions with a\\nsingle motion reference video instead. We further propose to use a pre-trained\\nimage-to-video model rather than a text-to-video model. This approach allows us\\nto preserve the exact appearance and position of a target object or scene and\\nhelps disentangle appearance from motion. Our method, called motion-textual\\ninversion, leverages our observation that image-to-video models extract\\nappearance mainly from the (latent) image input, while the text/image embedding\\ninjected via cross-attention predominantly controls motion. We thus represent\\nmotion using text/image embedding tokens. By operating on an inflated\\nmotion-text embedding containing multiple text/image embedding tokens per\\nframe, we achieve a high temporal motion granularity. 
Once optimized on the\\nmotion reference video, this embedding can be applied to various target images\\nto generate videos with semantically similar motions. Our approach does not\\nrequire spatial alignment between the motion reference video and target image,\\ngeneralizes across various domains, and can be applied to various tasks such as\\nfull-body and face reenactment, as well as controlling the motion of inanimate\\nobjects and the camera. We empirically demonstrate the effectiveness of our\\nmethod in the semantic video motion transfer task, significantly outperforming\\nexisting methods in this context.\",\"PeriodicalId\":501174,\"journal\":{\"name\":\"arXiv - CS - Graphics\",\"volume\":\"36 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.00458\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00458","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.
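
The abstract describes the optimization only at a high level; the following is a minimal sketch of how it could look in PyTorch. This is a hedged reconstruction from the abstract, not the authors' code: the `denoise` interface, the token count per frame, the embedding width, and the noise schedule are all assumptions.

```python
# Minimal sketch of motion-textual inversion, reconstructed from the abstract.
# Hypothetical interface: `denoise(noisy_latents, t, image_latent, embedding)`
# is assumed to be the frozen image-to-video model's noise predictor.
import torch
import torch.nn.functional as F

NUM_FRAMES = 16        # frames in the motion reference video (assumption)
TOKENS_PER_FRAME = 4   # "multiple tokens per frame" -- exact count assumed
EMBED_DIM = 1024       # text/image embedding width (assumption)

# Inflated motion-text embedding: a learnable set of tokens for every frame,
# which is what gives the high temporal motion granularity.
motion_embedding = torch.nn.Parameter(
    0.02 * torch.randn(NUM_FRAMES, TOKENS_PER_FRAME, EMBED_DIM)
)
optimizer = torch.optim.AdamW([motion_embedding], lr=1e-3)

# Simple DDPM-style noising schedule; in practice the frozen model's own
# scheduler would be used.
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x, noise, t):
    return alpha_bars[t].sqrt() * x + (1.0 - alpha_bars[t]).sqrt() * noise

def inversion_step(denoise, video_latents, first_frame_latent):
    """One optimization step on the motion reference video.

    video_latents: (NUM_FRAMES, C, H, W) encoding of the reference video.
    The model itself stays frozen; only `motion_embedding` is updated.
    """
    noise = torch.randn_like(video_latents)
    t = torch.randint(0, 1000, (1,))          # random diffusion timestep
    noisy = add_noise(video_latents, noise, t)
    # Appearance is taken mainly from the (latent) image input, while the
    # learnable tokens injected via cross-attention control the motion.
    pred = denoise(noisy, t, first_frame_latent, motion_embedding)
    loss = F.mse_loss(pred, noise)            # standard diffusion training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the image-to-video model stays frozen and only the embedding receives gradients, appearance (supplied by the latent image input) and motion (supplied by the cross-attention tokens) remain disentangled, which is the core observation the abstract relies on.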
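At inference time, the abstract states that the optimized embedding can be applied to arbitrary target images without spatial alignment. A correspondingly hedged sketch, where `sample_video` stands in for the frozen model's full sampling loop (a hypothetical interface):

```python
import torch

@torch.no_grad()
def reenact(sample_video, target_image_latent, motion_embedding):
    # The target image supplies appearance and position; the optimized tokens
    # steer motion via cross-attention at every denoising step. No spatial
    # alignment between reference video and target image is required.
    return sample_video(
        image_latent=target_image_latent,
        text_embedding=motion_embedding,
        num_frames=motion_embedding.shape[0],
    )
```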