Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion
{"title":"重现一切:利用运动文本反转实现语义视频运动转移","authors":"Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber","doi":"arxiv-2408.00458","DOIUrl":null,"url":null,"abstract":"Recent years have seen a tremendous improvement in the quality of video\ngeneration and editing approaches. While several techniques focus on editing\nappearance, few address motion. Current approaches using text, trajectories, or\nbounding boxes are limited to simple motions, so we specify motions with a\nsingle motion reference video instead. We further propose to use a pre-trained\nimage-to-video model rather than a text-to-video model. This approach allows us\nto preserve the exact appearance and position of a target object or scene and\nhelps disentangle appearance from motion. Our method, called motion-textual\ninversion, leverages our observation that image-to-video models extract\nappearance mainly from the (latent) image input, while the text/image embedding\ninjected via cross-attention predominantly controls motion. We thus represent\nmotion using text/image embedding tokens. By operating on an inflated\nmotion-text embedding containing multiple text/image embedding tokens per\nframe, we achieve a high temporal motion granularity. Once optimized on the\nmotion reference video, this embedding can be applied to various target images\nto generate videos with semantically similar motions. Our approach does not\nrequire spatial alignment between the motion reference video and target image,\ngeneralizes across various domains, and can be applied to various tasks such as\nfull-body and face reenactment, as well as controlling the motion of inanimate\nobjects and the camera. We empirically demonstrate the effectiveness of our\nmethod in the semantic video motion transfer task, significantly outperforming\nexisting methods in this context.","PeriodicalId":501174,"journal":{"name":"arXiv - CS - Graphics","volume":"36 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion\",\"authors\":\"Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber\",\"doi\":\"arxiv-2408.00458\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent years have seen a tremendous improvement in the quality of video\\ngeneration and editing approaches. While several techniques focus on editing\\nappearance, few address motion. Current approaches using text, trajectories, or\\nbounding boxes are limited to simple motions, so we specify motions with a\\nsingle motion reference video instead. We further propose to use a pre-trained\\nimage-to-video model rather than a text-to-video model. This approach allows us\\nto preserve the exact appearance and position of a target object or scene and\\nhelps disentangle appearance from motion. Our method, called motion-textual\\ninversion, leverages our observation that image-to-video models extract\\nappearance mainly from the (latent) image input, while the text/image embedding\\ninjected via cross-attention predominantly controls motion. We thus represent\\nmotion using text/image embedding tokens. By operating on an inflated\\nmotion-text embedding containing multiple text/image embedding tokens per\\nframe, we achieve a high temporal motion granularity. 
Once optimized on the\\nmotion reference video, this embedding can be applied to various target images\\nto generate videos with semantically similar motions. Our approach does not\\nrequire spatial alignment between the motion reference video and target image,\\ngeneralizes across various domains, and can be applied to various tasks such as\\nfull-body and face reenactment, as well as controlling the motion of inanimate\\nobjects and the camera. We empirically demonstrate the effectiveness of our\\nmethod in the semantic video motion transfer task, significantly outperforming\\nexisting methods in this context.\",\"PeriodicalId\":501174,\"journal\":{\"name\":\"arXiv - CS - Graphics\",\"volume\":\"36 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.00458\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00458","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber
Recent years have seen a tremendous improvement in the quality of video
generation and editing approaches. While several techniques focus on editing
appearance, few address motion. Current approaches using text, trajectories, or
bounding boxes are limited to simple motions, so we specify motions with a
single motion reference video instead. We further propose to use a pre-trained
image-to-video model rather than a text-to-video model. This approach allows us
to preserve the exact appearance and position of a target object or scene and
helps disentangle appearance from motion. Our method, called motion-textual
inversion, leverages our observation that image-to-video models extract
appearance mainly from the (latent) image input, while the text/image embedding
injected via cross-attention predominantly controls motion. We thus represent
motion using text/image embedding tokens. By operating on an inflated
motion-text embedding containing multiple text/image embedding tokens per
frame, we achieve a high temporal motion granularity. Once optimized on the
motion reference video, this embedding can be applied to various target images
to generate videos with semantically similar motions. Our approach does not
require spatial alignment between the motion reference video and target image,
generalizes across various domains, and can be applied to various tasks such as
full-body and face reenactment, as well as controlling the motion of inanimate
objects and the camera. We empirically demonstrate the effectiveness of our
method in the semantic video motion transfer task, significantly outperforming
existing methods in this context.
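To make the optimization idea concrete, below is a minimal PyTorch-style sketch of the procedure suggested by the abstract: a frozen image-to-video denoiser, a learnable "inflated" embedding with several text/image tokens per frame, and a standard noise-prediction loss computed on the motion reference video. The `FrozenI2VDenoiser` class, every tensor shape, and the noising schedule are placeholders invented for illustration; this is not the authors' implementation.

```python
import torch

# Illustrative (hypothetical) dimensions.
NUM_FRAMES, TOKENS_PER_FRAME, EMBED_DIM = 16, 4, 1024
LATENT_C, LATENT_H, LATENT_W = 4, 32, 32


class FrozenI2VDenoiser(torch.nn.Module):
    """Tiny stand-in for a pre-trained image-to-video denoiser (placeholder).

    A real model would be a video diffusion network that cross-attends to the
    text/image embedding; this toy version only mixes the embedding in so that
    gradients reach it and the sketch runs end to end.
    """

    def __init__(self):
        super().__init__()
        self.latent_proj = torch.nn.Conv3d(LATENT_C, LATENT_C, kernel_size=1)
        self.embed_proj = torch.nn.Linear(EMBED_DIM, LATENT_C)

    def forward(self, noisy_latents, image_latent, text_embeds):
        # noisy_latents: (B, C, F, H, W); image_latent: (B, C, H, W);
        # text_embeds: (F, T, D) -- multiple tokens per frame ("inflated").
        cond = self.embed_proj(text_embeds.mean(dim=1))   # (F, C)
        cond = cond.t()[None, :, :, None, None]           # (1, C, F, 1, 1)
        return self.latent_proj(noisy_latents + image_latent[:, :, None]) + cond


def optimize_motion_embedding(ref_latents, ref_image_latent, steps=1000, lr=1e-3):
    """Optimize an inflated motion-text embedding on one motion reference video."""
    denoiser = FrozenI2VDenoiser().eval()
    for p in denoiser.parameters():
        p.requires_grad_(False)          # the model stays frozen; only the embedding is trained

    motion_embeds = torch.randn(NUM_FRAMES, TOKENS_PER_FRAME, EMBED_DIM,
                                requires_grad=True)
    optimizer = torch.optim.Adam([motion_embeds], lr=lr)

    for _ in range(steps):
        noise = torch.randn_like(ref_latents)
        t = torch.rand(ref_latents.shape[0]).view(-1, 1, 1, 1, 1)
        # Toy linear noising schedule; a real model defines its own.
        noisy_latents = (1 - t).sqrt() * ref_latents + t.sqrt() * noise

        pred = denoiser(noisy_latents, ref_image_latent, motion_embeds)
        loss = torch.nn.functional.mse_loss(pred, noise)  # noise-prediction loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return motion_embeds.detach()


if __name__ == "__main__":
    # Placeholder latents standing in for the encoded reference video and its first frame.
    ref_latents = torch.randn(1, LATENT_C, NUM_FRAMES, LATENT_H, LATENT_W)
    first_frame = torch.randn(1, LATENT_C, LATENT_H, LATENT_W)
    embeds = optimize_motion_embedding(ref_latents, first_frame, steps=10)
    print(embeds.shape)  # torch.Size([16, 4, 1024])
```

At inference time, the same optimized embedding would be paired with the latent of a new target image and fed to the (real) frozen image-to-video model, so that the generated video keeps the target's appearance while following a motion semantically similar to the reference.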