MotionFix: Text-Driven 3D Human Motion Editing
Nikos Athanasiou, Alpár Ceske, Markos Diomataris, Michael J. Black, Gül Varol
arXiv:2408.00712 (arXiv - CS - Graphics), 2024-08-01
The focus of this paper is 3D motion editing. Given a 3D human motion and a
textual description of the desired modification, our goal is to generate an
edited motion as described by the text. The challenges include the lack of
training data and the design of a model that faithfully edits the source
motion. In this paper, we address both these challenges. We build a methodology
to semi-automatically collect a dataset of triplets in the form of (i) a source
motion, (ii) a target motion, and (iii) an edit text, and create the new
MotionFix dataset. Having access to such data allows us to train a conditional
diffusion model, TMED, that takes both the source motion and the edit text as
input. We further build various baselines trained only on text-motion pair
datasets, and show the superior performance of our model trained on triplets. We
introduce new retrieval-based metrics for motion editing and establish a new
benchmark on the evaluation set of MotionFix. Our results are encouraging,
paving the way for further research on fine-grained motion generation. Code and
models will be made publicly available.
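
To make the data format and conditioning scheme described above more concrete, the following is a minimal sketch of a MotionFix-style triplet and of one diffusion training step for a TMED-like denoiser conditioned on the source motion and the edit text. All names, shapes, and interfaces here (MotionTriplet, train_step, pose_dim, the denoiser signature) are illustrative assumptions, not the paper's actual dataset schema or implementation.

```python
# Hypothetical sketch: a (source motion, target motion, edit text) triplet and
# a single diffusion training step conditioned on source motion + edit text.
# Names, shapes, and the denoiser interface are assumptions for illustration.
from dataclasses import dataclass
import numpy as np


@dataclass
class MotionTriplet:
    source_motion: np.ndarray   # (num_frames, pose_dim) source pose sequence
    target_motion: np.ndarray   # (num_frames, pose_dim) edited pose sequence
    edit_text: str              # e.g. "raise the left arm higher"


def train_step(triplet: MotionTriplet, denoiser, text_encoder,
               timestep: int, noise_schedule) -> float:
    """One hypothetical training step: noise the target motion and ask the
    denoiser to predict the noise, given the source motion and edit text."""
    text_emb = text_encoder(triplet.edit_text)           # e.g. (text_dim,)
    noise = np.random.randn(*triplet.target_motion.shape)
    alpha = noise_schedule(timestep)                      # scalar in (0, 1)
    noisy_target = (np.sqrt(alpha) * triplet.target_motion
                    + np.sqrt(1.0 - alpha) * noise)

    # The denoiser sees the noisy target plus both conditioning signals.
    predicted_noise = denoiser(noisy_target, timestep,
                               source=triplet.source_motion, text=text_emb)
    return float(np.mean((predicted_noise - noise) ** 2))  # simple MSE loss
```

At inference time the same conditioning would apply: starting from noise, the denoiser is iteratively queried with the source motion and edit-text embedding so that the sampled motion stays close to the source while realizing the described change.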