Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers
Sohan Anisetty, James Hays
arXiv:2409.01591 (arXiv - CS - Graphics), published 2024-09-03
https://doi.org/arxiv-2409.01591
Our research presents a novel motion generation framework designed to produce
whole-body motion sequences conditioned on multiple modalities simultaneously,
specifically text and audio inputs. Leveraging Vector Quantized Variational
Autoencoders (VQVAEs) for motion discretization and a bidirectional Masked
Language Modeling (MLM) strategy for efficient token prediction, our approach
achieves improved processing efficiency and coherence in the generated motions.
By integrating spatial attention mechanisms and a token critic, we ensure
consistency and naturalness in the generated motions. This framework expands
the possibilities of motion generation, addressing the limitations of existing
approaches and opening avenues for multimodal motion synthesis.
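The pipeline the abstract outlines — discretize motion into tokens with a VQ-VAE codebook, then fill in masked tokens bidirectionally — can be illustrated with a minimal sketch. This is not the paper's implementation: the codebook, the `quantize` helper, the stand-in `logits_fn` predictor, and the MaskGIT-style confidence schedule in `iterative_masked_decode` are all illustrative assumptions.

```python
import numpy as np

def quantize(features, codebook):
    """VQ-VAE-style discretization: map each continuous motion frame to the
    id of its nearest codebook entry (shapes here are hypothetical)."""
    # features: (T, D) frames; codebook: (K, D) learned entries
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)  # (T,) discrete token ids

def iterative_masked_decode(logits_fn, length, mask_id, steps=4):
    """Bidirectional masked-token generation: start fully masked, predict all
    positions at once each step, and commit only the most confident ones."""
    tokens = np.full(length, mask_id)
    for step in range(steps):
        probs = logits_fn(tokens)            # (length, vocab) per-position distribution
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        conf[tokens != mask_id] = np.inf     # already-committed tokens stay fixed
        # commit a growing fraction of positions each step
        n_keep = int(np.ceil(length * (step + 1) / steps))
        keep = np.argsort(-conf)[:n_keep]
        tokens[keep] = np.where(tokens[keep] == mask_id, pred[keep], tokens[keep])
    return tokens
```

A token critic, as mentioned in the abstract, would replace the raw model confidence `conf` with a learned score of which predicted tokens look plausible, which is what makes the iterative commit order meaningful.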