EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation
Authors: Xingqun Qi; Chen Liu; Lincheng Li; Jie Hou; Haoran Xin; Xin Yu
Journal: IEEE Transactions on Multimedia, vol. 26, pp. 10420-10430
DOI: 10.1109/TMM.2024.3407692
Publication date: 2024-03-31
Publication type: Journal Article
URL: https://ieeexplore.ieee.org/document/10543093/
Impact factor: 9.9; JCR: Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Citations: 0
Abstract
Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that emotion is one of the key factors of authentic co-speech gesture generation. In this work, we propose EmotionGesture, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.
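The abstract does not give the formula for the Motion-Smooth Loss, so the following is only a rough illustration of the general idea of penalizing jitter in a pose sequence. It is not the authors' objective. The sketch assumes each pose is a flat list of joint coordinates and uses a generic second-order finite-difference (acceleration) penalty. The function name `smoothness_penalty` and the data layout are hypothetical.

```python
from typing import Sequence

# One pose frame: flattened 3D joint coordinates (hypothetical layout).
Frame = Sequence[float]


def smoothness_penalty(poses: Sequence[Frame]) -> float:
    """Mean squared second-order finite difference over a pose sequence.

    Generic jitter penalty for illustration only; this is an assumption,
    NOT the Motion-Smooth Loss defined in the paper.
    """
    if len(poses) < 3:
        # Acceleration needs at least three consecutive frames.
        return 0.0
    total = 0.0
    count = 0
    for prev, cur, nxt in zip(poses, poses[1:], poses[2:]):
        for a, b, c in zip(prev, cur, nxt):
            accel = c - 2.0 * b + a  # discrete acceleration of one coordinate
            total += accel * accel
            count += 1
    if count == 0:
        # Frames carried no coordinates.
        return 0.0
    return total / count


# A single coordinate moving at constant velocity versus one that oscillates.
steady = [[0.0], [1.0], [2.0], [3.0]]
jittery = [[0.0], [1.0], [0.0], [1.0]]
print(smoothness_penalty(steady), smoothness_penalty(jittery))
```

Under this assumed penalty, motion at constant velocity scores zero and the oscillating sequence scores higher. That is the qualitative behavior a smoothness term is meant to encourage.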
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.