EmotionGesture：音频驱动的多元情感共语 3D 手势生成

IF 9.9 1区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS IEEE Transactions on Multimedia Pub Date : 2024-03-31 DOI:10.1109/TMM.2024.3407692

Xingqun Qi;Chen Liu;Lincheng Li;Jie Hou;Haoran Xin;Xin Yu

{"title":"EmotionGesture：音频驱动的多元情感共语 3D 手势生成","authors":"Xingqun Qi;Chen Liu;Lincheng Li;Jie Hou;Haoran Xin;Xin Yu","doi":"10.1109/TMM.2024.3407692","DOIUrl":null,"url":null,"abstract":"Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that \n<italic>emotion\n is one of the key factors of authentic co-speech gesture generation. In this work, we propose \n<italic>EmotionGesture\n, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10420-10430"},"PeriodicalIF":9.9000,"publicationDate":"2024-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation\",\"authors\":\"Xingqun Qi;Chen Liu;Lincheng Li;Jie Hou;Haoran Xin;Xin Yu\",\"doi\":\"10.1109/TMM.2024.3407692\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that \\n<italic>emotion\\n is one of the key factors of authentic co-speech gesture generation. In this work, we propose \\n<italic>EmotionGesture\\n, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"26 \",\"pages\":\"10420-10430\"},\"PeriodicalIF\":9.9000,\"publicationDate\":\"2024-03-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10543093/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10543093/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

生成生动多样的三维协同语音手势对于虚拟化身动画的各种应用至关重要。虽然大多数现有方法可以直接从音频生成手势，但它们通常忽略了情感是生成真实共同语音手势的关键因素之一。在这项工作中，我们提出了 EmotionGesture，一个从音频合成生动多样的情感共语 3D 手势的新型框架。考虑到情感往往与语音音频中的节奏节拍纠缠在一起，我们首先开发了一个情感节拍挖掘模块（EBM），以提取情感和音频节拍特征，并通过基于文本的视觉-节奏对齐来模拟它们之间的相关性。然后，我们提出了一种基于初始姿势的时空提示器（STP），可根据给定的初始姿势生成未来的手势。STP 可有效模拟初始姿势与未来手势之间的时空相关性，从而生成时空连贯的姿势提示。获得姿势提示、情感和音频节拍特征后，我们将通过转换器架构生成三维协同语音手势。然而，考虑到现有数据集的姿势通常包含抖动效应，这会导致生成的手势不稳定。为解决这一问题，我们提出了一种有效的目标函数，即运动平滑损失。具体来说，我们建立了运动偏移模型，通过迫使手势平滑来补偿抖动的地面实况。最后，我们提出了一种情感条件 VAE 来采样情感特征，使我们能够生成多样化的情感结果。广泛的实验证明，我们的框架优于最先进的框架，可以实现生动多样的情感共语 3D 手势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation

Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that emotion is one of the key factors of authentic co-speech gesture generation. In this work, we propose EmotionGesture , a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Multimedia 工程技术-电信学

CiteScore

11.70

自引率

11.00%

发文量

576

审稿时长

5.5 months

期刊介绍： The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.