EmotionGesture:音频驱动的多元情感共语 3D 手势生成

IF 9.9 1区 计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS IEEE Transactions on Multimedia Pub Date : 2024-03-31 DOI:10.1109/TMM.2024.3407692
Xingqun Qi;Chen Liu;Lincheng Li;Jie Hou;Haoran Xin;Xin Yu
{"title":"EmotionGesture:音频驱动的多元情感共语 3D 手势生成","authors":"Xingqun Qi;Chen Liu;Lincheng Li;Jie Hou;Haoran Xin;Xin Yu","doi":"10.1109/TMM.2024.3407692","DOIUrl":null,"url":null,"abstract":"Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that \n<italic><b>emotion</b></i>\n is one of the key factors of authentic co-speech gesture generation. In this work, we propose \n<italic><b>EmotionGesture</b></i>\n, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10420-10430"},"PeriodicalIF":9.9000,"publicationDate":"2024-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation\",\"authors\":\"Xingqun Qi;Chen Liu;Lincheng Li;Jie Hou;Haoran Xin;Xin Yu\",\"doi\":\"10.1109/TMM.2024.3407692\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that \\n<italic><b>emotion</b></i>\\n is one of the key factors of authentic co-speech gesture generation. In this work, we propose \\n<italic><b>EmotionGesture</b></i>\\n, a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"26 \",\"pages\":\"10420-10430\"},\"PeriodicalIF\":9.9000,\"publicationDate\":\"2024-03-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10543093/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10543093/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

摘要

生成生动多样的三维协同语音手势对于虚拟化身动画的各种应用至关重要。虽然大多数现有方法可以直接从音频生成手势,但它们通常忽略了情感是生成真实共同语音手势的关键因素之一。在这项工作中,我们提出了 EmotionGesture,一个从音频合成生动多样的情感共语 3D 手势的新型框架。考虑到情感往往与语音音频中的节奏节拍纠缠在一起,我们首先开发了一个情感节拍挖掘模块(EBM),以提取情感和音频节拍特征,并通过基于文本的视觉-节奏对齐来模拟它们之间的相关性。然后,我们提出了一种基于初始姿势的时空提示器(STP),可根据给定的初始姿势生成未来的手势。STP 可有效模拟初始姿势与未来手势之间的时空相关性,从而生成时空连贯的姿势提示。获得姿势提示、情感和音频节拍特征后,我们将通过转换器架构生成三维协同语音手势。然而,考虑到现有数据集的姿势通常包含抖动效应,这会导致生成的手势不稳定。为解决这一问题,我们提出了一种有效的目标函数,即运动平滑损失。具体来说,我们建立了运动偏移模型,通过迫使手势平滑来补偿抖动的地面实况。最后,我们提出了一种情感条件 VAE 来采样情感特征,使我们能够生成多样化的情感结果。广泛的实验证明,我们的框架优于最先进的框架,可以实现生动多样的情感共语 3D 手势。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation
Generating vivid and diverse 3D co-speech gestures is crucial for various applications in animating virtual avatars. While most existing methods can generate gestures from audio directly, they usually overlook that emotion is one of the key factors of authentic co-speech gesture generation. In this work, we propose EmotionGesture , a novel framework for synthesizing vivid and diverse emotional co-speech 3D gestures from audio. Considering emotion is often entangled with the rhythmic beat in speech audio, we first develop an Emotion-Beat Mining module (EBM) to extract the emotion and audio beat features as well as model their correlation via a transcript-based visual-rhythm alignment. Then, we propose an initial pose based Spatial-Temporal Prompter (STP) to generate future gestures from the given initial poses. STP effectively models the spatial-temporal correlations between the initial poses and the future gestures, thus producing the spatial-temporal coherent pose prompt. Once we obtain pose prompts, emotion, and audio beat features, we will generate 3D co-speech gestures through a transformer architecture. However, considering the poses of existing datasets often contain jittering effects, this would lead to generating unstable gestures. To address this issue, we propose an effective objective function, dubbed Motion-Smooth Loss. Specifically, we model motion offset to compensate for jittering ground-truth by forcing gestures to be smooth. Last, we present an emotion-conditioned VAE to sample emotion features, enabling us to generate diverse emotional results. Extensive experiments demonstrate that our framework outperforms the state-of-the-art, achieving vivid and diverse emotional co-speech 3D gestures.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
IEEE Transactions on Multimedia
IEEE Transactions on Multimedia 工程技术-电信学
CiteScore
11.70
自引率
11.00%
发文量
576
审稿时长
5.5 months
期刊介绍: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.
期刊最新文献
Screen Detection from Egocentric Image Streams Leveraging Multi-View Vision Language Model. TMT: Tri-Modal Translation Between Speech, Image, and Text by Processing Different Modalities as Different Languages HMS2Net: Heterogeneous Multimodal State Space Network via CLIP for Dynamic Scene Classification in Livestreaming Soundscape Captioning Using Sound Affective Quality Network and Large Language Model Denoised Semantic Features for Local Consistent No-Reference Image Quality Assessment
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1