DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework

Fan Zhang, Naye Ji, Fuxing Gao, Bozuo Zhao, Jingmei Wu, Yanbing Jiang, Hui Du, Zhenqing Ye, Jiayang Zhu, WeiFan Zhong, Leyao Yan, Xiaomeng Ma
arXiv - CS - Sound, published 2024-08-01. https://doi.org/arxiv-2408.00370

Abstract

Speech-driven gesture generation is an emerging area of virtual human creation. Current methods rely predominantly on Transformer-based architectures, which demand extensive memory and are slow at inference. To address these limitations, we propose DiM-Gestures, a novel end-to-end generative model that creates highly personalized 3D full-body gestures solely from raw speech audio using Mamba-based architectures. The model combines a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, built on a Mamba framework and a pre-trained WavLM model, autonomously derives implicit, continuous fuzzy features, which are then unified into a single latent feature. This feature is processed by the AdaLN Mamba-2, which applies a uniform conditional mechanism across all tokens to robustly model the interplay between the fuzzy features and the resulting gesture sequence. This approach ensures high fidelity in gesture-speech synchronization while maintaining the naturalness of the gestures. Using a diffusion model for training and inference, we evaluate the framework extensively, both subjectively and objectively, on the ZEGGS and BEAT datasets. These assessments show enhanced performance relative to contemporary state-of-the-art methods and results competitive with the DiTs architecture (Persona-Gestors), with lower memory usage and faster inference.
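To make the "uniform conditional mechanism across all tokens" concrete, the following is a minimal pure-Python sketch of adaptive layer normalization in general, not the authors' implementation. The function names (`layer_norm`, `linear`, `adaln`), the weight shapes, and the `1 + scale` parameterization (a convention common in DiT-style blocks) are assumptions made for illustration; the abstract does not specify them. The point it shows is that one scale vector and one shift vector are regressed from the condition feature and then applied identically to every token in the sequence.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def linear(vec, weight, bias):
    """Dense layer; `weight` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def adaln(tokens, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive layer norm (illustrative sketch).

    The scale and shift are computed once from the condition vector `cond`
    and reused for every token, so the conditioning is uniform over the sequence.
    """
    scale = linear(cond, w_scale, b_scale)  # hypothetical projection of the condition
    shift = linear(cond, w_shift, b_shift)  # hypothetical projection of the condition
    out = []
    for token in tokens:
        normed = layer_norm(token)
        out.append([n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)])
    return out


# Toy usage: two gesture tokens of width 4, conditioned on a 2-dimensional feature.
tokens = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 4.0, 4.0]]
cond = [1.0, -1.0]
zeros_w = [[0.0, 0.0] for _ in range(4)]
zeros_b = [0.0] * 4
modulated = adaln(tokens, cond, zeros_w, zeros_b, zeros_w, zeros_b)
```

With all projection weights and biases set to zero, as in the toy usage, the scale and shift vanish and the output reduces to plain layer normalization of each token. In the described model the condition would be the fused speech feature, and the modulated tokens would feed the Mamba-2 layers.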