Speech-driven head motion generation from waveforms

IF 2.4 · CAS Tier 3 (Computer Science) · JCR Q2 (Acoustics) · Speech Communication · Pub Date: 2024-03-01 · DOI: 10.1016/j.specom.2024.103056
JinHong Lu, Hiroshi Shimodaira
{"title":"Speech-driven head motion generation from waveforms","authors":"JinHong Lu,&nbsp;Hiroshi Shimodaira","doi":"10.1016/j.specom.2024.103056","DOIUrl":null,"url":null,"abstract":"<div><p>Head motion generation task for speech-driven virtual agent animation is commonly explored with handcrafted audio features, such as MFCCs as input features, plus additional features, such as energy and F0 in the literature. In this paper, we study the direct use of speech waveform to generate head motion. We claim that creating a task-specific feature from waveform to generate head motion leads to better performance than using standard acoustic features to generate head motion overall. At the same time, we completely abandon the handcrafted feature extraction process, leading to more effectiveness. However, the difficulty of creating a task-specific feature from waveform is their staggering quantity of irrelevant information, implicating potential cumbrance for neural network training. Thus, we apply a canonical-correlation-constrained autoencoder (CCCAE), where we are able to compress the high-dimensional waveform into a low-dimensional embedded feature, with the minimal error in reconstruction, and sustain the relevant information with the maximal cannonical correlation to head motion. We extend our previous research by including more speakers in our dataset and also adapt with a recurrent neural network, to show the feasibility of our proposed feature. Through comparisons between different acoustic features, our proposed feature, <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>, shows at least a 20% improvement in the correlation from the waveform, and outperforms the popular acoustic feature, MFCC, by at least 5% respectively for all speakers. Through the comparison in the feedforward neural network regression (FNN-regression) system, the <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>-based system shows comparable performance in objective evaluation. In long short-term memory (LSTM) experiments, LSTM-models improve the overall performance in normalised mean square error (NMSE) and CCA metrics, and adapt the <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>feature better, which makes the proposed LSTM-regression system outperform the MFCC-based system. 
We also re-design the subjective evaluation, and the subjective results show the animations generated by models where <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>was chosen to be better than the other models by the participants of MUSHRA test.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000281/pdfft?md5=3e4ce95ea878ead804890332c3362074&pid=1-s2.0-S0167639324000281-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000281","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0

Abstract

In the literature, head-motion generation for speech-driven virtual-agent animation is commonly explored with handcrafted audio features, such as MFCCs, as input, supplemented by additional features such as energy and F0. In this paper, we study the direct use of the speech waveform to generate head motion. We claim that learning a task-specific feature from the waveform leads to better overall head-motion generation than using standard acoustic features, while abandoning the handcrafted feature-extraction process entirely makes the pipeline more efficient. The difficulty of learning a task-specific feature from the waveform, however, is its staggering quantity of irrelevant information, which burdens neural-network training. We therefore apply a canonical-correlation-constrained autoencoder (CCCAE), which compresses the high-dimensional waveform into a low-dimensional embedding with minimal reconstruction error while retaining the information relevant to head motion, i.e. maximising the embedding's canonical correlation with the motion. We extend our previous research by including more speakers in the dataset and by adopting a recurrent neural network, to show the feasibility of the proposed feature. In comparisons between acoustic features, our proposed feature, WavCCCAE, improves the correlation with head motion by at least 20% over the raw waveform and outperforms the popular acoustic feature, MFCC, by at least 5% for all speakers. In the feedforward neural network regression (FNN-regression) comparison, the WavCCCAE-based system shows comparable performance in the objective evaluation. In the long short-term memory (LSTM) experiments, the LSTM models improve overall performance on the normalised mean square error (NMSE) and canonical correlation analysis (CCA) metrics and exploit the WavCCCAE feature better, so the proposed LSTM-regression system outperforms the MFCC-based system. We also redesign the subjective evaluation; in the resulting MUSHRA test, participants rated animations generated by the WavCCCAE-based models as better than those of the other models.
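The core component described above is the CCCAE objective: an autoencoder that reconstructs the waveform window while a canonical-correlation term ties its bottleneck embedding to the head-motion stream. The paper's exact architecture and hyperparameters are not reproduced here; the following is a minimal PyTorch sketch under assumed settings (window size `wav_dim`, embedding size `emb_dim`, a Deep-CCA-style correlation loss), with all names illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cca_loss(h1, h2, eps=1e-6):
    """Negative sum of canonical correlations between batches
    h1: (N, d1) and h2: (N, d2). A differentiable CCA objective in
    the spirit of Deep CCA (Andrew et al., 2013); needs N >> d1, d2."""
    n = h1.size(0)
    h1 = h1 - h1.mean(dim=0, keepdim=True)
    h2 = h2 - h2.mean(dim=0, keepdim=True)
    s11 = h1.T @ h1 / (n - 1) + eps * torch.eye(h1.size(1), device=h1.device)
    s22 = h2.T @ h2 / (n - 1) + eps * torch.eye(h2.size(1), device=h2.device)
    s12 = h1.T @ h2 / (n - 1)
    # Whiten both views with inverse matrix square roots.
    e1, v1 = torch.linalg.eigh(s11)
    e2, v2 = torch.linalg.eigh(s22)
    s11_isqrt = v1 @ torch.diag(e1.clamp_min(eps).rsqrt()) @ v1.T
    s22_isqrt = v2 @ torch.diag(e2.clamp_min(eps).rsqrt()) @ v2.T
    t = s11_isqrt @ s12 @ s22_isqrt
    # The singular values of t are the canonical correlations.
    return -torch.linalg.svdvals(t).sum()


class CCCAE(nn.Module):
    """Hypothetical canonical-correlation-constrained autoencoder:
    compress a raw-waveform window into a low-dimensional embedding."""

    def __init__(self, wav_dim=4000, emb_dim=30, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(wav_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, wav_dim),
        )

    def forward(self, wav):
        z = self.encoder(wav)
        return z, self.decoder(z)


def cccae_objective(model, wav, head_motion, alpha=1.0):
    """Minimise reconstruction error while maximising the embedding's
    canonical correlation with head motion (alpha trades the two off)."""
    z, recon = model(wav)
    return F.mse_loss(recon, wav) + alpha * cca_loss(z, head_motion)
```

Once trained, only the encoder would be kept: each waveform window is mapped to its embedding, which becomes the WavCCCAE input feature for the regression stage.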
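The regression stage then maps sequences of WavCCCAE embeddings to head-motion parameters. Below is a minimal sketch of an LSTM-regression model and the NMSE metric, with assumed dimensions (e.g. three rotation parameters per frame) and an assumed normalisation convention, not the authors' exact setup.

```python
class LSTMRegressor(nn.Module):
    """Sketch of the LSTM-regression stage: a sequence of embeddings in,
    per-frame head-motion parameters out."""

    def __init__(self, feat_dim=30, hidden=128, motion_dim=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, motion_dim)

    def forward(self, feats):              # feats: (batch, time, feat_dim)
        h, _ = self.lstm(feats)            # h: (batch, time, hidden)
        return self.out(h)                 # (batch, time, motion_dim)


def nmse(pred, target):
    """Normalised mean square error; dividing the MSE by the target
    variance is one common convention (the paper's may differ)."""
    return F.mse_loss(pred, target) / target.var()
```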

Source journal: Speech Communication (Engineering & Technology — Computer Science: Interdisciplinary Applications)
CiteScore: 6.80
Self-citation rate: 6.20%
Annual articles: 94
Review time: 19.2 weeks
About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
Latest articles in this journal:
A corpus of audio-visual recordings of linguistically balanced, Danish sentences for speech-in-noise experiments
Forms, factors and functions of phonetic convergence: Editorial
Feasibility of acoustic features of vowel sounds in estimating the upper airway cross sectional area during wakefulness: A pilot study
Zero-shot voice conversion based on feature disentanglement
Multi-modal co-learning for silent speech recognition based on ultrasound tongue images