Double-DCCCAE: Estimation of Body Gestures From Speech Waveform
Jinhong Lu, Tianhang Liu, Shuzhuang Xu, H. Shimodaira
ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021-06-06
DOI: 10.1109/ICASSP39728.2021.9414660
Citations: 9
Abstract
This paper presents an approach for body-motion estimation from the audio speech waveform, where context information in both the input and output streams is taken into account without using recurrent models. Previous works commonly use multiple frames of input to estimate a single frame of motion data, giving little consideration to the temporal structure of the generated motion. To address this, we extend our previous work and propose the double deep canonical-correlation-constrained autoencoder (D-DCCCAE), which encodes speech and motion segments into fixed-length embedded features that are well correlated with the corresponding segments of the other modality. The learnt motion embedding is estimated from the learnt speech embedding through a simple neural network and then decoded back into the motion sequence. The proposed pair of embedded features showed higher correlation with the motion data than spectral features did, and our model was preferred over the baseline model (BA) in terms of human-likeness while being comparable in terms of appropriateness.
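The sketch below is a minimal illustration (not the authors' implementation) of the D-DCCCAE idea as described in the abstract: two segment autoencoders embed fixed-length speech and motion segments, a correlation-based term ties the two embeddings together, and a small mapping network predicts the motion embedding from the speech embedding. The per-dimension correlation loss is a simplified stand-in for the DCCA-style objective, and all layer sizes, segment dimensions, and loss weights are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SegmentAutoencoder(nn.Module):
    """Encodes a flattened segment into a fixed-length embedding and decodes it back."""
    def __init__(self, seg_dim, emb_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(seg_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, seg_dim))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def correlation_loss(z1, z2, eps=1e-8):
    """Negative mean per-dimension correlation between the two embeddings
    (a simplified stand-in for the canonical-correlation constraint)."""
    z1 = z1 - z1.mean(dim=0, keepdim=True)
    z2 = z2 - z2.mean(dim=0, keepdim=True)
    corr = (z1 * z2).sum(dim=0) / (z1.norm(dim=0) * z2.norm(dim=0) + eps)
    return -corr.mean()

# Hypothetical dimensions: 10-frame speech segments with 40-dim acoustic features,
# 10-frame motion segments with 15 joint parameters, 32-dim embeddings.
speech_ae = SegmentAutoencoder(seg_dim=10 * 40, emb_dim=32)
motion_ae = SegmentAutoencoder(seg_dim=10 * 15, emb_dim=32)
mapper = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))

params = list(speech_ae.parameters()) + list(motion_ae.parameters()) + list(mapper.parameters())
optim = torch.optim.Adam(params, lr=1e-3)
mse = nn.MSELoss()

# One illustrative training step on random stand-in data.
speech_seg = torch.randn(64, 10 * 40)
motion_seg = torch.randn(64, 10 * 15)

z_s, speech_rec = speech_ae(speech_seg)
z_m, motion_rec = motion_ae(motion_seg)

loss = (mse(speech_rec, speech_seg) + mse(motion_rec, motion_seg)   # reconstruction terms
        + correlation_loss(z_s, z_m)                                # cross-modal correlation term
        + mse(mapper(z_s), z_m.detach()))                           # speech-to-motion embedding mapping
optim.zero_grad()
loss.backward()
optim.step()

# At synthesis time, motion segments would be decoded from mapped speech embeddings:
with torch.no_grad():
    predicted_motion = motion_ae.decoder(mapper(speech_ae.encoder(speech_seg)))
```

In this toy setup the mapping network is trained jointly with the autoencoders; how the mapping and autoencoder training are scheduled in the actual system is detailed in the paper itself.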