H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka
{"title":"原因:从语音中估计跨模动作单元序列","authors":"H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka","doi":"10.21437/interspeech.2022-11232","DOIUrl":null,"url":null,"abstract":"This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for represent-ing speaker’s subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that the information about the speaker’s facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model “crossmodal AU sequence es-timation/estimator (CAUSE)”. We implemented several of the most basic architectures for CAUSE, and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"506-510"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"CAUSE: Crossmodal Action Unit Sequence Estimation from Speech\",\"authors\":\"H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka\",\"doi\":\"10.21437/interspeech.2022-11232\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for represent-ing speaker’s subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that the information about the speaker’s facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model “crossmodal AU sequence es-timation/estimator (CAUSE)”. We implemented several of the most basic architectures for CAUSE, and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. 
Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"506-510\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-11232\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-11232","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
CAUSE: Crossmodal Action Unit Sequence Estimation from Speech
This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for representing a speaker's subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that information about the speaker's facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model the "crossmodal AU sequence estimation/estimator (CAUSE)". We implemented several of the most basic architectures for CAUSE and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.
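The paper itself does not include code, but a minimal sketch can illustrate the kind of fully convolutional speech-to-AU estimator the abstract describes: a stack of 1-D convolutions over time that maps a mel-spectrogram to a frame-aligned sequence of AU activations. The framework (PyTorch), the number of mel bands (80), the number of AUs (17), and all layer sizes below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a fully convolutional crossmodal AU estimator.
# All architectural details (80 mel bands, 17 AUs, channel widths,
# kernel sizes) are assumptions for illustration, not the paper's model.
import torch
import torch.nn as nn


class AUSequenceEstimator(nn.Module):
    """Maps a mel-spectrogram (batch, n_mels, T) to an AU sequence
    (batch, n_aus, T) using 1-D convolutions over the time axis."""

    def __init__(self, n_mels: int = 80, n_aus: int = 17, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Conv1d(hidden, n_aus, kernel_size=1),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps each predicted AU activation in [0, 1];
        # the time resolution of the output matches the input frames.
        return torch.sigmoid(self.net(mel))


if __name__ == "__main__":
    model = AUSequenceEstimator()
    mel = torch.randn(4, 80, 200)   # 4 utterances, 80 mel bands, 200 frames
    aus = model(mel)                # -> shape (4, 17, 200)
    print(aus.shape)
```

Because the network is fully convolutional in time, it handles utterances of arbitrary length and produces one AU vector per spectrogram frame, which is consistent with the frame-level AU sequences the paper targets; training against AU labels extracted from the video side of an audio-visual corpus would typically use a regression loss such as MSE, though the paper's exact training objective is not reproduced here.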