H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka
{"title":"原因:从语音中估计跨模动作单元序列","authors":"H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka","doi":"10.21437/interspeech.2022-11232","DOIUrl":null,"url":null,"abstract":"This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for represent-ing speaker’s subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that the information about the speaker’s facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model “crossmodal AU sequence es-timation/estimator (CAUSE)”. We implemented several of the most basic architectures for CAUSE, and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"506-510"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"CAUSE: Crossmodal Action Unit Sequence Estimation from Speech\",\"authors\":\"H. Kameoka, Takuhiro Kaneko, Shogo Seki, Kou Tanaka\",\"doi\":\"10.21437/interspeech.2022-11232\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for represent-ing speaker’s subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that the information about the speaker’s facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model “crossmodal AU sequence es-timation/estimator (CAUSE)”. We implemented several of the most basic architectures for CAUSE, and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. 
Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"506-510\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-11232\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-11232","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
CAUSE: Crossmodal Action Unit Sequence Estimation from Speech
This paper proposes a task and method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the facial action coding system to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for representing a speaker's subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that information about the speaker's facial muscle movements is expressed in the generated speech and can somehow be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model the "crossmodal AU sequence estimation/estimator (CAUSE)". We implemented several of the most basic architectures for CAUSE and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.
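The paper itself does not include code, but a minimal sketch can illustrate the kind of fully convolutional speech-to-AU estimator the abstract describes: a stack of 1-D convolutions over time that maps a mel-spectrogram to a frame-aligned sequence of AU activations. The framework (PyTorch), the number of mel bands (80), the number of AUs (17), and all layer sizes below are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a fully convolutional crossmodal AU estimator.
# All architectural details (80 mel bands, 17 AUs, channel widths,
# kernel sizes) are assumptions for illustration, not the paper's model.
import torch
import torch.nn as nn


class AUSequenceEstimator(nn.Module):
    """Maps a mel-spectrogram (batch, n_mels, T) to an AU sequence
    (batch, n_aus, T) using 1-D convolutions over the time axis."""

    def __init__(self, n_mels: int = 80, n_aus: int = 17, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Conv1d(hidden, n_aus, kernel_size=1),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # Sigmoid keeps each predicted AU activation in [0, 1];
        # the time resolution of the output matches the input frames.
        return torch.sigmoid(self.net(mel))


if __name__ == "__main__":
    model = AUSequenceEstimator()
    mel = torch.randn(4, 80, 200)   # 4 utterances, 80 mel bands, 200 frames
    aus = model(mel)                # -> shape (4, 17, 200)
    print(aus.shape)
```

Because the network is fully convolutional in time, it handles utterances of arbitrary length and produces one AU vector per spectrogram frame, which is consistent with the frame-level AU sequences the paper targets; training against AU labels extracted from the video side of an audio-visual corpus would typically use a regression loss such as MSE, though the paper's exact training objective is not reproduced here.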