{"title":"Controllable Multi-Speaker Emotional Speech Synthesis With an Emotion Representation of High Generalization Capability","authors":"Junjie Zheng;Jian Zhou;Wenming Zheng;Liang Tao;Hon Keung Kwan","doi":"10.1109/TAFFC.2024.3412152","DOIUrl":null,"url":null,"abstract":"The aim of multi-speaker emotional speech synthesis is to generate speech for a designated speaker in a desired emotional state. The task is challenging due to the presence of speech variations, such as noise, content, and timbre, which can obstruct emotion extraction and transfer. This paper proposes a new approach to performing multi-speaker emotional speech synthesis. The proposed method, which is based on a seq2seq synthesizer, integrates emotion embedding as a conditioned variable to convey exact emotional information from reference audio to the synthesized speech. To boost emotion representation capability, we utilize a three-dimensional acoustic feature as input. And an emotion generalization module with adaptive instance normalization (AdaIN) is proposed to obtain emotion embedding with high generalization ability, which also results in improved controllability. The derived emotion embedding from the generalization module can be readily conditioned by affine parameters, allowing for control both the emotion category and the emotion intensity of synthesized speech. 
Various emotional speech synthesis experimental results of the propposed method demonstrate its state-of-the-art performance in multi-speaker emotional speech synthesis, coupled with its advantage of high emotion controllability.","PeriodicalId":13131,"journal":{"name":"IEEE Transactions on Affective Computing","volume":"16 1","pages":"68-82"},"PeriodicalIF":9.8000,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Affective Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10553521/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
The aim of multi-speaker emotional speech synthesis is to generate speech for a designated speaker in a desired emotional state. The task is challenging due to the presence of speech variations, such as noise, content, and timbre, which can obstruct emotion extraction and transfer. This paper proposes a new approach to multi-speaker emotional speech synthesis. The proposed method, which is based on a seq2seq synthesizer, integrates an emotion embedding as a conditioning variable to convey exact emotional information from reference audio to the synthesized speech. To boost emotion representation capability, we utilize a three-dimensional acoustic feature as input. In addition, an emotion generalization module with adaptive instance normalization (AdaIN) is proposed to obtain an emotion embedding with high generalization ability, which also improves controllability. The emotion embedding derived from the generalization module can be readily conditioned by affine parameters, allowing control of both the emotion category and the emotion intensity of the synthesized speech. Experimental results across various emotional speech synthesis tasks demonstrate the state-of-the-art performance of the proposed method in multi-speaker emotional speech synthesis, coupled with its advantage of high emotion controllability.
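The abstract's core mechanism, conditioning normalized features with emotion-derived affine parameters via AdaIN, can be illustrated with a minimal sketch. This is not the authors' implementation; the feature shapes and the random placeholders standing in for an emotion encoder's predicted scale/shift are assumptions for illustration only.

```python
import numpy as np

def adain(content, scale, shift, eps=1e-5):
    """Adaptive instance normalization (AdaIN) sketch.

    Each channel of the content feature map is normalized to zero mean
    and unit variance over time, then re-styled by affine parameters
    (here, hypothetically predicted from an emotion embedding).
    content: (channels, frames) acoustic feature map
    scale, shift: (channels,) affine parameters
    """
    mean = content.mean(axis=1, keepdims=True)
    std = content.std(axis=1, keepdims=True)
    normalized = (content - mean) / (std + eps)
    return scale[:, None] * normalized + shift[:, None]

# Hypothetical usage: a 4-channel feature over 100 frames; the affine
# parameters are placeholders for what an emotion encoder would emit.
rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 100))
scale = np.array([1.0, 2.0, 0.5, 1.5])
shift = np.array([0.0, 0.1, -0.1, 0.2])
styled = adain(feat, scale, shift)
```

Because normalization first strips the input's per-channel statistics, the output's channel-wise mean and standard deviation are set almost exactly by `shift` and `scale`, which is what makes the emotion category and intensity directly controllable through these parameters.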
Journal introduction:
The IEEE Transactions on Affective Computing is an international and interdisciplinary journal. Its primary goal is to share research findings on the development of systems capable of recognizing, interpreting, and simulating human emotions and related affective phenomena. The journal publishes original research on the underlying principles and theories that explain how and why affective factors shape human-technology interactions. It also focuses on how techniques for sensing and simulating affect can enhance our understanding of human emotions and processes. Additionally, the journal explores the design, implementation, and evaluation of systems that prioritize the consideration of affect in their usability. We also welcome surveys of existing work that provide new perspectives on the historical and future directions of this field.