Controllable Emphatic Speech Synthesis based on Forward Attention for Expressive Speech Synthesis

Liangqi Liu, Jiankun Hu, Zhiyong Wu, Song Yang, Songfan Yang, Jia Jia, Helen Meng
{"title":"Controllable Emphatic Speech Synthesis based on Forward Attention for Expressive Speech Synthesis","authors":"Liangqi Liu, Jiankun Hu, Zhiyong Wu, Song Yang, Songfan Yang, Jia Jia, H. Meng","doi":"10.1109/SLT48900.2021.9383537","DOIUrl":null,"url":null,"abstract":"In speech interaction scenarios, speech emphasis is essential for expressing the underlying intention and attitude. Recently, end-to-end emphatic speech synthesis greatly improves the naturalness of synthetic speech, but also brings new problems: 1) lack of interpretability for how emphatic codes affect the model; 2) no separate control of emphasis on duration and on intonation and energy. We propose a novel way to build an interpretable and controllable emphatic speech synthesis framework based on forward attention. Firstly, we explicitly model the local variation of speaking rate for emphasized words and neutral words with modified forward attention to manifest emphasized words in terms of duration. The 2-layers LSTM in decoder is further divided into attention-RNN and decoder-RNN to disentangle the influence of emphasis on duration and on intonation and energy. The emphasis information is injected into decoder-RNN for highlighting emphasized words in the aspects of intonation and energy. Experimental results have shown that our model can not only provide separate control of emphasis on duration and on intonation and energy, but also generate more robust and prominent emphatic speech with high quality and naturalness.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT48900.2021.9383537","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

In speech interaction scenarios, speech emphasis is essential for expressing the underlying intention and attitude. Recently, end-to-end emphatic speech synthesis has greatly improved the naturalness of synthetic speech, but it also brings new problems: 1) a lack of interpretability in how emphasis codes affect the model; 2) no separate control of emphasis on duration versus on intonation and energy. We propose a novel way to build an interpretable and controllable emphatic speech synthesis framework based on forward attention. First, we explicitly model the local variation of speaking rate for emphasized and neutral words with a modified forward attention mechanism, manifesting emphasized words in terms of duration. The two-layer LSTM in the decoder is further divided into an attention-RNN and a decoder-RNN to disentangle the influence of emphasis on duration from its influence on intonation and energy. The emphasis information is injected into the decoder-RNN to highlight emphasized words in terms of intonation and energy. Experimental results show that our model not only provides separate control of emphasis on duration and on intonation and energy, but also generates more robust and prominent emphatic speech with high quality and naturalness.
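The abstract outlines two mechanisms: a modified forward attention that slows alignment transitions on emphasized words (stretching their duration), and a decoder split so that emphasis conditions only the decoder-RNN. Below is a minimal PyTorch sketch of both ideas. The forward-attention recursion follows the standard transition-agent formulation (Zhang and Xie, 2018) that the paper builds on; `slow_down_on_emphasis`, `SplitDecoderCell`, and all dimensions are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of emphasis-aware forward attention and a split decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

def forward_attention_step(energies, prev_alpha, u):
    """One decoder step of forward attention with a transition agent.
    energies:   (B, N) raw attention scores over encoder outputs
    prev_alpha: (B, N) forward-attention weights from the previous step
    u:          (B, 1) transition probability in [0, 1]
    """
    y = torch.softmax(energies, dim=-1)
    # Shift prev_alpha right by one so position n can inherit mass from n-1.
    shifted = F.pad(prev_alpha, (1, 0))[:, :-1]
    alpha = ((1.0 - u) * prev_alpha + u * shifted) * y
    return alpha / (alpha.sum(dim=-1, keepdim=True) + 1e-8)

def slow_down_on_emphasis(u, prev_alpha, emph_flags, slow=0.5):
    """Assumed realization of the duration-control idea: lower the transition
    probability while the alignment sits on an emphasized word, so the model
    lingers there longer.
    emph_flags: (B, N) 1.0 for emphasized tokens, 0.0 for neutral tokens
    """
    emph_here = (prev_alpha * emph_flags).sum(dim=-1, keepdim=True)  # (B, 1)
    return u * (1.0 - (1.0 - slow) * emph_here)

class SplitDecoderCell(nn.Module):
    """Two-layer decoder split into attention-RNN and decoder-RNN. Only the
    decoder-RNN sees the emphasis embedding, so alignment/duration and
    intonation/energy can be controlled separately."""
    def __init__(self, in_dim, hid_dim, emph_dim):
        super().__init__()
        self.attn_rnn = nn.LSTMCell(in_dim, hid_dim)
        self.dec_rnn = nn.LSTMCell(hid_dim + emph_dim, hid_dim)

    def forward(self, x, emph_emb, attn_state, dec_state):
        attn_h, attn_c = self.attn_rnn(x, attn_state)       # drives alignment only
        dec_in = torch.cat([attn_h, emph_emb], dim=-1)      # emphasis injected here
        dec_h, dec_c = self.dec_rnn(dec_in, dec_state)      # shapes intonation/energy
        return dec_h, (attn_h, attn_c), (dec_h, dec_c)
```

In this sketch, keeping the emphasis embedding out of the attention-RNN means duration is governed solely by the transition-scaled forward attention, while the embedding shapes the frame-level acoustics, which mirrors the separate-control claim in the abstract.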