A multi-task learning speech synthesis optimization method based on CWT: a case study of Tacotron2

EURASIP Journal on Advances in Signal Processing (IF 1.9; Zone 4, Engineering & Technology; JCR Q2, Engineering). Pub Date: 2024-01-02. DOI: 10.1186/s13634-023-01096-x

Guoqiang Hu, Zhuofan Ruan, Wenqiu Guo, Yujuan Quan


Citations: 0

Abstract

Text-to-speech (TTS) synthesis plays an essential role in human-computer interaction. Most current TTS acoustic models use only the Mel spectrogram as the intermediate feature for converting text to speech. However, Mel spectrograms can be blurred in some respects, because the Fourier transform used to compute them has limited ability to capture abrupt signal changes. To improve the clarity of synthesized speech, this study proposes a multi-task learning optimization method and evaluates it on the Tacotron2 speech synthesis system. The method introduces an additional task based on wavelet spectrograms. The continuous wavelet transform (CWT) is widely used in applications such as speech enhancement and speech recognition, primarily because it adapts its time-frequency resolution and captures non-stationary signals well. Through theoretical and experimental analysis, the study shows that using the wavelet spectrogram as an auxiliary task improves the clarity of Tacotron2 synthesized speech: a feature extraction network is added, and wavelet-spectrogram features are extracted from the Mel spectrum output generated by the decoder. In the experiments, speech synthesized by the multi-task model achieves a Mean Opinion Score 0.17 higher than that of the baseline model. Furthermore, by analyzing why the CWT-based multi-task learning method succeeds in Tacotron2, and why multi-task learning is effective, the study conjectures that the proposed method could also improve the performance of other acoustic models.
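The two ingredients named in the abstract, a wavelet spectrogram and an auxiliary-task loss, can be illustrated with a minimal sketch. The abstract does not state which mother wavelet, loss function, or task weighting the authors use, so the complex Morlet wavelet, the mean-squared-error loss, and the `aux_weight` parameter below are all assumptions chosen for illustration, and every function name is hypothetical. The CWT here is a naive direct summation meant for short toy signals, not an efficient implementation.

```python
import cmath
import math


def morlet(t, w0=6.0):
    """Complex Morlet mother wavelet (assumed choice, not from the paper)."""
    return math.pi ** -0.25 * cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)


def cwt_scalogram(signal, scales, dt=1.0, w0=6.0):
    """Naive O(N^2) CWT: returns |W(a, b)| for each scale a and time shift b."""
    n = len(signal)
    rows = []
    for a in scales:
        row = []
        for b in range(n):
            acc = 0j
            for k in range(n):
                # Correlate with the conjugated, scaled, shifted wavelet.
                acc += signal[k] * morlet((k - b) * dt / a, w0).conjugate()
            row.append(abs(acc) * dt / math.sqrt(a))
        rows.append(row)
    return rows


def mse(pred, target):
    """Mean squared error over two equally shaped 2-D lists."""
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    count = sum(len(row) for row in target)
    return total / count


def multitask_loss(mel_pred, mel_true, wav_pred, wav_true, aux_weight=0.5):
    """Main Mel loss plus a weighted auxiliary wavelet-spectrogram loss.

    aux_weight is a hypothetical hyperparameter; the abstract gives no value.
    """
    return mse(mel_pred, mel_true) + aux_weight * mse(wav_pred, wav_true)
```

In this sketch, `wav_pred` stands in for the output of the added feature extraction network applied to the decoder's Mel output, and `wav_true` for a scalogram such as the one returned by `cwt_scalogram` on the reference audio. For a Morlet wavelet with `w0 = 6`, a scale of roughly `w0 / (2 * math.pi * f)` samples responds most strongly to a tone at normalized frequency `f`, which is a convenient way to pick `scales` for a toy signal.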


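Source journal

EURASIP Journal on Advances in Signal Processing (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 3.50

Self-citation rate: 10.50%

Publication count: 109

Review time: 2.6 months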

About the journal: The aim of the EURASIP Journal on Advances in Signal Processing is to highlight the theoretical and practical aspects of signal processing in new and emerging technologies. The journal is directed as much at the practicing engineer as at the academic researcher. Authors of articles with novel contributions to the theory and/or practice of signal processing are welcome to submit their articles for consideration.