Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation

Speech Prosody 2022 | Pub Date: 2022-05-23 | DOI: 10.21437/speechprosody.2022-91

Àlex Peiró-Lilja, Guillermo Cámbara, M. Farrús, J. Luque


Citations: 2

Abstract


Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation

Current text-to-speech (TTS) systems are deep learning-based models capable of learning phonetic articulation and intelligibility, as well as the prosodic attributes that model speaking style and give synthetic voices their naturalness. However, the performance of these models depends heavily on their training hyperparameters and the number of training iterations. Moreover, a conventional loss function does not reflect how well the voice is actually modeled; we therefore believe that a dedicated training assessment for TTS is needed. To this end, we monitor intelligibility and naturalness during the training of a Tacotron2 model in a two-step process. First, we analyze a method for tracking the intelligibility of the TTS output in terms of character-level token error rate (TER), using five different automatic speech recognition (ASR) systems. Second, we extend this work with a recently published TTS naturalness predictor that estimates naturalness in terms of mean opinion scores (MOS). Finally, we unify the predicted MOS with the TER measurements to return, for each training checkpoint, a single score that we name the Full Assessment Score (FAS). We report a notable preference among our listeners for the checkpoint with maximum FAS over the one with minimum validation loss, in both intelligibility and naturalness (up to 62.3% in the latter).
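To make the monitoring idea concrete, the following is a minimal sketch of how a character-level error rate could be computed from ASR transcripts and averaged over several recognizers. The abstract does not give the FAS formula, so `illustrative_score` below is a hypothetical combination of predicted MOS and TER chosen only for illustration; it is not the authors' definition. All function names, checkpoint names, and the assumed MOS range of 1 to 5 are assumptions as well.

```python
from statistics import mean


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Character-level error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def mean_ter(reference, transcripts):
    """Average the error rate over transcripts from several ASR systems."""
    return mean(char_error_rate(reference, t) for t in transcripts)


def illustrative_score(predicted_mos, ter):
    """HYPOTHETICAL combination, not the paper's FAS formula.

    Assumes MOS lies in [1, 5]; rescales it to [0, 1] and discounts it
    by the (clipped) error rate, so higher is better.
    """
    mos_norm = (predicted_mos - 1.0) / 4.0
    return mos_norm * (1.0 - min(ter, 1.0))


def best_checkpoint(checkpoints):
    """Pick the checkpoint name whose (predicted MOS, TER) pair scores highest."""
    return max(checkpoints, key=lambda name: illustrative_score(*checkpoints[name]))
```

Under these assumptions, a checkpoint would be selected by calling `best_checkpoint` on a mapping from checkpoint names to `(predicted_mos, ter)` pairs, in place of picking the checkpoint with minimum validation loss.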

