Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation

Àlex Peiró-Lilja, Guillermo Cámbara, M. Farrús, J. Luque
Speech Prosody 2022. Published 2022-05-23. DOI: 10.21437/speechprosody.2022-91 (https://doi.org/10.21437/speechprosody.2022-91)
Citations: 2

Abstract

Current text-to-speech (TTS) systems are deep learning-based models capable of learning phonetic articulation and intelligibility, as well as the prosodic attributes that model speaking style and give synthetic voices their naturalness. However, the performance of these models depends heavily on the training hyper-parameters and the number of training iterations. Moreover, a conventional loss function does not reflect how well the voice is modeled; we therefore believe a dedicated training assessment for TTS is needed. To this end, we monitor intelligibility and naturalness during the training of a Tacotron2 model in a two-step process. First, we analyze a method that tracks the intelligibility of the TTS in terms of character-level token error rate (TER), using five different automatic speech recognition (ASR) systems. Second, we extend this work with a recently published TTS naturalness predictor that estimates naturalness in terms of mean opinion scores (MOS). Finally, we unify the predicted MOS with the TER measurements to return, for each training checkpoint, a single score that we name the Full Assessment Score (FAS). We report a relevant preference among our listeners for the checkpoint with maximum FAS over the one with minimum validation loss, in both intelligibility and naturalness, reaching up to 62.3% in the latter.
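The monitoring idea described in the abstract can be illustrated with a minimal Python sketch. It computes a character-level error rate between a reference text and ASR transcripts of the synthesized audio, averages it over several ASR systems, and picks the checkpoint with the highest combined score. The abstract does not give the TER normalization or the FAS formula, so everything here is an assumption: the error rate is taken as Levenshtein distance divided by reference length, and `illustrative_score` is a hypothetical stand-in for a combined score, not the authors' FAS. All function names are invented for illustration.

```python
from typing import Dict, List


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(ref: str, hyp: str) -> float:
    """Character-level error rate: edit distance over reference length.

    ASSUMPTION: the paper's exact TER definition may differ.
    """
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def mean_ter(ref: str, hyps: List[str]) -> float:
    """Average the error rate over transcripts from several ASR systems."""
    if not hyps:
        raise ValueError("at least one ASR transcript is required")
    return sum(char_error_rate(ref, h) for h in hyps) / len(hyps)


def illustrative_score(mos: float, ter: float) -> float:
    """HYPOTHETICAL combined score; NOT the paper's FAS formula.

    Maps a MOS on the 1-5 scale to [0, 1] and scales it by (1 - TER),
    with TER clipped to 1 so the result stays in [0, 1].
    """
    mos_norm = (mos - 1.0) / 4.0
    return mos_norm * (1.0 - min(ter, 1.0))


def best_checkpoint(scores: Dict[str, float]) -> str:
    """Return the checkpoint name with the highest combined score."""
    return max(scores, key=scores.get)
```

In use, one would synthesize a fixed set of sentences at each checkpoint, transcribe them with each ASR system, pass the transcripts to `mean_ter`, obtain a predicted MOS from a naturalness predictor, and compare the resulting scores across checkpoints with `best_checkpoint`.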