Data Processing for Optimizing Naturalness of Vietnamese Text-to-speech System

V. Phung, Phan Huy Kinh, Anh-Tuan Dinh, Quoc Bao Nguyen
DOI: 10.1109/O-COCOSDA50338.2020.9295025
Published in: 2020 23rd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA)
Publication date: 2020-04-20
Citations: 3

Abstract

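End-to-end text-to-speech (TTS) systems have proved highly successful when a large amount of high-quality training data, recorded in an anechoic room with high-quality microphones, is available. An alternative is to use readily available found data, such as radio broadcast news. We aim to optimize the naturalness of a TTS system trained on found data using a novel data processing method. The method consists of 1) utterance selection and 2) prosodic punctuation insertion, which together prepare training data that optimizes the naturalness of TTS systems. We show that with this data processing method, an end-to-end TTS system achieved a mean opinion score (MOS) of 4.1, compared with 4.3 for natural speech, and that punctuation insertion contributed most to this result. To facilitate research and development of TTS systems, we have released the processed data as the Zalo-TTS database at https://forms.gle/6Hk5YkqgDxAaC2BU6. It consists of 18 hours of speech from a single speaker with a Hanoi dialect, sampled at 44.1 kHz.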
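The abstract does not say how utterances are selected or where punctuation is inserted. The Python sketch below illustrates one plausible reading of the two steps, assuming a forced aligner has already produced the silence that follows each word. The `Utterance` class, the 1 to 15 second duration bounds and the 0.15 s pause threshold are hypothetical and are not the authors' method. The sketch also shows how an MOS figure such as 4.1 is obtained: it is the arithmetic mean of listener ratings on a 1 to 5 scale.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical aligned utterance: each word is paired with the silence (in
# seconds) that follows it, e.g. from a forced aligner. Words are assumed to
# carry no punctuation yet. This structure is an assumption for illustration.
@dataclass
class Utterance:
    words: List[str]
    pauses: List[float]  # pause after each word; same length as words
    duration: float      # total audio length in seconds


def select_utterances(utts: List[Utterance], min_dur: float = 1.0,
                      max_dur: float = 15.0) -> List[Utterance]:
    """Keep utterances whose length lies in [min_dur, max_dur] (hypothetical criterion)."""
    return [u for u in utts if min_dur <= u.duration <= max_dur]


def insert_prosodic_punctuation(utt: Utterance, min_pause: float = 0.15) -> str:
    """Rebuild the transcript, adding a comma after any word followed by a long pause."""
    if len(utt.words) != len(utt.pauses):
        raise ValueError("words and pauses must have the same length")
    if not utt.words:
        return ""
    last = len(utt.words) - 1
    out = []
    for i, (word, pause) in enumerate(zip(utt.words, utt.pauses)):
        # Only internal pauses become commas; the final word gets a period below.
        if i < last and pause >= min_pause and not word.endswith((",", ".")):
            word += ","
        out.append(word)
    text = " ".join(out)
    return text if text.endswith((".", "?", "!")) else text + "."


def mean_opinion_score(ratings: List[float]) -> float:
    """MOS: arithmetic mean of listener ratings on a 1-5 scale."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be a non-empty list of values in [1, 5]")
    return sum(ratings) / len(ratings)
```

For example, `insert_prosodic_punctuation(Utterance(["hôm", "nay", "trời", "đẹp", "quá"], [0.02, 0.30, 0.05, 0.01, 0.0], 2.1))` returns `"hôm nay, trời đẹp quá."`, because only the pause after "nay" exceeds the threshold. Tying punctuation to measured pauses, rather than to written text alone, is one way to make the transcript reflect how the speaker actually phrased the sentence.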