Methodology for designing and creating Hindi speech corpus

D. Magdum, Manisha Shukla Dubey, T. Patil, Ronak Shah, S. Belhe, Mahesh Kulkarni
Published in: 2015 International Conference on Signal Processing and Communication Engineering Systems
Publication date: 2015-03-12
DOI: 10.1109/SPACES.2015.7058279 (https://doi.org/10.1109/SPACES.2015.7058279)
Citations: 9

Abstract

In this paper we describe the methodologies used for data collection and recording for our Hindi text-to-speech (TTS) system. The design of the speech corpus plays a very important role in the overall quality of a text-to-speech system. A large text corpus of one million words was created for the existing text-to-speech system. We crawled text from many domains, such as finance, government, and current news, and also drew on pre-built dictionaries. For the first time, we also generated and incorporated text from Hindi Short Message Service (SMS) messages. The aim was to build a generic speech corpus for Hindi. The crawled text was first filtered for correctness, e.g. spelling mistakes, validity as Hindi, and word length. The filtered words were then carefully analyzed to ensure that the prepared text is phonetically balanced. This curated text was then recorded by a professional recordist in a studio environment. The recorded speech data was then processed and annotated to generate the final speech corpus. The paper explains the speech corpus creation process, covering the text crawling, filtering, recording, and annotation phases. The Hindi text-to-speech system built on the final speech corpus achieves a mean opinion score (MOS) of 2.8.
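The abstract does not say how the filtering and phonetic balancing were implemented. As a rough illustration only, the Python sketch below shows one way such steps could look. It assumes that "validity to Hindi" can be approximated by checking that a word is written entirely in Devanagari, that word length is measured in Unicode code points, and that phonetic balance can be approximated by greedily choosing sentences that add the most not-yet-covered Devanagari characters. The function names, the length limits, and the greedy strategy are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: names, thresholds, and strategy are assumptions,
# not the method used by the authors.

DANDAS = "\u0964\u0965"  # Devanagari punctuation marks, not letters


def is_devanagari_char(ch):
    """True for characters in the Devanagari block (U+0900-U+097F), excluding dandas."""
    return "\u0900" <= ch <= "\u097f" and ch not in DANDAS


def is_devanagari_word(word):
    """Crude stand-in for 'validity to Hindi': every character is Devanagari."""
    return bool(word) and all(is_devanagari_char(ch) for ch in word)


def filter_words(words, min_len=2, max_len=20):
    """Keep Devanagari-only words whose length (in code points) is within limits."""
    return [
        w for w in words
        if is_devanagari_word(w) and min_len <= len(w) <= max_len
    ]


def units(sentence):
    """Set of Devanagari characters in a sentence, a rough proxy for phonetic units."""
    return {ch for ch in sentence if is_devanagari_char(ch)}


def select_balanced(sentences, budget):
    """Greedily pick up to `budget` sentences, each adding the most uncovered units."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= units(best)
        remaining.remove(best)
    return chosen, covered
```

A real pipeline would more likely work on phonemes or syllables obtained from a grapheme-to-phoneme step, and would weight units by frequency rather than only by coverage; the character-level proxy here just keeps the sketch self-contained.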