{"title":"Corpus-based evaluation of Chinese text normalization","authors":"Sunhee Kim","doi":"10.1109/ICSDA.2017.8384473","DOIUrl":null,"url":null,"abstract":"This paper aims to present a method of developing a corpus consisting of various categories of Non-Standard Words (NSWs) and a representative test set which will be used for the evaluation of the text normalization modules proposed for Standard Mandarin and Taiwanese Mandarin. A total of 191,431 sentences with NSWs are extracted for the Standard Mandarin and a total of 731,524 sentences with NSWs are extracted for Taiwanese Mandarin. In order to make a representative test set, 1,000 sentences for Standard Mandarin and Taiwanese Mandarin are randomly chosen from these sentences, maintaining the same proportion of the source corpus as well as the similar proportion of each category of NSWs.","PeriodicalId":255147,"journal":{"name":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","volume":"353 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSDA.2017.8384473","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

This paper presents a method for developing a corpus that covers various categories of Non-Standard Words (NSWs), together with a representative test set for evaluating the text normalization modules proposed for Standard Mandarin and Taiwanese Mandarin. A total of 191,431 sentences containing NSWs are extracted for Standard Mandarin, and 731,524 for Taiwanese Mandarin. To build a representative test set, 1,000 sentences for Standard Mandarin and Taiwanese Mandarin are randomly chosen from these extracted sentences, keeping the same proportions as the source corpus and a similar proportion of each NSW category.
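The sampling step described in the abstract can be illustrated with a proportional stratified sample. The sketch below is not the paper's procedure; it is a minimal illustration that assumes each extracted sentence carries exactly one stratum label (for example an NSW category, or a source-and-category pair). The function names `allocate` and `stratified_sample`, the category labels `number`, `date`, and `abbrev`, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def allocate(counts, n):
    """Split n slots across strata in proportion to counts (largest remainder)."""
    total = sum(counts.values())
    if n > total:
        raise ValueError("cannot sample more items than are available")
    exact = {k: n * c / total for k, c in counts.items()}
    quota = {k: int(v) for k, v in exact.items()}
    leftover = n - sum(quota.values())
    # Give the remaining slots to the strata with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_fraction[:leftover]:
        quota[k] += 1
    return quota


def stratified_sample(items, stratum_of, n, seed=0):
    """Randomly pick n items while preserving the stratum proportions of items."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    quota = allocate({k: len(v) for k, v in strata.items()}, n)
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, quota[k]))
    return sample


# Toy data: (hypothetical category label, sentence id). Not from the paper.
toy = (
    [("number", i) for i in range(60)]
    + [("date", i) for i in range(30)]
    + [("abbrev", i) for i in range(10)]
)
picked = stratified_sample(toy, stratum_of=lambda s: s[0], n=10)
print(Counter(label for label, _ in picked))
```

To keep both the source-corpus proportion and the category proportion at once, `stratum_of` could return a `(source, category)` tuple instead of the category alone; whether the paper stratifies jointly or separately is not stated in the abstract.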