Corpus-based evaluation of Chinese text normalization
Sunhee Kim
2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), 2017-11-01
DOI: 10.1109/ICSDA.2017.8384473 (https://doi.org/10.1109/ICSDA.2017.8384473)
Citation count: 1
Abstract
This paper presents a method for developing a corpus covering various categories of Non-Standard Words (NSWs), together with a representative test set for evaluating the text normalization modules proposed for Standard Mandarin and Taiwanese Mandarin. A total of 191,431 sentences containing NSWs are extracted for Standard Mandarin, and 731,524 sentences containing NSWs are extracted for Taiwanese Mandarin. To build a representative test set, 1,000 sentences for Standard Mandarin and Taiwanese Mandarin are randomly chosen from these sentences, preserving the proportions of the source corpora and keeping the proportion of each NSW category similar to that in the extracted sentences.
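
The abstract does not spell out how the test sentences are drawn, so the following is only a minimal sketch of one way to take a random sample that keeps group proportions close to those of the full set. It assumes proportional stratified sampling with largest-remainder rounding. The function name `stratified_sample`, the `(source, category)` labels, and the toy data are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Randomly pick n items so each stratum keeps roughly its original share.

    `key` maps an item to its stratum label, for example a hypothetical
    (source corpus, NSW category) pair.
    """
    if n > len(items):
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)

    # Group the items by stratum label.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Largest-remainder allocation in integer arithmetic: every stratum
    # gets the floor of its proportional quota, and the remaining slots
    # go to the strata with the largest remainders.
    total = len(items)
    alloc, remainder = {}, {}
    for label, group in strata.items():
        alloc[label], remainder[label] = divmod(n * len(group), total)
    leftover = n - sum(alloc.values())
    by_remainder = sorted(remainder, key=remainder.get, reverse=True)
    for label in by_remainder[:leftover]:
        alloc[label] += 1

    # Draw without replacement inside each stratum.
    sample = []
    for label, group in strata.items():
        sample.extend(rng.sample(group, alloc[label]))
    return sample


# Toy usage with made-up labels: 600 sentences from one stratum, 400 from another.
sentences = [("news", "date")] * 600 + [("web", "number")] * 400
test_set = stratified_sample(sentences, key=lambda s: s, n=100)
```

With strata defined jointly by source and category, both proportions are preserved at once, up to rounding. Very small strata can round down to zero sentences, which may be one reason category proportions can only be kept "similar" rather than identical.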