Corpus-based evaluation of Chinese text normalization

2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). Pub Date: 2017-11-01. DOI: 10.1109/ICSDA.2017.8384473

Sunhee Kim

Citations: 1

Abstract

This paper presents a method for developing a corpus that covers various categories of Non-Standard Words (NSWs), together with a representative test set for evaluating the text normalization modules proposed for Standard Mandarin and Taiwanese Mandarin. A total of 191,431 sentences containing NSWs are extracted for Standard Mandarin, and a total of 731,524 sentences containing NSWs are extracted for Taiwanese Mandarin. To build a representative test set, 1,000 sentences for Standard Mandarin and Taiwanese Mandarin are randomly chosen from these sentences, maintaining the same proportion of the source corpus and a similar proportion of each NSW category.
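The sampling step described in the abstract (a random draw that keeps the proportions of the source corpus and of each NSW category) can be illustrated with a small stratified-sampling sketch. This is not the paper's implementation: the function name `stratified_sample`, the `(source, category, text)` tuple layout, and the largest-remainder rounding rule are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(sentences, n, seed=0):
    """Draw n sentences so that each (source, category) stratum keeps
    roughly the share it has in the full collection.

    `sentences` is a list of (source, category, text) tuples. This layout
    is hypothetical and not taken from the paper.
    """
    total = len(sentences)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the number of sentences")

    # Group sentences by stratum, i.e. by (source corpus, NSW category).
    strata = defaultdict(list)
    for item in sentences:
        source, category, _text = item
        strata[(source, category)].append(item)

    # Proportional allocation with integer arithmetic: every stratum gets
    # floor(n * size / total) slots, and the slots lost to rounding go to
    # the strata with the largest remainders (an assumed rounding rule).
    alloc = {}
    remainders = {}
    for key, items in strata.items():
        alloc[key], remainders[key] = divmod(n * len(items), total)
    leftover = n - sum(alloc.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for key in by_remainder[:leftover]:
        alloc[key] += 1

    # Sample without replacement inside each stratum.
    rng = random.Random(seed)
    sample = []
    for key, items in strata.items():
        sample.extend(rng.sample(items, alloc[key]))
    return sample
```

Calling `stratified_sample(corpus, 1000)` on a list of tagged sentences would return 1,000 items whose stratum shares match the input up to rounding. Whether the paper stratifies jointly on source and category, as this sketch does, is not stated in the abstract.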


