Statistical Analysis of Mandarin Acoustic Units and Automatic Extraction of Phonetically Rich Sentences Based Upon a very Large Chinese Text Corpus

Int. J. Comput. Linguistics Chin. Lang. Process. Pub Date : 1998-08-01 DOI:10.30019/IJCLCLP.199808.0005

H. Wang


Citations: 9

Abstract

Automatic speech recognition can provide the most convenient way for humans to communicate with computers. Because Chinese is not an alphabetic language and entering Chinese characters into a computer is very difficult, Mandarin speech recognition is highly desirable. High-performance speech recognition systems have recently begun to emerge from research institutes. However, an adequate speech database for training acoustic models and evaluating performance is believed to be critical for deploying such systems successfully in realistic operating environments. Designing a set of phonetically rich sentences for efficiently training and evaluating a speech recognition system has therefore become very important. This paper first presents a statistical analysis of various Mandarin acoustic units based on a very large Chinese text corpus collected from daily newspapers, and then presents an algorithm that automatically extracts phonetically rich sentences from that corpus for use in training and evaluating a Mandarin speech recognition system.
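
The abstract does not describe the extraction algorithm itself, so the following is only an illustrative sketch of one common way to pick phonetically rich sentences: a greedy coverage heuristic, which is not necessarily the method used in the paper. It assumes each sentence has already been transcribed into a list of acoustic units (here, toneless syllables written in pinyin); the function names and the toy corpus are hypothetical.

```python
from collections import Counter


def unit_frequencies(sentences):
    """Count how often each acoustic unit occurs across all sentences."""
    counts = Counter()
    for units in sentences:
        counts.update(units)
    return counts


def select_rich_sentences(sentences, max_sentences):
    """Greedy heuristic (an assumption, not the paper's algorithm):
    repeatedly pick the sentence that adds the most not-yet-covered units."""
    covered = set()
    chosen = []
    remaining = set(range(len(sentences)))
    while remaining and len(chosen) < max_sentences:
        # sorted() makes ties resolve to the lowest sentence index.
        best = max(sorted(remaining),
                   key=lambda i: len(set(sentences[i]) - covered))
        if not set(sentences[best]) - covered:
            break  # no remaining sentence contributes a new unit
        chosen.append(best)
        covered |= set(sentences[best])
        remaining.remove(best)
    return chosen, covered


# Hypothetical toy corpus: each sentence is a list of syllables.
corpus = [
    ["ni", "hao"],
    ["ni", "hao", "ma"],
    ["wo", "hen", "hao"],
    ["ma"],
]

if __name__ == "__main__":
    print(unit_frequencies(corpus).most_common())
    print(select_rich_sentences(corpus, max_sentences=3))
```

A realistic selector would likely also weight rare units more heavily and cover context-dependent units, but the abstract gives no detail on those choices.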

