A language model based on semantically clustered words in a Chinese character recognition system

Proceedings of 3rd International Conference on Document Analysis and Recognition. Pub Date: 1995-08-14. DOI: 10.1109/ICDAR.1995.599033

Hsi-Jian Lee, Cheng-Huang Tung

Citations: 11

Abstract

This paper presents a new method for clustering the words in a dictionary into word groups, which are then used by the language model of a Chinese character recognition system to describe contextual information. The Chinese synonym dictionary Tong2yi4ci2 ci2lin2, which provides the semantic features, is used to train the weights of the semantic attributes of the character-based word classes. These weights are then updated according to the words of the behavior dictionary, which has a fairly complete word set. The updated word classes are clustered into m groups by a greedy method according to a semantic measure, and the words in the behavior dictionary are finally assigned to the m groups. The parameter space for the bigram contextual information of the character recognition system is therefore m^2. In the experiments, the recognition system with the proposed model performed better than one using a character-based bigram language model.
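To make the m^2 parameter space concrete, here is a minimal Python sketch of a group-based bigram model. It is not the authors' implementation: the function names (`train_group_bigram`, `best_sequence`), the add-alpha smoothing, the brute-force search, and the toy English vocabulary are all assumptions chosen for illustration. The word-to-group assignment is taken as given, standing in for the output of the clustering step.

```python
from itertools import product


def train_group_bigram(sentences, word_to_group, m, alpha=1.0):
    """Estimate P(g2 | g1) over m word groups with add-alpha smoothing.

    The table has m * m entries, which is the bigram parameter space.
    """
    counts = [[0.0] * m for _ in range(m)]
    for words in sentences:
        groups = [word_to_group[w] for w in words]
        for g1, g2 in zip(groups, groups[1:]):
            counts[g1][g2] += 1.0
    probs = []
    for row in counts:
        total = sum(row) + alpha * m
        probs.append([(c + alpha) / total for c in row])
    return probs


def best_sequence(candidates, word_to_group, probs):
    """Choose one candidate per position by maximizing the product of
    group-bigram probabilities (exhaustive search, fine for toy input)."""
    best, best_score = None, -1.0
    for seq in product(*candidates):
        score = 1.0
        for a, b in zip(seq, seq[1:]):
            score *= probs[word_to_group[a]][word_to_group[b]]
        if score > best_score:
            best, best_score = seq, score
    return best


# Hypothetical toy data: 6 words assigned to m = 3 groups.
word_to_group = {"cat": 0, "dog": 0, "eats": 1, "runs": 1, "fish": 2, "bone": 2}
training = [["cat", "eats", "fish"], ["dog", "eats", "bone"], ["dog", "runs"]]
probs = train_group_bigram(training, word_to_group, m=3)

# Recognizer output: an ordered candidate list for each position.
candidates = [["cat"], ["fish", "eats"], ["bone", "runs"]]
print(best_sequence(candidates, word_to_group, probs))
```

Because the probabilities are stored per group pair rather than per word or character pair, the sketch can prefer "cat eats bone" even though that exact word sequence never occurs in the toy training data.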
