Two-Character Chinese Word Extraction Based on Hybrid of Internal and Contextual Measures

Workshop on Chinese Language Processing. Pub Date: 2003-07-11. DOI: 10.3115/1119250.1119254

Shengfen Luo, Maosong Sun

Citations: 37

Abstract

Word extraction is an important task in text information processing. Statistical measures for word extraction fall into two main kinds: internal measures and contextual measures. This paper examines both kinds for Chinese word extraction. First, nine widely adopted internal measures are tested and compared individually. Then various schemes for combining these measures are tried in order to improve performance. Finally, left/right entropy is integrated to assess the effect of contextual measures. A genetic algorithm is explored to automatically adjust the combination weights and thresholds. Experiments on two-character Chinese word extraction show promising results: mutual information, the most powerful internal measure, achieves an F-measure of 57.82%, whereas the best combination scheme of internal measures achieves 59.87%. Integrating the contextual measure raises the F-measure to 68.48%.
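To make the two kinds of measure concrete, the following is a minimal sketch, not the authors' implementation. It computes pointwise mutual information (one common form of the internal measure) and left/right entropy (the contextual measure) for a two-character candidate over a raw character string. The function names, the toy corpus, and the simple two-threshold decision rule are assumptions made for illustration; the abstract does not give the paper's actual combination scheme or the thresholds tuned by the genetic algorithm.

```python
import math
from collections import Counter


def pointwise_mi(corpus: str, bigram: str) -> float:
    """Internal measure: log2(p(xy) / (p(x) * p(y))) for a two-character string."""
    n_chars = len(corpus)
    n_bigrams = len(corpus) - 1
    char_counts = Counter(corpus)
    bigram_counts = Counter(corpus[i:i + 2] for i in range(n_bigrams))
    p_xy = bigram_counts[bigram] / n_bigrams
    if p_xy == 0:
        return float("-inf")  # candidate never occurs in the corpus
    p_x = char_counts[bigram[0]] / n_chars
    p_y = char_counts[bigram[1]] / n_chars
    return math.log2(p_xy / (p_x * p_y))


def side_entropy(corpus: str, bigram: str, side: str) -> float:
    """Contextual measure: entropy of the characters adjacent to the candidate.

    side is "left" or "right". A high value means the candidate appears in
    many different contexts, which suggests it is a free-standing unit.
    """
    neighbours = Counter()
    start = corpus.find(bigram)
    while start != -1:
        idx = start - 1 if side == "left" else start + 2
        if 0 <= idx < len(corpus):
            neighbours[corpus[idx]] += 1
        start = corpus.find(bigram, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())


def is_word_candidate(corpus: str, bigram: str,
                      mi_threshold: float, entropy_threshold: float) -> bool:
    """Hypothetical hybrid rule: accept only if both measures clear a threshold.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    mi = pointwise_mi(corpus, bigram)
    context = min(side_entropy(corpus, bigram, "left"),
                  side_entropy(corpus, bigram, "right"))
    return mi >= mi_threshold and context >= entropy_threshold


# Toy corpus (Latin letters stand in for Chinese characters).
toy_corpus = "xabyabzabw"
accepted = is_word_candidate(toy_corpus, "ab", mi_threshold=1.0, entropy_threshold=1.0)
```

In this toy corpus "ab" recurs with three different neighbours on each side, so both its mutual information and its left/right entropy are comparatively high. A string that always appears inside the same longer sequence would score zero entropy on that side, which is the kind of case the contextual measure is meant to catch.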


