Chinese term extraction from web pages based on expected point-wise mutual information

2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). Pub Date: 2016-08-01. DOI: 10.1109/FSKD.2016.7603424

Liping Du, Xiaoge Li, Dayi Lin

Citations: 5

Abstract

Point-wise Mutual Information (PMI) has been widely used in lexicon construction, term extraction, and text mining. However, PMI has a well-known tendency to overvalue the relatedness of word pairs that involve low-frequency words. To overcome this limitation, Expected Point-wise Mutual Information (PMIK) has been proposed empirically. In this paper, we propose an automatic term recognition system for Chinese and prove theoretically that, with the parameter k ≥ 3, the PMIK method can overcome the bias toward low-frequency words. Experimental results on Chinese SINA blog and Baidu Tieba corpora show that, with a proper k value of 5, the system achieves a precision greater than 81% for the top 1000 extracted terms without decreasing recall.
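The low-frequency bias described in the abstract can be illustrated with a small sketch. It assumes the commonly used form PMI^k(x, y) = log2( p(x, y)^k / (p(x) p(y)) ), where k = 1 gives plain PMI; the abstract does not state the paper's exact formula, so this definition, the function name pmi_k, and the counts below are illustrative assumptions rather than the authors' implementation.

```python
import math

def pmi_k(count_xy, count_x, count_y, total, k=1):
    """Assumed form: log2( p(x,y)^k / (p(x) * p(y)) ); k=1 is plain PMI."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return k * math.log2(p_xy) - math.log2(p_x) - math.log2(p_y)

# Hypothetical counts from an imagined corpus of one million adjacent pairs.
N = 1_000_000
rare = dict(count_xy=1, count_x=1, count_y=1)             # seen once, e.g. a typo
common = dict(count_xy=1000, count_x=2000, count_y=2000)  # a frequent collocation

for k in (1, 3, 5):
    rare_score = pmi_k(**rare, total=N, k=k)
    common_score = pmi_k(**common, total=N, k=k)
    winner = "rare" if rare_score > common_score else "common"
    print(f"k={k}: rare={rare_score:.2f} common={common_score:.2f} -> {winner} ranks higher")
```

With these toy counts, plain PMI (k = 1) ranks the pair seen only once above the frequent collocation, while k = 3 and k = 5 reverse the order. This matches the direction of the effect the abstract describes, though it is only a single example and not the paper's proof.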


