Building a Bracketed Corpus Using Φ² Statistics

Int. J. Comput. Linguistics Chin. Lang. Process. Pub Date : 1997-08-01 DOI:10.30019/IJCLCLP.199708.0001

Yue-Shi Lee, Hsin-Hsi Chen

Abstract

Treebank-based research is ongoing for many natural language applications, but building a large-scale treebank is laborious and time-consuming, so speeding up treebank construction has become an important task. This paper proposes two versions of a probabilistic chunker to aid the development of a bracketed corpus. The basic version partitions part-of-speech sequences into chunk sequences, which form a partially bracketed corpus. The recursive version applies the chunking action recursively to generate a fully bracketed corpus. The training corpus is not a treebank; it is a corpus tagged with part-of-speech information only. Experimental results show that the probabilistic chunker achieves a correct rate of more than 94% in producing a partially bracketed corpus and gives very encouraging results in generating a fully bracketed corpus. Both versions are simple but effective and can also be applied to many other natural language applications.
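
The abstract does not spell out how the Φ² statistic drives the chunker, so the following is only a minimal sketch of one plausible reading, not the authors' method. It computes the standard Φ² association score from a 2×2 contingency table for each pair of adjacent part-of-speech tags, then places a chunk boundary wherever the score falls below a cutoff. The function names, the `threshold` parameter, the rule of scoring negatively associated pairs as zero, and the toy tag sequences are all assumptions made for illustration.

```python
from collections import Counter


def phi2_table(tagged_sentences):
    """Estimate phi-squared for every adjacent tag pair in a POS-tagged corpus.

    tagged_sentences: iterable of lists of POS tags (no treebank needed).
    """
    pair_counts = Counter()
    left_counts = Counter()
    right_counts = Counter()
    total = 0
    for tags in tagged_sentences:
        for left, right in zip(tags, tags[1:]):
            pair_counts[(left, right)] += 1
            left_counts[left] += 1
            right_counts[right] += 1
            total += 1

    scores = {}
    for (left, right), a in pair_counts.items():
        b = left_counts[left] - a    # left tag followed by a different tag
        c = right_counts[right] - a  # right tag preceded by a different tag
        d = total - a - b - c        # bigrams involving neither position
        denom = (a + b) * (a + c) * (b + d) * (c + d)
        # Assumption: pairs that co-occur less often than chance score 0,
        # because phi-squared alone does not carry the sign of association.
        if denom == 0 or a * d <= b * c:
            scores[(left, right)] = 0.0
        else:
            scores[(left, right)] = (a * d - b * c) ** 2 / denom
    return scores


def chunk(tags, scores, threshold):
    """Split a tag sequence where adjacent-tag association is below threshold."""
    if not tags:
        return []
    chunks = [[tags[0]]]
    for left, right in zip(tags, tags[1:]):
        if scores.get((left, right), 0.0) < threshold:
            chunks.append([right])      # weak association: start a new chunk
        else:
            chunks[-1].append(right)    # strong association: extend the chunk
    return chunks


if __name__ == "__main__":
    # Hypothetical toy corpus, tagged with parts of speech only.
    corpus = [
        ["DT", "NN", "VB", "DT", "NN"],
        ["DT", "NN", "VB"],
        ["PRP", "VB", "DT", "NN"],
    ]
    scores = phi2_table(corpus)
    print(chunk(["DT", "NN", "VB", "DT", "NN"], scores, threshold=0.8))
```

A recursive variant in the spirit of the abstract would treat each resulting chunk as a single unit and repeat the partitioning on the shorter sequence; how the paper scores chunk-level units is not stated here, so that step is left out of the sketch.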
