Word segmentation by letter successor varieties

Information Storage and Retrieval. Pub Date: 1974-11-01. Epub Date: 2002-08-28. DOI: 10.1016/0020-0271(74)90044-8

Margaret A. Hafer, Stephen F. Weiss

Citations: 238

Abstract

This paper describes a method for automatically segmenting words into their stems and affixes. The process uses certain statistical properties of a corpus (successor and predecessor letter variety counts) to indicate where words should be divided. Consequently, this process is less reliant on human intervention than are other methods for automated stemming.

The segmentation system is used to construct stem dictionaries for document classification. Information retrieval experiments are then performed using documents and queries so classified. Results show not only that this method is capable of high quality word segmentation, but also that its use in information retrieval produces results that are at least as good as those obtained using the more traditional stemming processes.
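As a rough illustration of the successor variety counts mentioned in the abstract, the Python sketch below counts, for each prefix of a word, how many distinct letters follow that prefix anywhere in a corpus, and then cuts the word where that count peaks. The function names, the toy corpus, and the strict local-peak cutting rule are assumptions made for this example; the abstract does not specify the decision rule, and end-of-word is not counted as a successor here.

```python
from collections import defaultdict


def successor_varieties(corpus):
    """Map each prefix seen in the corpus to the set of letters that follow it."""
    followers = defaultdict(set)
    for word in corpus:
        for i in range(1, len(word)):
            followers[word[:i]].add(word[i])
    return followers


def variety_profile(word, followers):
    """Successor variety count after each proper prefix of `word`."""
    return [len(followers.get(word[:i], ())) for i in range(1, len(word))]


def segment(word, followers):
    """Split `word` where the successor variety is a strict local peak.

    The peak rule is an assumption for illustration only.
    """
    counts = variety_profile(word, followers)
    cuts = []
    for i, count in enumerate(counts):
        # The first position has no left neighbour, so it can never be a peak.
        left = counts[i - 1] if i > 0 else count
        right = counts[i + 1] if i + 1 < len(counts) else 0
        if count > left and count > right:
            cuts.append(i + 1)  # cut after the prefix of length i + 1
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return pieces


# Hypothetical toy corpus, not taken from the paper.
corpus = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]
followers = successor_varieties(corpus)

print(variety_profile("readable", followers))  # [3, 2, 1, 3, 1, 1, 1]
print(segment("readable", followers))          # ['read', 'able']
print(segment("reading", followers))           # ['read', 'ing']
```

Predecessor variety counts can be obtained the same way by passing reversed words, for example `successor_varieties(w[::-1] for w in corpus)`. On a corpus this small the counts are only suggestive; a realistic run would need a much larger word list.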
