Optimizing an algorithm for finding sentence ends: applying linguistic methodology

ACM-SE 17, Pub Date: 1979-04-09, DOI: 10.1145/503506.503533

D. W. Coleman



Abstract

Most computer text editors are oriented around a line of raw text as entered. When moving text, a more natural unit is the sentence. APLATS is a locally written text editor which uses the sentence as the basic unit of text. [i] APLATS ends a sentence under either of two conditions. Under condition (a), a sentence ends at a "?" or "." followed by a blank, a format delimiter (single character), or a carriage return (as typed). Under condition (b), a format delimiter which forces the start of a new line of text also forces a sentence end.

The definition of a sentence end used by APLATS sometimes produces "excess" sentence divisions. At other times it fails to produce sentence divisions where they are normally expected. Linguistic methodology was applied in an attempt to determine whether the algorithm for finding sentence ends could be improved.

Condition (b) will only rarely produce an "incorrect" sentence boundary. Consider the type of case where it does: a sentence is interrupted at the end of a line, for example by a long quotation set off from the main body of the text, which separates the start of the sentence from the rest of it (or perhaps the quotation ends the sentence). Handling the long quotation as one or more independent "sentences" is probably preferable for editing anyway. Thus, condition (b) presents no major problems.

It is condition (a) which produces many "unexpected", and perhaps inconvenient, sentence breaks. In (1)-(4), for example, APLATS forces "excess" sentence ends at the points indicated by a "#". Further, in (5)-(6) it fails to produce sentence breaks at the points indicated by a "+", where sentence breaks would normally be expected. (Each of the sentences (1)-(13) is assumed to be extracted from a larger, unspecified context.)
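Conditions (a) and (b) as stated in the abstract can be illustrated with a minimal Python sketch. This is not the APLATS implementation. The abstract does not name the format delimiter characters, so "@" and "^" are hypothetical stand-ins, with "^" assumed to be the one that forces a new line. The function name `split_sentences` and the treatment of a trailing remainder as a final sentence are also assumptions.

```python
# Hypothetical characters: the abstract does not say which ones APLATS uses.
FORMAT_DELIMITERS = {"@", "^"}   # assumed single-character format delimiters
NEWLINE_DELIMITERS = {"^"}       # assumed subset that forces a new line of text
TERMINATORS = {"?", "."}         # the two terminators named in condition (a)

# Characters that may follow a terminator under condition (a):
# a blank, a carriage return as typed, or a format delimiter.
FOLLOWERS = {" ", "\n"} | FORMAT_DELIMITERS


def split_sentences(text):
    """Split text at the sentence ends defined by conditions (a) and (b)."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        cond_a = ch in TERMINATORS and nxt in FOLLOWERS
        cond_b = ch in NEWLINE_DELIMITERS
        if not (cond_a or cond_b):
            continue
        # Condition (a) keeps the terminator; condition (b) drops the delimiter.
        end = i if cond_b else i + 1
        piece = text[start:end].strip()
        if piece:
            sentences.append(piece)
        start = i + 1
    # Assumption: whatever follows the last boundary counts as a final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith arrived. He left."))
print(split_sentences("Title^Body text. More"))
```

In the first call the rule splits after "Dr." because the period is followed by a blank. An abbreviation is one plausible source of the "excess" divisions the abstract mentions, but it is an illustrative input, not one of the paper's examples (1)-(13).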
