Lexical attraction for text compression

Joscha Bach, I. Witten
{"title":"Lexical attraction for text compression","authors":"Joscha Bach, I. Witten","doi":"10.1109/DCC.1999.785673","DOIUrl":null,"url":null,"abstract":"[Summary form only given]. The best methods of text compression work by conditioning each symbol's probability on its predecessors. Prior symbols establish a context that governs the probability distribution for the next one, and the actual. The next symbol is encoded with respect to this distribution. However, the best predictors for words in natural language are not necessarily their immediate predecessors. Verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on \"wh\"-words. To establish a more appropriate dependency structure, the lexical attraction of a pair of words is defined as the likelihood that they will appear (in that order) within a sentence, regardless of how far apart they are. This is estimated by counting the co-occurrences of words in the sentences of a large corpus. Then, for each sentence, an undirected (planar, acydic) graph is found that maximizes the lexical attraction between linked items, effectively reorganizing the text in the form of a low-entropy model. We encode a series of linked sentences and transmit them in the same manner as order-1 word-level PPM. To prime the lexical attraction linker, the whole document is processed to acquire the co-occurrence counts, and again to re-link the sentences. Pairs that occur twice or less are excluded from the statistics, which significantly reduces the size of the model. The encoding stage utilizes an adaptive PPM-style method. Encouraging results have been obtained using this method.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1999.785673","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

[Summary form only given]. The best methods of text compression work by conditioning each symbol's probability on its predecessors. Prior symbols establish a context that governs the probability distribution for the next symbol, and the actual next symbol is encoded with respect to this distribution. However, the best predictors for words in natural language are not necessarily their immediate predecessors: verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on "wh"-words. To establish a more appropriate dependency structure, the lexical attraction of a pair of words is defined as the likelihood that they will appear (in that order) within a sentence, regardless of how far apart they are. This is estimated by counting the co-occurrences of words in the sentences of a large corpus. Then, for each sentence, an undirected (planar, acyclic) graph is found that maximizes the lexical attraction between linked items, effectively reorganizing the text in the form of a low-entropy model. We encode a series of linked sentences and transmit them in the same manner as order-1 word-level PPM. To prime the lexical attraction linker, the whole document is processed once to acquire the co-occurrence counts, and again to re-link the sentences. Pairs that occur twice or less are excluded from the statistics, which significantly reduces the size of the model. The encoding stage uses an adaptive PPM-style method. Encouraging results have been obtained with this method.
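
The following Python sketch illustrates the two preparatory steps described above: estimating lexical attraction from sentence-level co-occurrence counts (with pruning of pairs seen twice or less), and linking each sentence with a planar (non-crossing), acyclic set of high-attraction edges. The PMI-style score and the greedy linker are assumptions made for illustration only; the paper searches for a structure that maximizes total attraction, and its exact attraction measure may differ. The PPM-style encoding stage is not shown.

from collections import Counter
from itertools import combinations
import math


def count_cooccurrences(sentences):
    """Count ordered word pairs co-occurring within a sentence, at any
    distance, plus unigram counts, over a corpus of tokenized sentences."""
    pair_counts, word_counts = Counter(), Counter()
    for sent in sentences:
        word_counts.update(sent)
        for i, j in combinations(range(len(sent)), 2):
            pair_counts[(sent[i], sent[j])] += 1
    return pair_counts, word_counts


def prune(pair_counts, min_count=3):
    """Drop pairs seen twice or less, as in the abstract, to shrink the model."""
    return Counter({p: c for p, c in pair_counts.items() if c >= min_count})


def attraction(pair, pair_counts, word_counts, total_pairs, total_words):
    """Pointwise-mutual-information-style attraction score (an assumed
    stand-in; the paper's exact measure may differ)."""
    c = pair_counts.get(pair, 0)
    if c == 0:
        return 0.0
    x, y = pair
    p_pair = c / total_pairs
    p_x = word_counts[x] / total_words
    p_y = word_counts[y] / total_words
    return math.log2(p_pair / (p_x * p_y))


def link_sentence(sent, pair_counts, word_counts, total_pairs, total_words):
    """Greedily pick non-crossing, cycle-free links with positive attraction.
    The paper finds a structure maximizing total attraction; this greedy
    pass is only an approximation for illustration."""
    n = len(sent)
    scored = []
    for i, j in combinations(range(n), 2):
        s = attraction((sent[i], sent[j]), pair_counts, word_counts,
                       total_pairs, total_words)
        if s > 0:
            scored.append((s, i, j))
    scored.sort(reverse=True)

    parent = list(range(n))            # union-find keeps the graph acyclic

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    links = []

    def crosses(i, j):                 # planarity: arcs drawn above the words
        return any(a < i < b < j or i < a < j < b for a, b in links)

    for s, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj and not crosses(i, j):
            parent[ri] = rj
            links.append((i, j))
    return links


# Toy usage (hypothetical corpus; pruning only matters on a large corpus):
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
pairs, words = count_cooccurrences(corpus)
total_pairs, total_words = sum(pairs.values()), sum(words.values())
print(link_sentence(corpus[0], pairs, words, total_pairs, total_words))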