Lexical attraction for text compression

Joscha Bach, I. Witten
{"title":"Lexical attraction for text compression","authors":"Joscha Bach, I. Witten","doi":"10.1109/DCC.1999.785673","DOIUrl":null,"url":null,"abstract":"[Summary form only given]. The best methods of text compression work by conditioning each symbol's probability on its predecessors. Prior symbols establish a context that governs the probability distribution for the next one, and the actual. The next symbol is encoded with respect to this distribution. However, the best predictors for words in natural language are not necessarily their immediate predecessors. Verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on \"wh\"-words. To establish a more appropriate dependency structure, the lexical attraction of a pair of words is defined as the likelihood that they will appear (in that order) within a sentence, regardless of how far apart they are. This is estimated by counting the co-occurrences of words in the sentences of a large corpus. Then, for each sentence, an undirected (planar, acydic) graph is found that maximizes the lexical attraction between linked items, effectively reorganizing the text in the form of a low-entropy model. We encode a series of linked sentences and transmit them in the same manner as order-1 word-level PPM. To prime the lexical attraction linker, the whole document is processed to acquire the co-occurrence counts, and again to re-link the sentences. Pairs that occur twice or less are excluded from the statistics, which significantly reduces the size of the model. The encoding stage utilizes an adaptive PPM-style method. Encouraging results have been obtained using this method.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1999.785673","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

[Summary form only given]. The best methods of text compression work by conditioning each symbol's probability on its predecessors. Prior symbols establish a context that governs the probability distribution for the next symbol, and the actual next symbol is encoded with respect to this distribution. However, the best predictors for words in natural language are not necessarily their immediate predecessors: verbs may depend on nouns, pronouns on names, closing brackets on opening ones, question marks on "wh"-words. To establish a more appropriate dependency structure, the lexical attraction of a pair of words is defined as the likelihood that they will appear (in that order) within a sentence, regardless of how far apart they are. This is estimated by counting the co-occurrences of words in the sentences of a large corpus. Then, for each sentence, an undirected (planar, acyclic) graph is found that maximizes the lexical attraction between linked items, effectively reorganizing the text in the form of a low-entropy model. We encode a series of linked sentences and transmit them in the same manner as order-1 word-level PPM. To prime the lexical attraction linker, the whole document is processed once to acquire the co-occurrence counts, and again to re-link the sentences. Pairs that occur twice or less are excluded from the statistics, which significantly reduces the size of the model. The encoding stage uses an adaptive PPM-style method. Encouraging results have been obtained with this method.
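
The following Python sketch illustrates the two preparatory steps described above: estimating lexical attraction from sentence-level co-occurrence counts (with pruning of pairs seen twice or less), and linking each sentence with a planar (non-crossing), acyclic set of high-attraction edges. The PMI-style score and the greedy linker are assumptions made for illustration only; the paper searches for a structure that maximizes total attraction, and its exact attraction measure may differ. The PPM-style encoding stage is not shown.

from collections import Counter
from itertools import combinations
import math


def count_cooccurrences(sentences):
    """Count ordered word pairs co-occurring within a sentence, at any
    distance, plus unigram counts, over a corpus of tokenized sentences."""
    pair_counts, word_counts = Counter(), Counter()
    for sent in sentences:
        word_counts.update(sent)
        for i, j in combinations(range(len(sent)), 2):
            pair_counts[(sent[i], sent[j])] += 1
    return pair_counts, word_counts


def prune(pair_counts, min_count=3):
    """Drop pairs seen twice or less, as in the abstract, to shrink the model."""
    return Counter({p: c for p, c in pair_counts.items() if c >= min_count})


def attraction(pair, pair_counts, word_counts, total_pairs, total_words):
    """Pointwise-mutual-information-style attraction score (an assumed
    stand-in; the paper's exact measure may differ)."""
    c = pair_counts.get(pair, 0)
    if c == 0:
        return 0.0
    x, y = pair
    p_pair = c / total_pairs
    p_x = word_counts[x] / total_words
    p_y = word_counts[y] / total_words
    return math.log2(p_pair / (p_x * p_y))


def link_sentence(sent, pair_counts, word_counts, total_pairs, total_words):
    """Greedily pick non-crossing, cycle-free links with positive attraction.
    The paper finds a structure maximizing total attraction; this greedy
    pass is only an approximation for illustration."""
    n = len(sent)
    scored = []
    for i, j in combinations(range(n), 2):
        s = attraction((sent[i], sent[j]), pair_counts, word_counts,
                       total_pairs, total_words)
        if s > 0:
            scored.append((s, i, j))
    scored.sort(reverse=True)

    parent = list(range(n))            # union-find keeps the graph acyclic

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    links = []

    def crosses(i, j):                 # planarity: arcs drawn above the words
        return any(a < i < b < j or i < a < j < b for a, b in links)

    for s, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj and not crosses(i, j):
            parent[ri] = rj
            links.append((i, j))
    return links


# Toy usage (hypothetical corpus; pruning only matters on a large corpus):
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
pairs, words = count_cooccurrences(corpus)
total_pairs, total_words = sum(pairs.values()), sum(words.values())
print(link_sentence(corpus[0], pairs, words, total_pairs, total_words))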