Symbol ranking text compressors

P. Fenwick

Proceedings DCC '97. Data Compression Conference, 25 March 1997. DOI: 10.1109/DCC.1997.582093

Citations: 7

Abstract

Summary form only given. In 1951 Shannon estimated the entropy of English text by giving human subjects a sample of text and asking them to guess the next letters. He found, in one example, that 79% of the attempts were correct at the first try, 8% needed two attempts and 3% needed three attempts. By regarding the number of attempts as an information source he could estimate the language entropy. Shannon also stated that an "identical twin" to the original predictor could recover the original text, and these ideas are developed here to provide a new taxonomy of text compressors. In all cases these compressors recode the input into "rankings" of "most probable symbol", "next most probable symbol", and so on. The rankings have a very skew distribution (low entropy) and are processed by a conventional statistical compressor. Several "symbol ranking" compressors have appeared in the literature, though seldom under that name or even with reference to Shannon's work. The author has developed a compressor which uses constant-order contexts and is based on a set-associative cache with LRU update. A software implementation has run at about 1 Mbyte/s with an average compression of 3.6 bits/byte on the Calgary Corpus.
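To make the recoding step concrete, the sketch below shows a symbol-ranking transform in Python. It is not Fenwick's implementation: it assumes an order-1 context and a plain per-context LRU list in place of his set-associative cache, and it produces only the rank stream; in the scheme the abstract describes, those ranks would then be fed to a conventional statistical compressor, and a real codec would also have to transmit the literal byte whenever a novel symbol occurs.

from collections import defaultdict

def symbol_rank_encode(data: bytes) -> list[int]:
    """Recode each byte as its rank in the LRU list of its context.

    Sketch only: order-1 context and an unbounded per-context LRU list,
    not the paper's constant-order, set-associative cache.
    """
    lru = defaultdict(list)   # context byte -> symbols, most recent first
    ctx = 0                   # previous byte serves as the order-1 context
    ranks = []
    for b in data:
        lst = lru[ctx]
        if b in lst:
            r = lst.index(b)  # 0 = most probable (most recently seen)
            lst.pop(r)
        else:
            r = len(lst)      # novel symbol: an "escape" rank past the list end;
                              # a real codec would follow it with the literal byte
        lst.insert(0, b)      # LRU update: move the symbol to the front
        ranks.append(r)
        ctx = b
    return ranks

ranks = symbol_rank_encode(b"the theory of the thing")
print(ranks)  # mostly small ranks: the skew, low-entropy distribution
              # that the back-end statistical coder then compresses

Because repeated contexts keep ranking their usual successors near the front of the list, rank 0 dominates the output, which is exactly the skew distribution the abstract says makes the rank stream cheap to encode.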