{"title":"符号排序文本压缩器","authors":"P. Fenwick","doi":"10.1109/DCC.1997.582093","DOIUrl":null,"url":null,"abstract":"Summary form only given. In 1951 Shannon estimated the entropy of English text by giving human subjects a sample of text and asking them to guess the next letters. He found, in one example, that 79% of the attempts were correct at the first try, 8% needed two attempts and 3% needed 3 attempts. By regarding the number of attempts as an information source he could estimate the language entropy. Shannon also stated that an \"identical twin\" to the original predictor could recover the original text and these ideas are developed here to provide a new taxonomy of text compressors. In all cases these compressors recode the input into \"rankings\" of \"most probable symbol\", \"next most probable symbol\", and so on. The rankings have a very skew distribution (low entropy) and are processed by a conventional statistical compressor. Several \"symbol ranking\" compressors have appeared in the literature, though seldom with that name or even reference to Shannon's work. The author has developed a compressor which uses constant-order contexts and is based on a set-associative cache with LRU update. A software implementation has run at about 1 Mbyte/s with an average compression of 3.6 bits/byte on the Calgary Corpus.","PeriodicalId":403990,"journal":{"name":"Proceedings DCC '97. Data Compression Conference","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1997-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Symbol ranking text compressors\",\"authors\":\"P. Fenwick\",\"doi\":\"10.1109/DCC.1997.582093\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Summary form only given. In 1951 Shannon estimated the entropy of English text by giving human subjects a sample of text and asking them to guess the next letters. He found, in one example, that 79% of the attempts were correct at the first try, 8% needed two attempts and 3% needed 3 attempts. By regarding the number of attempts as an information source he could estimate the language entropy. Shannon also stated that an \\\"identical twin\\\" to the original predictor could recover the original text and these ideas are developed here to provide a new taxonomy of text compressors. In all cases these compressors recode the input into \\\"rankings\\\" of \\\"most probable symbol\\\", \\\"next most probable symbol\\\", and so on. The rankings have a very skew distribution (low entropy) and are processed by a conventional statistical compressor. Several \\\"symbol ranking\\\" compressors have appeared in the literature, though seldom with that name or even reference to Shannon's work. The author has developed a compressor which uses constant-order contexts and is based on a set-associative cache with LRU update. A software implementation has run at about 1 Mbyte/s with an average compression of 3.6 bits/byte on the Calgary Corpus.\",\"PeriodicalId\":403990,\"journal\":{\"name\":\"Proceedings DCC '97. Data Compression Conference\",\"volume\":\"20 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1997-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings DCC '97. 
Data Compression Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.1997.582093\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC '97. Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1997.582093","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Summary form only given. In 1951 Shannon estimated the entropy of English text by giving human subjects a sample of text and asking them to guess the next letters. He found, in one example, that 79% of the attempts were correct on the first try, 8% needed two attempts, and 3% needed three attempts. By regarding the number of attempts as an information source he could estimate the entropy of the language. Shannon also stated that an "identical twin" of the original predictor could recover the original text, and these ideas are developed here to provide a new taxonomy of text compressors. In all cases these compressors recode the input into "rankings" of "most probable symbol", "next most probable symbol", and so on. The rankings have a very skewed distribution (low entropy) and are processed by a conventional statistical compressor. Several "symbol ranking" compressors have appeared in the literature, though seldom under that name or with reference to Shannon's work. The author has developed a compressor that uses constant-order contexts and is based on a set-associative cache with LRU update. A software implementation has run at about 1 Mbyte/s with an average compression of 3.6 bits/byte on the Calgary Corpus.
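The sketch below is a minimal illustration of the symbol-ranking idea only, assuming an order-1 context and a plain per-context LRU list; it is not Fenwick's set-associative cache implementation. Encoder and decoder maintain identical state (Shannon's "identical twin"), so the rank stream plus escaped literals suffices to reconstruct the text, and the ranks are heavily skewed toward zero, ready for a conventional statistical coder.

```python
# Hypothetical illustration of symbol ranking with an order-1 context and
# per-context LRU lists. Not the paper's set-associative-cache design.
from collections import defaultdict

def encode(data: bytes):
    lru = defaultdict(list)              # context byte -> symbols, most recent first
    ctx = 0
    out = []                             # (rank, literal-or-None) pairs
    for sym in data:
        lst = lru[ctx]
        if sym in lst:
            rank = lst.index(sym)        # 0 = most probable symbol in this context
            lst.pop(rank)
            out.append((rank, None))
        else:
            out.append((len(lst), sym))  # novel symbol: escape rank plus literal
        lst.insert(0, sym)               # LRU update: move symbol to the front
        ctx = sym
    return out

def decode(pairs) -> bytes:
    lru = defaultdict(list)              # must evolve exactly as in the encoder
    ctx = 0
    data = bytearray()
    for rank, literal in pairs:
        lst = lru[ctx]
        sym = literal if literal is not None else lst.pop(rank)
        lst.insert(0, sym)
        data.append(sym)
        ctx = sym
    return bytes(data)

if __name__ == "__main__":
    text = b"the theory of the thing"
    pairs = encode(text)
    assert decode(pairs) == text
    print([r for r, _ in pairs])         # mostly small ranks -> low entropy
```

In a full compressor the rank stream would then be entropy-coded; the decoder stays in lockstep because it applies the same LRU updates after each symbol it recovers.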