Word-based compression methods for large text documents

J. Dvorský, J. Pokorný, V. Snás̃el
{"title":"大型文本文档的基于单词的压缩方法","authors":"J. Dvorský, J. Pokorný, V. Snás̃el","doi":"10.1109/DCC.1999.785680","DOIUrl":null,"url":null,"abstract":"Summary form only given. We present a new compression method, called WLZW, which is a word-based modification of classic LZW. The algorithm is two-phase, it uses only one table for words and non-words (so called tokens), and a single data structure for the lexicon is usable as a text index. The length of words and non-words is restricted. This feature improves the compress ratio achieved. Tokens of unlimited length alternate, when they are read from the input stream. Because of restricted length of tokens alternating of tokens is corrupted, because some tokens are divided into several parts of same type. To save alternating of tokens two special tokens are created. They are empty word and empty non-word. They contain no character. Empty word is inserted between two non-words and empty non-word between two words. Alternating of tokens is saved for all sequences of tokens. The alternating of tokens is an important piece of information. With this knowledge the kind of the next token can be predicted. One selected (so-called victim) non-word can be deleted from input stream. An algorithm to search the victim is also presented. In the decompression phase, a deleted victim is recognized as an error in alternating of words and non-words in sequence. The algorithm was tested on many texts in different formats (ASCII, RTF). The Canterbury corpus, a large set, was used as a standard for publication results. The compression ratio achieved is fairly good, on average 25%-22%. Decompression is very fast. Moreover, the algorithm enables evaluation of database queries in given text. This supports the idea of leaving data in the compressed state as long as possible, and to decompress it when it is necessary.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. 
PR00096)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"Word-based compression methods for large text documents\",\"authors\":\"J. Dvorský, J. Pokorný, V. Snás̃el\",\"doi\":\"10.1109/DCC.1999.785680\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Summary form only given. We present a new compression method, called WLZW, which is a word-based modification of classic LZW. The algorithm is two-phase, it uses only one table for words and non-words (so called tokens), and a single data structure for the lexicon is usable as a text index. The length of words and non-words is restricted. This feature improves the compress ratio achieved. Tokens of unlimited length alternate, when they are read from the input stream. Because of restricted length of tokens alternating of tokens is corrupted, because some tokens are divided into several parts of same type. To save alternating of tokens two special tokens are created. They are empty word and empty non-word. They contain no character. Empty word is inserted between two non-words and empty non-word between two words. Alternating of tokens is saved for all sequences of tokens. The alternating of tokens is an important piece of information. With this knowledge the kind of the next token can be predicted. One selected (so-called victim) non-word can be deleted from input stream. An algorithm to search the victim is also presented. In the decompression phase, a deleted victim is recognized as an error in alternating of words and non-words in sequence. The algorithm was tested on many texts in different formats (ASCII, RTF). The Canterbury corpus, a large set, was used as a standard for publication results. The compression ratio achieved is fairly good, on average 25%-22%. Decompression is very fast. 
Moreover, the algorithm enables evaluation of database queries in given text. This supports the idea of leaving data in the compressed state as long as possible, and to decompress it when it is necessary.\",\"PeriodicalId\":103598,\"journal\":{\"name\":\"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1999-03-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.1999.785680\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1999.785680","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 15

Abstract

Summary form only given. We present a new compression method, called WLZW, a word-based modification of classic LZW. The algorithm is two-phase; it uses a single table for both words and non-words (collectively called tokens), and the single data structure holding the lexicon can also serve as a text index. The length of words and non-words is restricted, which improves the compression ratio achieved. Tokens of unlimited length alternate as they are read from the input stream; with the length restriction, this alternation is broken because some tokens are divided into several parts of the same type. To preserve the alternation, two special tokens are introduced: the empty word and the empty non-word, each containing no characters. An empty word is inserted between two consecutive non-words, and an empty non-word between two consecutive words, so alternation is preserved for every sequence of tokens. The alternation of tokens is an important piece of information: with it, the type of the next token can be predicted. One selected non-word (the so-called victim) can be deleted from the input stream; an algorithm for choosing the victim is also presented. In the decompression phase, a deleted victim is recognized as a break in the alternation of words and non-words in the sequence. The algorithm was tested on many texts in different formats (ASCII, RTF). The large set of the Canterbury corpus was used as a standard for publishing results. The compression ratio achieved is fairly good, on average 22%-25%. Decompression is very fast. Moreover, the algorithm enables evaluation of database queries on the given text, which supports the idea of leaving data in the compressed state as long as possible and decompressing it only when necessary.
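The alternation-preserving tokenization and the word-based LZW scheme described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the alphanumeric definition of a word, the length limit of 4, and the names `tokenize` and `wlzw_compress` are all invented for the example.

```python
import re

MAX_TOKEN_LEN = 4  # illustrative limit; the paper restricts token length but fixes no value here

EMPTY_WORD = ("", "W")      # zero-character word, inserted between two non-words
EMPTY_NONWORD = ("", "N")   # zero-character non-word, inserted between two words

def tokenize(text):
    """Split text into a strictly alternating sequence of (chunk, kind) tokens.

    When the length limit splits a run into several pieces of the same type,
    an empty token of the opposite type is inserted between the pieces so
    that words ("W") and non-words ("N") still alternate.
    """
    tokens = []
    for match in re.finditer(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text):
        run = match.group()
        kind = "W" if run[0].isalnum() else "N"
        chunks = [run[i:i + MAX_TOKEN_LEN] for i in range(0, len(run), MAX_TOKEN_LEN)]
        for j, chunk in enumerate(chunks):
            if j > 0:
                # Two same-type pieces in a row: restore alternation with
                # an empty token of the opposite type.
                tokens.append(EMPTY_NONWORD if kind == "W" else EMPTY_WORD)
            tokens.append((chunk, kind))
    return tokens

def wlzw_compress(tokens):
    """LZW over tokens instead of characters, two-phase: the first pass
    seeds the dictionary with every distinct single token (the lexicon)."""
    dictionary = {}
    for t in tokens:                      # phase 1: build the initial lexicon
        dictionary.setdefault((t,), len(dictionary))
    codes = []
    phrase = ()
    for t in tokens:                      # phase 2: standard LZW parsing
        candidate = phrase + (t,)
        if candidate in dictionary:
            phrase = candidate
        else:
            codes.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)
            phrase = (t,)
    if phrase:
        codes.append(dictionary[phrase])
    return codes
```

For example, with the limit of 4, `tokenize("compression!")` splits the word "compression" into "comp", "ress", "ion" with empty non-words between the pieces, so the output still alternates word/non-word and a decoder can predict the type of every token from its position.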