Word-based compression methods for large text documents

J. Dvorský, J. Pokorný, V. Snás̃el
{"title":"大型文本文档的基于单词的压缩方法","authors":"J. Dvorský, J. Pokorný, V. Snás̃el","doi":"10.1109/DCC.1999.785680","DOIUrl":null,"url":null,"abstract":"Summary form only given. We present a new compression method, called WLZW, which is a word-based modification of classic LZW. The algorithm is two-phase, it uses only one table for words and non-words (so called tokens), and a single data structure for the lexicon is usable as a text index. The length of words and non-words is restricted. This feature improves the compress ratio achieved. Tokens of unlimited length alternate, when they are read from the input stream. Because of restricted length of tokens alternating of tokens is corrupted, because some tokens are divided into several parts of same type. To save alternating of tokens two special tokens are created. They are empty word and empty non-word. They contain no character. Empty word is inserted between two non-words and empty non-word between two words. Alternating of tokens is saved for all sequences of tokens. The alternating of tokens is an important piece of information. With this knowledge the kind of the next token can be predicted. One selected (so-called victim) non-word can be deleted from input stream. An algorithm to search the victim is also presented. In the decompression phase, a deleted victim is recognized as an error in alternating of words and non-words in sequence. The algorithm was tested on many texts in different formats (ASCII, RTF). The Canterbury corpus, a large set, was used as a standard for publication results. The compression ratio achieved is fairly good, on average 25%-22%. Decompression is very fast. Moreover, the algorithm enables evaluation of database queries in given text. This supports the idea of leaving data in the compressed state as long as possible, and to decompress it when it is necessary.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. 
PR00096)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":"{\"title\":\"Word-based compression methods for large text documents\",\"authors\":\"J. Dvorský, J. Pokorný, V. Snás̃el\",\"doi\":\"10.1109/DCC.1999.785680\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Summary form only given. We present a new compression method, called WLZW, which is a word-based modification of classic LZW. The algorithm is two-phase, it uses only one table for words and non-words (so called tokens), and a single data structure for the lexicon is usable as a text index. The length of words and non-words is restricted. This feature improves the compress ratio achieved. Tokens of unlimited length alternate, when they are read from the input stream. Because of restricted length of tokens alternating of tokens is corrupted, because some tokens are divided into several parts of same type. To save alternating of tokens two special tokens are created. They are empty word and empty non-word. They contain no character. Empty word is inserted between two non-words and empty non-word between two words. Alternating of tokens is saved for all sequences of tokens. The alternating of tokens is an important piece of information. With this knowledge the kind of the next token can be predicted. One selected (so-called victim) non-word can be deleted from input stream. An algorithm to search the victim is also presented. In the decompression phase, a deleted victim is recognized as an error in alternating of words and non-words in sequence. The algorithm was tested on many texts in different formats (ASCII, RTF). The Canterbury corpus, a large set, was used as a standard for publication results. The compression ratio achieved is fairly good, on average 25%-22%. Decompression is very fast. 
Moreover, the algorithm enables evaluation of database queries in given text. This supports the idea of leaving data in the compressed state as long as possible, and to decompress it when it is necessary.\",\"PeriodicalId\":103598,\"journal\":{\"name\":\"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)\",\"volume\":\"32 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1999-03-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"15\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.1999.785680\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.1999.785680","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 15

Abstract

Summary form only given. We present a new compression method, called WLZW, a word-based modification of classic LZW. The algorithm is two-phase; it uses a single table for both words and non-words (collectively called tokens), and the single data structure holding the lexicon can also serve as a text index. The length of words and non-words is restricted, which improves the compression ratio achieved. Tokens of unlimited length alternate as they are read from the input stream; with the length restriction, this alternation is broken because some tokens are divided into several parts of the same type. To preserve the alternation, two special tokens are introduced: the empty word and the empty non-word, each containing no characters. An empty word is inserted between two consecutive non-words, and an empty non-word between two consecutive words, so alternation is preserved for every sequence of tokens. The alternation of tokens is an important piece of information: with it, the type of the next token can be predicted. One selected non-word (the so-called victim) can be deleted from the input stream; an algorithm for choosing the victim is also presented. In the decompression phase, a deleted victim is recognized as a break in the alternation of words and non-words in the sequence. The algorithm was tested on many texts in different formats (ASCII, RTF). The large set of the Canterbury corpus was used as a standard for publishing results. The compression ratio achieved is fairly good, on average 22%-25%. Decompression is very fast. Moreover, the algorithm enables evaluation of database queries on the given text, which supports the idea of leaving data in the compressed state as long as possible and decompressing it only when necessary.
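The alternation-preserving tokenization and the word-based LZW scheme described in the abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the alphanumeric definition of a word, the length limit of 4, and the names `tokenize` and `wlzw_compress` are all invented for the example.

```python
import re

MAX_TOKEN_LEN = 4  # illustrative limit; the paper restricts token length but fixes no value here

EMPTY_WORD = ("", "W")      # zero-character word, inserted between two non-words
EMPTY_NONWORD = ("", "N")   # zero-character non-word, inserted between two words

def tokenize(text):
    """Split text into a strictly alternating sequence of (chunk, kind) tokens.

    When the length limit splits a run into several pieces of the same type,
    an empty token of the opposite type is inserted between the pieces so
    that words ("W") and non-words ("N") still alternate.
    """
    tokens = []
    for match in re.finditer(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text):
        run = match.group()
        kind = "W" if run[0].isalnum() else "N"
        chunks = [run[i:i + MAX_TOKEN_LEN] for i in range(0, len(run), MAX_TOKEN_LEN)]
        for j, chunk in enumerate(chunks):
            if j > 0:
                # Two same-type pieces in a row: restore alternation with
                # an empty token of the opposite type.
                tokens.append(EMPTY_NONWORD if kind == "W" else EMPTY_WORD)
            tokens.append((chunk, kind))
    return tokens

def wlzw_compress(tokens):
    """LZW over tokens instead of characters, two-phase: the first pass
    seeds the dictionary with every distinct single token (the lexicon)."""
    dictionary = {}
    for t in tokens:                      # phase 1: build the initial lexicon
        dictionary.setdefault((t,), len(dictionary))
    codes = []
    phrase = ()
    for t in tokens:                      # phase 2: standard LZW parsing
        candidate = phrase + (t,)
        if candidate in dictionary:
            phrase = candidate
        else:
            codes.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)
            phrase = (t,)
    if phrase:
        codes.append(dictionary[phrase])
    return codes
```

For example, with the limit of 4, `tokenize("compression!")` splits the word "compression" into "comp", "ress", "ion" with empty non-words between the pieces, so the output still alternates word/non-word and a decoder can predict the type of every token from its position.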