VQ-based model design algorithms for text compression
S.P. Kim, X. Ginesta
Proceedings DCC '95 Data Compression Conference, 1995-03-28
DOI: 10.1109/DCC.1995.515544
Summary form only given. We propose a new approach to text compression for settings where fast decoding matters more than fast encoding; an information retrieval system is one example of such a requirement. For efficient compression, the high-order conditional probability information of the text data is analyzed and modeled using the vector quantization (VQ) concept. VQ has generally been used for lossy compression, where the input symbol is not exactly recovered at the decoder, so it does not seem applicable to lossless text compression. However, VQ can be applied to the high-order conditional probability information itself, reducing the complexity of that information. We represent the conditional probability information of a source in a tree structure in which each first-level node is associated with its respective first-order conditional probability and each second-level node with a second-order conditional probability. Good text compression performance requires fourth- or higher-order conditional probability information, so it is essential that the model be simplified enough to be trained on a training set of reasonable size. We reduce the number of conditional probability tables and also discuss a semi-adaptive operating mode of the model in which the tree is derived through training but the actual probability information at each node is obtained adaptively from the input data. The performance of the proposed algorithm is comparable to or exceeds that of other methods, such as prediction by partial matching (PPM), while requiring less memory.
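The core idea (many high-order contexts sharing a few quantized probability tables) can be illustrated with a minimal sketch. This is not the authors' algorithm: it builds second-order context distributions from a sample string and merges them with a plain Lloyd-style VQ pass, whereas the paper works on a context tree and a semi-adaptive mode. All function names and the k-means-style merging step here are assumptions for illustration.

```python
# Hypothetical sketch: collapse many per-context conditional probability
# tables into a small VQ codebook of shared tables.
from collections import defaultdict, Counter

def context_distributions(text, order=2):
    """Map each length-`order` context to its next-symbol distribution."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        counts[text[i:i + order]][text[i + order]] += 1
    alphabet = sorted(set(text))
    dists = {}
    for ctx, c in counts.items():
        total = sum(c.values())
        dists[ctx] = [c[a] / total for a in alphabet]  # probability vector
    return dists, alphabet

def vq_merge(dists, k=4, iters=10):
    """Lloyd-style VQ: quantize the distribution vectors to k codebook tables."""
    def nearest(v, cents):
        return min(range(len(cents)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(v, cents[j])))
    vecs = list(dists.values())
    centroids = vecs[:k]  # naive initialization from the first k vectors
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for v in vecs:
            groups[nearest(v, centroids)].append(v)
        # move each centroid to the mean of its group (keep it if group is empty)
        centroids = [[sum(col) / len(g) for col in zip(*g)] if g else c
                     for g, c in zip(groups, centroids)]
    assignment = {ctx: nearest(v, centroids) for ctx, v in dists.items()}
    return assignment, centroids

dists, alphabet = context_distributions("abracadabra abracadabra")
assignment, codebook = vq_merge(dists, k=2)
# Every context now points at one of only 2 shared probability tables,
# instead of each context storing its own table.
```

The memory saving comes from the assignment map being far cheaper than a full per-context table when the alphabet is large, which mirrors the paper's motivation for applying VQ to the probability model rather than to the symbols themselves.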