Sorted Sliding Window Compression (SSWC) uses a new model, the Sorted Sliding Window Model (SSWM), to efficiently encode strings that reappear while encoding a symbol sequence. The SSWM holds statistics of all strings up to a certain length k in a sliding window of size n (the sliding window is defined as in LZ77). The compression program can use the SSWM to determine whether the string formed by the next symbols is already contained in the sliding window, and obtains the length of the match. The SSWM directly provides statistics (the borders of a subinterval within an interval) for use in entropy-coding methods such as arithmetic coding or Dense Coding [Gra97]. For a given number in an interval and the string length, the SSWM returns the corresponding string, which is used in decompression. After each encoding (decoding) step, the model is updated with the just-encoded (decoded) characters. The model sorts all string starting points in the sliding window lexicographically. A simple way to implement the SSWM is by exhaustive search in the sliding window; here, an implementation with a B-tree combined with special binary searches is used. SSWC is a simple compression scheme that uses this new model to evaluate its properties. It looks at the next characters to encode and determines the longest match with the SSWM. If the match is shorter than 2, the character is encoded directly; otherwise the length and the subinterval of the string are encoded. The length values are encoded together with the single characters using the same adaptive frequency model. Additionally, some rules are used to reduce the match length if the code length would otherwise get worse. Encoding of frequencies and intervals is done with Dense Coding. SSWC is on average better than gzip [Gai93] on the Calgary corpus: 0.2 to 0.5 bits per byte better on most files and at most 0.03 bits per byte worse (progc and progl). This demonstrates the quality of the approach and gives confidence in the usability of the SSWM as a new building block in models for compression. The SSWM has O(log k) computing complexity for all operations and needs O(n) space. The SSWM can be used to implement PPM or Markov models in limited-space environments because it holds all the necessary information.
{"title":"Sorted sliding window compression","authors":"U. Graf","doi":"10.1109/DCC.1999.785684","DOIUrl":"https://doi.org/10.1109/DCC.1999.785684","url":null,"abstract":"Sorted Sliding Window Compression (SSWC) uses a new model (Sorted Sliding Window Model | SSWM) to encode strings e cient, which appear again while encoding a symbol sequence. The SSWM holds statistics of all strings up to certain length k in a sliding window of size n (the sliding window is de ned like in lz77). The compression program can use the SSWM to determine if the string of the next symbols are already contained in the sliding window and returns the length of match. SSWM gives directly statistics (borders of subinterval in an interval) for use in entropy encoding methods like Arithmetic Coding or Dense Coding [Gra97]. For a given number in an interval and the string length the SSWM gives back the corresponding string which is used in decompressing. After an encoding (decoding) step the model is updated with the just encoded (decoded) characters. The Model sorts all string starting points in the sliding window lexicographically. A simple way to implement the SSWM is by exhaustive search in the sliding window. An implementation with a B-tree together with special binary searches is used here. SSWC is a simple compression scheme, which uses this new model to evaluate its properties. It looks on the next characters to encode and determines the longest match with the SSWM. If the match is smaller than 2, the character is encoded. Otherwise the length and the subinterval of the string are encoded. The length values are encoded together with the single characters by using the same adaptive frequency model. Additionally some rules are used to reduce the matching length if the code length get worse. Encoding of frequencies and intervals is done with Dense Coding. SSWC is in average better than gzip [Gai93] on the Calgary corpus: 0:2 0:5 bits-per-byte better on most les and at most 0:03 bits-per-byte worse (progc and progl). This proves the quality and gives con dence in the usability of SSWM as a new building block in models for compression. SSWM has O(log k) computing complexity on all operations and needs O(n) space. SSWM can be used to implement PPM or Markov models in limited space environments because it holds all necessary informations.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114568823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Three-dimensional (2D+T) wavelet coding of video using SPIHT has been shown to outperform standard predictive video coders on complex high-motion sequences, and is competitive with standard predictive video coders on simple low-motion sequences. However, on a number of typical moderate-motion sequences characterized by largely rigid motions, 3D SPIHT performs several dB worse than motion-compensated predictive coders because it does not take advantage of the real physical motion underlying the scene. We introduce global motion compensation for 3D subband video coders, and find 0.5 to 2 dB gain on sequences with dominant background motion. Our approach is a hybrid of video coding based on sprites, or mosaics, and subband coding.
{"title":"Three-dimensional wavelet coding of video with global motion compensation","authors":"Albert Wang, Zixiang Xiong, P. Chou, S. Mehrotra","doi":"10.1109/DCC.1999.755690","DOIUrl":"https://doi.org/10.1109/DCC.1999.755690","url":null,"abstract":"Three-dimensional (2D+T) wavelet coding of video using SPIHT has been shown to outperform standard predictive video coders on complex high-motion sequences, and is competitive with standard predictive video coders on simple low-motion sequences. However, on a number of typical moderate-motion sequences characterized by largely rigid motions, 3D SPIHT performs several dB worse than motion-compensated predictive coders, because it is does not take advantage of the real physical motion underlying the scene. We introduce global motion compensation for 3D subband video coders, and find 0.5 to 2 dB gain on sequences with dominant background motion. Our approach is a hybrid of video coding based on sprites, or mosaics, and subband coding.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130260346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data mining, a burgeoning new technology, is about looking for patterns in data. Likewise, text mining is about looking for patterns in text. Text mining is possible because you do not have to understand text in order to extract useful information from it. Here are four examples. First, if only names could be identified, links could be inserted automatically to other places that mention the same name, links that are "dynamically evaluated" by calling upon a search engine to bind them at click time. Second, actions can be associated with different types of data, using either explicit programming or programming-by-demonstration techniques. A day/time specification appearing anywhere within one's E-mail could be associated with diary actions such as updating a personal organizer or creating an automatic reminder, and each mention of a day/time in the text could raise a popup menu of calendar-based actions. Third, text could be mined for data in tabular format, allowing databases to be created from formatted tables such as stock-market information on Web pages. Fourth, an agent could monitor incoming newswire stories for company names and collect documents that mention them, an automated press clipping service. This paper aims to promote text compression as a key technology for text mining.
{"title":"Text mining: a new frontier for lossless compression","authors":"I. Witten, Zane Bray, M. Mahoui, W. Teahan","doi":"10.1109/DCC.1999.755669","DOIUrl":"https://doi.org/10.1109/DCC.1999.755669","url":null,"abstract":"Data mining, a burgeoning new technology, is about looking for patterns in data. Likewise, text mining is about looking for patterns in text. Text mining is possible because you do not have to understand text in order to extract useful information from it. Here are four examples. First, if only names could be identified, links could be inserted automatically to other places that mention the same name, links that are \"dynamically evaluated\" by calling upon a search engine to bind them at click time. Second, actions can be associated with different types of data, using either explicit programming or programming-by-demonstration techniques. A day/time specification appearing anywhere within one's E-mail could be associated with diary actions such as updating a personal organizer or creating an automatic reminder, and each mention of a day/time in the text could raise a popup menu of calendar-based actions. Third, text could be mined for data in tabular format, allowing databases to be created from formatted tables such as stock-market information on Web pages. Fourth, an agent could monitor incoming newswire stories for company names and collect documents that mention them, an automated press clipping service. This paper aims to promote text compression as a key technology for text mining.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130849680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Signal representations based on low-resolution quantization of redundant expansions are an interesting source coding paradigm, the most important practical case of which is oversampled A/D conversion. Signal reconstruction from quantized coefficients of a redundant expansion and the accuracy of representations of this kind are problems that are still not well understood; they are studied in this paper in finite-dimensional spaces. It has previously been proven that the accuracy of signal representations based on quantized redundant expansions, measured as the squared Euclidean norm of the reconstruction error, cannot be better than $O(1/r^2)$, where r is the expansion redundancy. We give some general conditions under which $1/r^2$ accuracy can be attained. We also suggest a form of structure for overcomplete families which facilitates reconstruction, and which enables efficient encoding of quantized coefficients with only a logarithmic increase of the bit rate in the redundancy.
{"title":"Source coding with quantized redundant expansions: accuracy and reconstruction","authors":"Z. Cvetković","doi":"10.1109/DCC.1999.755684","DOIUrl":"https://doi.org/10.1109/DCC.1999.755684","url":null,"abstract":"Signal representations based on low-resolution quantization of redundant expansions is an interesting source coding paradigm, the most important practical case of which is oversampled A/D conversion. Signal reconstruction from quantized coefficients of a redundant expansion and accuracy of representations of this kind are problems which are still not well understood and these are studied in this paper in finite dimensional spaces. It has been previously proven that accuracy of signal representations based on quantized redundant expansions, measured as the squared Euclidean norm of the reconstruction error, cannot be better than O(1/(r/sup 2/)), where r is the expansion redundancy. We give some general conditions under which 1/(r/sup 2/) accuracy can be attained. We also suggest a form of structure for overcomplete families which facilitates reconstruction, and which enables efficient encoding of quantized coefficients with a logarithmic increase of the bit-rate in redundancy.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128812773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Here we consider a theoretical evaluation of data compression algorithms based on the Burrows Wheeler transform (BWT). The main contributions include a variety of very simple new techniques for BWT-based universal lossless source coding on finite-memory sources and a set of new rate-of-convergence results for BWT-based source codes. The result is a theoretical validation and quantification of the earlier experimental observation that BWT-based lossless source codes give performance better than that of Ziv-Lempel-style codes and almost as good as that of prediction by partial matching (PPM) algorithms.
{"title":"Universal lossless source coding with the Burrows Wheeler transform","authors":"M. Effros, Karthik Venkat Ramanan, S. R. Kulkarni, S. Verdú","doi":"10.1109/DCC.1999.755667","DOIUrl":"https://doi.org/10.1109/DCC.1999.755667","url":null,"abstract":"We here consider a theoretical evaluation of data compression algorithms based on the Burrows Wheeler transform (BWT). The main contributions include a variety of very simple new techniques for BWT-based universal lossless source coding on finite-memory sources and a set of new rate of convergence results for BWT-based source codes. The result is a theoretical validation and quantification of the earlier experimental observation that BWT-based lossless source codes give performance better than that of Ziv-Lempel-style codes and almost as good as that of prediction by partial mapping (PPM) algorithms.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132678213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We design an edge-adaptive predictor for lossless image coding. The predictor adaptively weights a four-directional predictor together with an adaptive linear predictor based on information from neighbouring pixels. Although conceptually simple, the performance of the resulting coder is comparable to state-of-the-art image coders when a simple context-based coder is used to encode the prediction errors.
{"title":"Edge-adaptive prediction for lossless image coding","authors":"Wee Sun Lee","doi":"10.1109/DCC.1999.755698","DOIUrl":"https://doi.org/10.1109/DCC.1999.755698","url":null,"abstract":"We design an edge-adaptive predictor for lossless image coding. The predictor adaptively weights a four-directional predictor together with an adaptive linear predictor based on information from neighbouring pixels. Although conceptually simple, the performance of the resulting coder is comparable to state-of-the-art image coders when a simple context-based coder is used to encode the prediction errors.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134257490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary form only given. We present an architecture for digital HDTV video decoding (MPEG-2 MP@HL), based on dual decoding data paths controlled in a block-layer synchronization manner and an efficient write-back scheme. Our fixed-schedule controller synchronizes the baseline units on a block basis in both data paths. This scheme reduces the embedded buffer sizes within the decoder and eliminates many external memory bus contentions. In our write-back scheme, the display DRAM is physically separated from the anchor-picture DRAM and is added to the display engine, not to the bus. The slight increase in overall DRAM size is acceptable given the low cost of DRAM today. This improves the parallelism in accessing anchor and display pictures and saves about 80 clock cycles per macroblock (based on an 81 MHz clock). Compared to other decoding approaches such as the slice-bar decoding method and the crossing-divided method, this scheme reduces memory access contentions and the amount of embedded local memory required. Our simulations show that with a relatively low-speed 81 MHz clock, our architecture uses fewer than 332 cycles (the upper bound required for real-time decoding) to decode each macroblock, without a high cost in overall chip area.
{"title":"A novel dual-path architecture for HDTV video decoding","authors":"N. Wang, N. Ling","doi":"10.1109/DCC.1999.785714","DOIUrl":"https://doi.org/10.1109/DCC.1999.785714","url":null,"abstract":"Summary form only given. We present an architecture for digital HDTV video decoding (MPEG-2 MP@HL), based on dual decoding data paths controlled in a block layer synchronization manner and an efficient write back scheme. Our fixed schedule controller synchronizes the baseline units on a block basis in both data-paths. This scheme reduces embedded buffer sizes within the decoder and eliminates a lot of external memory bus contentions. In our write back scheme, the display DRAM is physically separated from the anchor picture DRAM, and is added to the display engine, not to the bus. The slight increase in overall DRAM size is acceptable due to the low DRAM cost today. This improves the parallelism in accessing anchor and display pictures and saves about 80 clock cycles per macroblock (based on a 81 MHz clock). Compared to the other decoding approaches such as the slice bar decoding method and the crossing-divided method, this scheme reduces memory access contentions and the amount of embedded local memory required. Our simulations show that with a relatively low speed 81 MHz clock, our architecture uses fewer than the 332 cycles (required real-time decoding upper bound), to decode each macroblock, without a high cost in overall chip area.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"2007 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127300052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary form only given. The Joint Bi-level Image Experts Group (JBIG), an international study group affiliated with ISO/IEC and ITU-T, has recently completed a committee draft of the JBIG2 standard for lossy and lossless bi-level image compression. We study design considerations for a purely lossless encoder. First, we outline the JBIG2 bitstream, focusing on the options and parameters available to an encoder. Then, we present numerous lossless encoder design strategies, including lossy-to-lossless coding approaches. For each strategy, we determine the compression performance, and the execution times for both encoding and decoding. The strategy that achieved the highest compression performance in our experiment used a double dictionary approach, with a residue cleanup. In this strategy, small and unique symbols were coded as a generic region residue. Only repeated symbols or those used as a basis for soft matches were added to a dictionary, with the remaining symbols embedded as refinements in the symbol region segment. The second dictionary was encoded as a refinement-aggregate dictionary, where dictionary symbols were encoded as refinements of symbols from the first dictionary, or previous entries in the second dictionary. With all other bitstream parameters optimized, this strategy can easily achieve an additional 30% compression over simpler symbol dictionary approaches. Next, we continue the experiment with an evaluation of each of the bitstream options and configuration parameters, and their impact on complexity and compression. We also demonstrate the consequences of choosing incorrect parameters. We conclude with a summary of our compression results, and general recommendations for encoder designers.
{"title":"Lossless JBIG2 coding performance","authors":"D. Tompkins, F. Kossentini","doi":"10.1109/DCC.1999.785710","DOIUrl":"https://doi.org/10.1109/DCC.1999.785710","url":null,"abstract":"Summary form only given. The Joint Bi-Level Expert Group (JBIG), an international study group affiliated with the ISO/IEC and ITU-T, has recently completed a committee draft of the JBIG2 standard for lossy and lossless bi-level image compression. We study design considerations for a purely lossless encoder. First, we outline the JBIG2 bitstream, focusing on the options and parameters available to an encoder. Then, we present numerous lossless encoder design strategies, including lossy to lossless coding approaches. For each strategy, we determine the compression performance, and the execution times for both encoding and decoding. The strategy that achieved the highest compression performance in our experiment used a double dictionary approach, with a residue cleanup. In this strategy, small and unique symbols were coded as a generic region residue. Only repeated symbols or those used as a basis for soft matches were added to a dictionary, with the remaining symbols embedded as refinements in the symbol region segment. The second dictionary was encoded as a refinement-aggregate dictionary, where dictionary symbols were encoded as refinements of symbols from the first dictionary, or previous entries in the second dictionary. With all other bitstream parameters optimized, this strategy can easily achieve an additional 30% compression over simpler symbol dictionary approaches. Next, we continue the experiment with an evaluation of each of the bitstream options and configuration parameters, and their impact on complexity and compression. We also demonstrate the consequences of choosing incorrect parameters. We conclude with a summary of our compression results, and general recommendations for encoder designers.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123037559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
It is well known that Shannon's separation result does not hold under finite computation or finite delay constraints, thus joint source-channel coding is of great interest for practical reasons. For progressive source-channel coding systems, efficient codes have been proposed for feedforward channels and the important problem of rate allocation between the source and channel codes has been solved. For memoryless channels with feedback, the rate allocation problem was studied by Chande et al. (1998). In this paper, we consider the case of fading channels with feedback. Feedback routes are provided in many existing standard wireless channels, making rate allocation with feedback a problem of considerable practical importance. We address the question of rate allocation between the source and channel codes in the forward channel, in the presence of feedback information and under a distortion cost function. We show that the presence of feedback shifts the optimal rate allocation point, resulting in higher rates for error-correcting codes and smaller overall distortion. Simulations on both memoryless and fading channels show that the presence of feedback allows up to 1 dB improvement in PSNR compared to the similarly optimized feedforward scheme.
{"title":"Progressive joint source-channel coding in feedback channels","authors":"Jin Lu, Aria Nosratinia, B. Aazhang","doi":"10.1109/DCC.1999.755663","DOIUrl":"https://doi.org/10.1109/DCC.1999.755663","url":null,"abstract":"It is well known that Shannon's separation result does not hold under finite computation or finite delay constraints, thus joint source-channel coding is of great interest for practical reasons. For progressive source-channel coding systems, efficient codes have been proposed for feedforward channels and the important problem of rate allocation between the source and channel codes has been solved. For memoryless channels with feedback, the rate allocation problem was studied by Chande et al. (1998). In this paper, we consider the case of fading channels with feedback. Feedback routes are provided in many existing standard wireless channels, making rate allocation with feedback a problem of considerable practical importance. We address the question of rate allocation between the source and channel codes in the forward channel, in the presence of feedback information and under a distortion cost function. We show that the presence of feedback shifts the optimal rate allocation point, resulting in higher rates for error-correcting codes and smaller overall distortion. Simulations on both memoryless and fading channels show that the presence of feedback allows up to 1 dB improvement in PSNR compared to the similarly optimized feedforward scheme.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"159-160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122126653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary form only given. A mini-corpus of twelve 'calibrated' binary-data files has been produced for systematic evaluation of compression algorithms. These are generated within the framework of a deterministic theory of string complexity. Here the T-complexity of a string x (measured in taugs) is defined as $C_T(x) = \sum_i \log_2(k_i + 1)$, where the positive integers $k_i$ are the T-expansion parameters for the corresponding string production process. $C_T(x)$ is observed to be the logarithmic integral of the total information content $I_x$ of x (measured in nats), i.e., $C_T(x) = \mathrm{li}(I_x)$. The average entropy is $\bar{H}_x = I_x/|x|$, i.e., the total information content divided by the length of x; thus $C_T(x) = \mathrm{li}(\bar{H}_x \cdot |x|)$. Alternatively, the information rate along a string may be described by an entropy function $H_x(n)$, $0 \le n \le |x|$, for the string. Assuming that $H_x(n)$ is continuously integrable along the length of x, then $I_x = \int_0^{|x|} H_x(n)\,dn$, and thus $C_T(x) = \mathrm{li}\bigl(\int_0^{|x|} H_x(n)\,dn\bigr)$. Solving for $H_x(n)$, that is, differentiating both sides and rearranging, we get $H_x(n) = \frac{d\,C_T(x|n)}{dn} \cdot \log_e\bigl(\mathrm{li}^{-1}(C_T(x|n))\bigr)$. With x being in fact discrete, and the T-complexity function being computed in terms of the discrete T-augmentation steps, we may accordingly re-express the equation in terms of the T-prefix increments: $\delta n \approx \Delta_i|x| = k_i |p_i|$; and from the definition of $C_T(x)$, $\delta C_T(x)$ is replaced by $\Delta_i C_T(x) = \log_2(k_i + 1)$. The average slope over the i-th T-prefix increment $p_i$ is then simply $\frac{\Delta_i C_T(x)}{\Delta_i |x|} = \frac{\log_2(k_i + 1)}{k_i |p_i|}$. The entropy function is now replaced by this discrete approximation.
{"title":"Towards a calibrated corpus for compression testing","authors":"M. Titchener, P. Fenwick, M. C. Chen","doi":"10.1109/DCC.1999.785711","DOIUrl":"https://doi.org/10.1109/DCC.1999.785711","url":null,"abstract":"Summary form only given. A mini-corpus of twelve 'calibrated' binary-data files have been produced for systematic evaluation of compression algorithms. These are generated within the framework of a deterministic theory of string complexity. Here the T-complexity of a string x (measured in taugs) is defined as C/sub T/(x/sub i/)=/spl Sigma//sub i/log/sub 2/(k/sub i/+1), where the positive integers k/sub i/ are the T-expansion parameters for the corresponding string production process. C/sub T/(x) is observed to be the logarithmic integral of the total information content I/sub x/ of x (measured in nats), i.e., C/sub T/(x)=li(I/sub x/). The average entropy is H~/sub x/=I/sub x//|x|, i.e., the total information content divided by the length of x. Thus C/sub T/(x)=li(H~/sub x//spl times/|x|). Alternatively, the information rate along a string may be described by an entropy function H/sub x/(n),0/spl les/n/spl les/|x| for the string. Assuming that H/sub x/(n) is continuously integrable along the length of the x, then I/sub x/=/spl int//sub 0//sup |/x|H/sub x/(n)/spl delta/n. Thus C/sub T/(x)=li(/spl int//sub 0//sup |/x|H/sub x/(n)/spl delta/n). Solving for H/sub x/(n): that is differentiating both sides and rearranging, we get: H/sub x/(n)=(/spl delta/C/sub T/(x|n)//spl delta/n)/spl times/log/sub e/(li/sup -1/(C/sub T/(x|/sub n/))). With x being in fact discrete, and the T-complexity function being computed in terms of the discrete T-augmentation steps, we may accordingly re-express the equation in terms of the T-prefix increments: /spl delta/n/spl ap//spl Delta//sub i/|x|=k/sub i/|p/sub i/|; and from the definition of C/sub T/(x): /spl delta/C/sub T/(x) is replaced by /spl Delta//sub i/C/sub T/(x)=log/sub 2/(k/sub i/+1). The average slope over the i-th T-prefix p/sub i/ increment is then simply (/spl Delta//sub i/C/sub T/(x))/(/spl Delta//sub i/|x|)=(log/sub 2/(k/sub i/+1))/(k/sub i/|p/sub i/|). The entropy function is now replaced by a discrete approximation.","PeriodicalId":103598,"journal":{"name":"Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125258810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}