Towards a calibrated corpus for compression testing
Authors: M. Titchener, P. Fenwick, M. C. Chen
Published in: Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096), 1999-03-29
DOI: 10.1109/DCC.1999.785711 (https://doi.org/10.1109/DCC.1999.785711)
Citations: 6
Abstract
Summary form only given. A mini-corpus of twelve 'calibrated' binary-data files has been produced for the systematic evaluation of compression algorithms. The files are generated within the framework of a deterministic theory of string complexity. Here the T-complexity of a string $x$ (measured in taugs) is defined as $C_T(x) = \sum_i \log_2(k_i + 1)$, where the positive integers $k_i$ are the T-expansion parameters for the corresponding string production process. $C_T(x)$ is observed to be the logarithmic integral of the total information content $I_x$ of $x$ (measured in nats), i.e., $C_T(x) = \mathrm{li}(I_x)$. The average entropy is $\tilde{H}_x = I_x / |x|$, i.e., the total information content divided by the length of $x$; thus $C_T(x) = \mathrm{li}(\tilde{H}_x \times |x|)$. Alternatively, the information rate along a string may be described by an entropy function $H_x(n)$, $0 \le n \le |x|$. Assuming that $H_x(n)$ is continuously integrable along the length of $x$, then $I_x = \int_0^{|x|} H_x(n)\,\delta n$, and thus $C_T(x) = \mathrm{li}\left(\int_0^{|x|} H_x(n)\,\delta n\right)$. Solving for $H_x(n)$, that is, differentiating both sides and rearranging, gives $H_x(n) = \frac{\delta C_T(x|_n)}{\delta n} \times \log_e\left(\mathrm{li}^{-1}(C_T(x|_n))\right)$. Because $x$ is in fact discrete, and the T-complexity function is computed in terms of the discrete T-augmentation steps, the equation may be re-expressed in terms of the T-prefix increments: $\delta n \approx \Delta_i |x| = k_i |p_i|$, and, from the definition of $C_T(x)$, $\delta C_T(x)$ is replaced by $\Delta_i C_T(x) = \log_2(k_i + 1)$. The average slope over the $i$-th T-prefix $p_i$ increment is then simply $\frac{\Delta_i C_T(x)}{\Delta_i |x|} = \frac{\log_2(k_i + 1)}{k_i |p_i|}$. The entropy function is now replaced by a discrete approximation.
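The discrete approximation can be illustrated numerically. The rearrangement in the abstract relies on the standard identity $\frac{d}{du}\,\mathrm{li}(u) = 1/\ln u$, which is why a factor of $\log_e(\mathrm{li}^{-1}(C_T))$ appears. The Python below is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`ei`, `ln_li_inverse`, `discrete_entropy`) and the sample `(k_i, |p_i|)` pairs are hypothetical. The logarithmic integral is evaluated through the standard power series for the exponential integral, and its inverse is found by bisection. The sketch also assumes that $C_T$ is taken as the cumulative value at the end of each increment, a choice the abstract does not spell out.

```python
import math

EULER_GAMMA = 0.5772156649015329


def ei(u: float) -> float:
    """Exponential integral Ei(u) for u > 0, so that li(x) = Ei(ln x) for x > 1.

    Uses the series Ei(u) = gamma + ln(u) + sum_{k>=1} u^k / (k * k!).
    """
    if u <= 0.0:
        raise ValueError("this sketch only handles u > 0 (i.e. li(x) for x > 1)")
    total = EULER_GAMMA + math.log(u)
    term = 1.0  # holds u^k / k!
    for k in range(1, 2000):
        term *= u / k
        total += term / k
        if term / k < 1e-16 * abs(total):
            break
    return total


def ln_li_inverse(c: float) -> float:
    """Return log_e(li^{-1}(c)) by bisection on u = ln(y), where li(y) = c.

    Assumes c > 0, which holds for any non-empty sum of log2(k_i + 1) terms.
    """
    lo, hi = 1e-12, 1.0
    while ei(hi) < c:  # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if ei(mid) < c:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_entropy(steps):
    """Discrete entropy estimate per T-prefix increment.

    steps: iterable of (k_i, |p_i|) pairs (hypothetical input format).
    Returns a list of (cumulative length, cumulative C_T, entropy estimate).
    """
    rows = []
    c_total = 0.0
    n = 0
    for k, p_len in steps:
        delta_c = math.log2(k + 1)   # Delta_i C_T(x) = log2(k_i + 1)
        delta_n = k * p_len          # Delta_i |x|   = k_i |p_i|
        c_total += delta_c
        n += delta_n
        slope = delta_c / delta_n    # average slope over the i-th increment
        rows.append((n, c_total, slope * ln_li_inverse(c_total)))
    return rows


if __name__ == "__main__":
    # Illustrative parameters only; not taken from a real T-decomposition.
    for n, c_t, h in discrete_entropy([(1, 1), (3, 2), (1, 8), (2, 16)]):
        print(f"n={n:3d}  C_T={c_t:.4f} taugs  H~{h:.4f}")
```

Working with $u = \ln y$ rather than $y$ itself returns the logarithm the formula needs directly and avoids overflow when $C_T$ is large.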