Compression for Similarity Identification: Computing the Error Exponent.

Proceedings. Data Compression Conference. Pub Date: 2015-04-01. Epub Date: 2015-07-06. DOI: 10.1109/DCC.2015.75
Amir Ingber, Tsachy Weissman
Volume 2015, pages 413-422.

Abstract

We consider the problem of compressing discrete memoryless data sequences for the purpose of similarity identification, first studied by Ahlswede et al. (1997). In this setting, a source sequence is compressed so that it remains possible to identify whether the original source sequence is similar to another given sequence (called the query sequence). The source need not be reproducible from the compressed version. In the case where no false negatives are allowed, a compression scheme is said to be reliable if the probability of error (a false positive) vanishes as the sequence length grows. The minimal compression rate in this sense, which parallels the classical rate-distortion function, is called the identification rate. The speed at which the error probability vanishes is measured by its exponent, called the identification exponent (the analog of the classical excess-distortion exponent). While an information-theoretic expression for the identification exponent was found in past work, it is uncomputable because it depends on an auxiliary random variable with unbounded cardinality. The main result of this paper is a cardinality bound on the auxiliary random variable in the identification exponent, which makes the quantity computable (solving the problem that was left open by Ahlswede et al.). The new proof technique relies on the fact that the Lagrangian of the optimization problem in the expression for the exponent can be decomposed by coordinate of the auxiliary random variable. A standard Carathéodory-style argument then completes the proof.
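To make the "no false negatives" requirement concrete, here is a minimal toy sketch. It is not the scheme analyzed in the paper. It assumes binary sequences, Hamming distance as the similarity measure, and a deliberately crude signature (the number of ones). All function names and parameters are hypothetical and chosen for illustration only.

```python
import random


def compress(x):
    """Toy signature: the number of ones in x (about log2(n + 1) bits)."""
    return sum(x)


def maybe_similar(signature, y, max_dist):
    """Answer 'maybe' (True) or 'no' (False) from the signature and query alone.

    The Hamming distance between two binary sequences is at least the
    difference of their weights, so 'no' is returned only when the source
    is certainly farther than max_dist from the query.
    """
    return abs(signature - sum(y)) <= max_dist


def hamming(x, y):
    """Number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def count_errors(n=32, max_dist=4, trials=2000, seed=0):
    """Count false negatives and false positives on random sequence pairs."""
    rng = random.Random(seed)
    false_neg = false_pos = 0
    for _ in range(trials):
        x = [rng.randint(0, 1) for _ in range(n)]
        y = [rng.randint(0, 1) for _ in range(n)]
        similar = hamming(x, y) <= max_dist
        answer = maybe_similar(compress(x), y, max_dist)
        false_neg += similar and not answer
        false_pos += answer and not similar
    return false_neg, false_pos
```

By construction the false-negative count returned by `count_errors` is always zero, while false positives can occur because the signature discards most of the source. The abstract's identification rate and identification exponent concern how small the signature can be, and how fast the false-positive probability can decay, under this same one-sided constraint.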
