Key Compression Limits for $k$-Minimum Value Sketches
Charlie Dickens, Eric Bax, Alexander Saydakov
arXiv - MATH - Information Theory, 2024-09-04, arXiv:2409.02852

The $k$-Minimum Values (KMV) data sketch algorithm stores the $k$ least hash
keys generated by hashing the items in a dataset. We show that compression
based on ordering the keys and encoding successive differences can offer
$O(\log n)$ bits per key in expected storage savings, where $n$ is the number
of unique values in the dataset. We also show that $O(\log n)$ expected bits
saved per key is optimal for any form of compression for the $k$ least of $n$
random values; that is, the encoding method is near-optimal among all methods to
encode a KMV sketch. We present a practical method to perform that
compression, show that it is computationally efficient, and demonstrate that
its average savings in practice is within about five percent of the theoretical
minimum based on entropy. We verify that our method outperforms off-the-shelf
compression methods, and we demonstrate that it is practical, using real and
synthetic data.
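
To make "ordering the keys and encoding successive differences" concrete, the following is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not the authors' implementation: the 64-bit hash (the first 8 bytes of SHA-256), the function names, and the fixed-width bit count used to estimate savings are all hypothetical choices made for clarity. The practical encoder described in the abstract is not reproduced here.

```python
import hashlib
import heapq

HASH_BITS = 64  # assumed key width; the abstract does not fix one


def hash_key(item: str) -> int:
    """Map an item to a 64-bit key (assumption: first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def kmv_sketch(items, k):
    """Return the k smallest distinct hash keys in ascending order.

    For simplicity this hashes everything into a set first; a streaming
    sketch would keep only the k smallest keys seen so far.
    """
    keys = {hash_key(x) for x in items}
    return heapq.nsmallest(k, keys)


def delta_encode(sorted_keys):
    """Replace each key by its difference from the previous key."""
    deltas = []
    prev = 0
    for key in sorted_keys:
        deltas.append(key - prev)
        prev = key
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by taking running sums."""
    keys = []
    total = 0
    for d in deltas:
        total += d
        keys.append(total)
    return keys


def fixed_width_bits(deltas):
    """Bits per key if every delta is stored at the width of the largest one.

    This is a deliberately crude stand-in for a real entropy-aware encoder.
    """
    return max(d.bit_length() for d in deltas)


if __name__ == "__main__":
    items = [f"item-{i}" for i in range(100_000)]  # n = 100,000 unique values
    sketch = kmv_sketch(items, k=1024)
    deltas = delta_encode(sketch)
    width = fixed_width_bits(deltas)
    print(f"bits per key: {HASH_BITS} raw, {width} delta-encoded")
    print(f"approximate savings per key: {HASH_BITS - width} bits")
```

The intuition is that the $k$ least of $n$ roughly uniform keys sit close together near zero, so their gaps are about a factor of $n$ smaller than the full key range and need on the order of $\log n$ fewer bits each. Under the assumptions above, the printed savings should grow as the number of unique items increases.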