Compressing Tabular Data via Pairwise Dependencies.

Proceedings. Data Compression Conference, vol. 2017, p. 455. Pub Date: 2017-04-01. Epub Date: 2017-05-11. DOI: 10.1109/DCC.2017.82

Dmitri S Pavlichin, Amir Ingber, Tsachy Weissman


Citations: 1



We propose a method and algorithm for lossless compression of tabular data, including, for example, machine learning datasets, server logs and genomic datasets. Superior compression ratios are achieved by exploiting dependencies between the fields (or "features") in the dataset. The algorithm compresses the records with respect to a probabilistic graphical model, specifically an optimized forest in which each feature is a node. The work extends the method known as the Chow-Liu tree by incorporating a more accurate correction term into the cost function; this term corresponds to the size required to describe the model itself. Additional features of the algorithm are efficient coding of the metadata (such as probability distributions) and data relabeling to cope with large datasets and alphabets. We test the algorithm on several datasets and demonstrate an improvement in compression rates of between 2x and 5x compared to gzip. The larger improvements are observed for very large datasets, such as the Criteo click prediction dataset, which was published as part of a recent Kaggle competition.
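The forest construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give their correction term, so the code below substitutes a generic MDL/BIC-style penalty of (1/2) log2(n) bits per extra model parameter, which is an assumption. The names `dependency_forest`, `mutual_information_bits` and `bits_per_param` are hypothetical. An edge between two features is kept only if the bits it saves on the data (n times the empirical mutual information) exceed the assumed cost of describing the extra conditional table, and a Kruskal-style pass then selects a maximum-weight forest.

```python
import math
from collections import Counter
from itertools import combinations


def entropy_bits(counts, n):
    """Empirical entropy (bits) of a distribution given as counts over n samples."""
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mutual_information_bits(rows, i, j):
    """Empirical mutual information I(X_i; X_j) in bits per record."""
    n = len(rows)
    h_i = entropy_bits(Counter(r[i] for r in rows), n)
    h_j = entropy_bits(Counter(r[j] for r in rows), n)
    h_ij = entropy_bits(Counter((r[i], r[j]) for r in rows), n)
    return h_i + h_j - h_ij


def dependency_forest(rows, bits_per_param=None):
    """Return the edges of a maximum-weight forest over the columns of `rows`.

    Edge weight = n * I(X_i; X_j) - model_cost. The model cost used here is an
    ASSUMED stand-in penalty, not the correction term from the paper.
    """
    n, d = len(rows), len(rows[0])
    if bits_per_param is None:
        bits_per_param = 0.5 * math.log2(n)  # assumption: BIC/MDL-style cost
    alphabet = [len({r[k] for r in rows}) for k in range(d)]

    candidates = []
    for i, j in combinations(range(d), 2):
        saving = n * mutual_information_bits(rows, i, j)
        # Extra parameters of a conditional table relative to a marginal one.
        extra_params = (alphabet[i] - 1) * (alphabet[j] - 1)
        gain = saving - bits_per_param * extra_params
        if gain > 0:  # non-positive gains are dropped, so a forest can result
            candidates.append((gain, i, j))
    candidates.sort(reverse=True)

    # Kruskal's algorithm with a small union-find to avoid cycles.
    parent = list(range(d))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for _, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            edges.append((i, j))
    return edges


if __name__ == "__main__":
    # Column 1 copies column 0; column 2 is independent of both.
    table = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 50
    print(dependency_forest(table))
```

On this toy table only the pair of identical columns is linked; the independent column stays an isolated node, which is why the result is a forest rather than a single spanning tree. With a plain Chow-Liu tree (no penalty), every feature would be connected regardless of whether the dependency pays for its own description.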

