Compressing Tabular Data via Pairwise Dependencies
Authors: Dmitri S Pavlichin, Amir Ingber, Tsachy Weissman
Venue: Proceedings. Data Compression Conference, vol. 2017, p. 455
Published: 2017-04-01 (Epub 2017-05-11)
DOI: https://doi.org/10.1109/DCC.2017.82
Citation count: 1
Abstract
We propose a method and algorithm for lossless compression of tabular data, including, for example, machine learning datasets, server logs, and genomic datasets. The method achieves superior compression ratios by exploiting dependencies between the fields (or "features") of the dataset. The algorithm compresses the records with respect to a probabilistic graphical model, specifically an optimized forest in which each feature is a node. The work extends the Chow-Liu tree method by adding a more accurate correction term to the cost function; this term corresponds to the size required to describe the model itself. Additional features of the algorithm are efficient coding of the metadata (such as probability distributions) and data relabeling to cope with large datasets and alphabets. We test the algorithm on several datasets and demonstrate a 2x to 5x improvement in compression rate compared to gzip. The larger improvements are observed for very large datasets, such as the Criteo click prediction dataset, which was published as part of a recent Kaggle competition.
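The idea of a Chow-Liu forest with a model-cost correction can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the penalty used here is an assumed BIC-style term, standing in for the more accurate correction term that the abstract mentions but does not specify. An edge between two columns is kept only when the bits saved by coding one column conditioned on the other (n times the empirical mutual information) exceed the assumed cost of describing the pairwise model.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy_bits(values):
    """Empirical entropy of a sequence, in bits per symbol."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def mutual_information_bits(xs, ys):
    """Empirical I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy_bits(xs) + entropy_bits(ys) - entropy_bits(list(zip(xs, ys)))


def penalized_forest(columns):
    """Greedy (Kruskal) maximum-weight forest over the columns.

    Edge gain = n * I(X;Y) - model_cost. The model_cost below is an
    ASSUMED BIC-style penalty, not the correction term from the paper.
    """
    n = len(columns[0])
    alphabet = [len(set(col)) for col in columns]

    edges = []
    for i, j in combinations(range(len(columns)), 2):
        savings = n * mutual_information_bits(columns[i], columns[j])
        model_cost = 0.5 * (alphabet[i] - 1) * (alphabet[j] - 1) * log2(n)
        gain = savings - model_cost
        if gain > 0:  # keep only edges that pay for their own description
            edges.append((gain, i, j))
    edges.sort(reverse=True)

    # Union-find to reject edges that would close a cycle.
    parent = list(range(len(columns)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    forest = []
    for _gain, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            forest.append((i, j))
    return forest


if __name__ == "__main__":
    a = [0, 1] * 8        # 16 rows
    b = list(a)           # exact copy of a: fully dependent
    c = [0, 0, 1, 1] * 4  # empirically independent of a and b
    print(penalized_forest([a, b, c]))
```

In this toy table only the edge between the two identical columns survives, because the independent column offers no savings to offset its penalty; the result is a forest rather than a spanning tree. A full compressor would then code each column conditioned on its parent in the forest, for example with an arithmetic coder, which this sketch leaves out.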