Detecting correlated columns in relational databases with mixed data types
H. Nguyen, Emmanuel Müller, Periklis Andritsos, Klemens Böhm
International Conference on Scientific and Statistical Database Management (SSDBM), published 2014-06-30, pp. 30:1-30:12
DOI: 10.1145/2618243.2618251 (https://doi.org/10.1145/2618243.2618251)
Citations: 14
Abstract
In a database, besides known dependencies among columns (e.g., foreign key and primary key constraints), many other correlations are unknown to the database users. Extracting such hidden correlations is known to be useful for various tasks in database optimization and data analytics. However, the task is challenging because measures for quantifying column correlations are lacking. Correlations may exist among columns of different data types and value domains, which makes techniques based on value matching inapplicable. Moreover, a column may have multiple semantics, which rules out a disjoint partitioning of columns. Finally, from a computational perspective, one has to consider a huge search space that grows exponentially with the number of columns.
In this paper, we present a novel method for detecting column correlations (DeCoRel). It aims at discovering overlapping groups of correlated columns with mixed data types in relational databases. To handle the heterogeneity of data types, we propose a new correlation measure that combines the strengths of Shannon entropy and cumulative entropy. To address the huge search space, we introduce an efficient algorithm for column grouping. We show that our method is more general than one of the most recent approaches in the database literature. Experiments reveal that our method achieves both higher quality and better scalability than existing techniques.
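As a rough illustration of the two building blocks named above, the sketch below computes Shannon entropy for a categorical column and an empirical cumulative entropy for a numeric column. It is not the DeCoRel correlation measure, which the abstract does not specify. The function names are hypothetical, the cumulative entropy uses the common sorted-sample estimator, and base-2 logarithms are an assumed convention.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a categorical column.

    Illustrative helper, not taken from the paper.
    """
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cumulative_entropy(values):
    """Empirical cumulative entropy of a numeric column.

    Assumed estimator: -sum over consecutive sorted values of
    (gap between them) * F * log2(F), where F = i/n is the
    empirical CDF at the lower of the two values.
    """
    xs = sorted(values)
    n = len(xs)
    total = 0.0
    for i in range(1, n):
        p = i / n  # empirical CDF at xs[i - 1]
        total -= (xs[i] - xs[i - 1]) * p * math.log2(p)
    return total


if __name__ == "__main__":
    print(shannon_entropy(["a", "b", "a", "b"]))
    print(cumulative_entropy([0, 1]))
    print(cumulative_entropy([0, 2]))
```

Shannon entropy depends only on value frequencies, so it suits categorical columns. Cumulative entropy works on the empirical distribution function and scales with the spread of the values, so it can be applied to numeric columns without discretizing them first.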