Relevant overlapping subspace clusters on categorical data
Xiao He, Jing Feng, B. Konte, S. T. Mai, C. Plant
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014-08-24
DOI: 10.1145/2623330.2623652 (https://doi.org/10.1145/2623330.2623652)
Citations: 9
Abstract
Clustering categorical data poses unique challenges. Because categories have no inherent order or spacing, selecting a suitable similarity measure is difficult. Many existing techniques require the user to specify input parameters that are difficult to estimate. Moreover, many techniques are limited to detecting clusters in the full-dimensional data space. Only a few methods exist for subspace clustering, and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameters. The relevance of each cluster is validated by its contribution to compressing the data. By optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is specifically designed to detect subspace clusters on categorical data that may overlap in objects and/or attributes; i.e., objects can be assigned to different clusters in different subspaces, and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
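The compression-based relevance test described in the abstract can be illustrated with a small sketch. This is not ROCAT's coding scheme, which the abstract does not specify; it is a toy stand-in under stated assumptions. Each attribute is encoded with an entropy code, a candidate subspace cluster (a set of objects plus a set of attributes) gets its own code within its subspace, and a crude, hypothetical model cost of one membership bit per object and per attribute is charged. The function names `entropy_bits`, `baseline_cost`, and `cost_with_cluster` and the toy `data` are illustrative only.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Bits needed to encode `values` with an optimal code for their empirical distribution."""
    n = len(values)
    counts = Counter(values)
    return -sum(c * math.log2(c / n) for c in counts.values())


def baseline_cost(data):
    """Cost of the data without any cluster: each attribute encoded on its own."""
    d = len(data[0])
    return sum(entropy_bits([row[j] for row in data]) for j in range(d))


def cost_with_cluster(data, objects, attributes):
    """Cost when one candidate subspace cluster gets its own code.

    Assumption (not from the paper): the model cost is one membership bit
    per object and one per attribute.
    """
    n, d = len(data), len(data[0])
    inside = set(objects)
    total = 0.0
    for j in range(d):
        column = [row[j] for row in data]
        if j in attributes:
            # Within the cluster's subspace, members and non-members are coded separately.
            in_vals = [column[i] for i in range(n) if i in inside]
            out_vals = [column[i] for i in range(n) if i not in inside]
            total += entropy_bits(in_vals) + entropy_bits(out_vals)
        else:
            total += entropy_bits(column)
    model_bits = n + d  # crude, hypothetical model cost
    return total + model_bits


# Toy data: objects 0-3 agree on attributes 0 and 1; attribute 2 is noise.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "x", "p"), ("a", "x", "q"),
    ("b", "y", "p"), ("c", "z", "q"), ("b", "z", "p"), ("c", "y", "q"),
]

base = baseline_cost(data)
relevant = cost_with_cluster(data, {0, 1, 2, 3}, {0, 1})
irrelevant = cost_with_cluster(data, {0, 1, 2, 3}, {2})

print(f"baseline: {base} bits")
print(f"cluster on attributes 0,1: {relevant} bits (kept: {relevant < base})")
print(f"cluster on attribute 2: {irrelevant} bits (kept: {irrelevant < base})")
```

In this toy setting, the cluster over attributes 0 and 1 lowers the total description length and would be kept, while the same objects grouped on the noise attribute only add model cost and would be rejected. This mirrors the idea of accepting a cluster, or an overlap, only when it improves compression.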