Clustering categorical data: A stability analysis framework

I. Jarman, T. Etchells, P. Lisboa, Charlene Beynon, J. Martín-Guerrero
{"title":"Clustering categorical data: A stability analysis framework","authors":"I. Jarman, T. Etchells, P. Lisboa, Charlene Beynon, J. Martín-Guerrero","doi":"10.1109/CIDM.2011.5949452","DOIUrl":null,"url":null,"abstract":"Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but K-means is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which creates the difficulty of selecting the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation of the mean. This is often the case with real-world datasets, for instance in the domain of Public Health, resulting in solutions that can be radically different depending on the initialization and therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initializations using a generic landscape mapping of k-mode solutions. The second methodology utilizes the landscape map to stabilize the partition clusters for discrete data, by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset and a case study involving Public Health data.","PeriodicalId":211565,"journal":{"name":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIDM.2011.5949452","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Clustering to identify inherent structure is an important first step in data exploration. The k-means algorithm is a popular choice, but K-means is not generally appropriate for categorical data. A specific extension of k-means for categorical data is the k-modes algorithm. Both of these partition clustering methods are sensitive to the initialization of prototypes, which creates the difficulty of selecting the best solution for a given problem. In addition, selecting the number of clusters can be an issue. Further, the k-modes method is especially prone to instability when presented with ‘noisy’ data, since the calculation of the mode lacks the smoothing effect inherent in the calculation of the mean. This is often the case with real-world datasets, for instance in the domain of Public Health, resulting in solutions that can be radically different depending on the initialization and therefore lead to different interpretations. This paper presents two methodologies. The first addresses sensitivity to initializations using a generic landscape mapping of k-mode solutions. The second methodology utilizes the landscape map to stabilize the partition clusters for discrete data, by drawing a consensus sample in order to separate signal from noise components. Results are presented for the benchmark soybean disease dataset, an artificially generated dataset and a case study involving Public Health data.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
聚类分类数据:一个稳定性分析框架
聚类识别内在结构是数据探索中重要的第一步。k-means算法是一种流行的选择,但k-means通常不适用于分类数据。k-means对分类数据的具体扩展是k-modes算法。这两种划分聚类方法都对原型的初始化很敏感,这给给定问题选择最佳解决方案带来了困难。此外,选择集群的数量也是一个问题。此外,k模态方法在处理“噪声”数据时特别容易出现不稳定性,因为模态的计算缺乏平均值计算中固有的平滑效果。现实世界的数据集经常出现这种情况,例如在公共卫生领域,这导致解决方案可能因初始化而截然不同,从而导致不同的解释。本文提出了两种方法。第一个解决了使用k-mode解的通用横向映射对初始化的敏感性。第二种方法利用景观图来稳定离散数据的分区簇,通过绘制共识样本来分离信号和噪声成分。介绍了基准大豆疾病数据集、人工生成数据集和涉及公共卫生数据的案例研究的结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
A multi-Biclustering Combinatorial Based algorithm Active classifier training with the 3DS strategy Link Pattern Prediction with tensor decomposition in multi-relational networks Using gaming strategies for attacker and defender in recommender systems Generating materialized views using ant based approaches and information retrieval technologies
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1