Comparison of different similarity measures in hierarchical clustering

M. Vagni, N. Giordano, G. Balestra, S. Rosati
Published in: 2021 IEEE International Symposium on Medical Measurements and Applications (MeMeA)
Publication date: 2021-06-23
DOI: 10.1109/MeMeA52024.2021.9478746 (https://doi.org/10.1109/MeMeA52024.2021.9478746)
Citations: 2

Abstract

Managing datasets that contain heterogeneous types of data is crucial in precision medicine, where genetic, environmental, and lifestyle information about each individual has to be analyzed simultaneously. Clustering is a powerful data mining method for extracting new, useful knowledge from unlabeled datasets. Clustering methods are essentially distance-based, since they measure the similarity (or the distance) between two elements, or between one element and a cluster centroid. However, selecting the distance metric is not a trivial task: it can influence the clustering results and, thus, the extracted information. In this study we analyze the impact of four similarity measures (Manhattan or L1 distance, Euclidean or L2 distance, Chebyshev or L∞ distance, and Gower distance) on the clustering results obtained for datasets containing different types of variables. We applied hierarchical clustering, combined with an automatic cut-point selection method, to six datasets publicly available in the UCI Repository. Four different clusterings were obtained for every dataset (one for each distance) and were analyzed in terms of number of clusters, number of elements in each cluster, and cluster centroids. Our results showed that changing the distance metric substantially modifies the obtained clusters. This behavior is particularly evident for datasets containing heterogeneous variables. Thus, the distance measure should not be chosen a priori but evaluated according to the data to be analyzed and the task to be accomplished.
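
To make the four measures named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. Several things here are assumptions made only for illustration: the toy data in `points` and `mixed` is hypothetical, the helper names (`make_gower`, `agglomerative`) are invented, average linkage is used, and the dendrogram is cut at a fixed number of clusters because the abstract does not describe the automatic cut-point selection method. The Gower distance is written in its common form: range-normalized absolute difference for numeric variables and simple mismatch (0 or 1) for categorical ones, averaged over all variables.

```python
from itertools import combinations

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """L2 distance: square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def chebyshev(a, b):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def make_gower(rows):
    """Return a Gower distance function fitted to `rows`.

    Numeric columns contribute |x - y| / range; all other columns
    contribute 0 if equal and 1 otherwise. The result is the mean
    contribution over all columns.
    """
    cols = list(zip(*rows))
    numeric = [all(isinstance(v, (int, float)) for v in col) for col in cols]
    ranges = [max(col) - min(col) if num else None
              for col, num in zip(cols, numeric)]

    def gower(a, b):
        total = 0.0
        for x, y, num, rng in zip(a, b, numeric, ranges):
            if num:
                # A constant column (range 0) contributes nothing.
                total += abs(x - y) / rng if rng else 0.0
            else:
                total += 0.0 if x == y else 1.0
        return total / len(a)

    return gower

def agglomerative(rows, dist, n_clusters):
    """Naive average-linkage agglomerative clustering (an assumed linkage).

    Repeatedly merges the two closest clusters until `n_clusters` remain
    and returns the clusters as sorted lists of row indices.
    """
    clusters = [[i] for i in range(len(rows))]

    def linkage(c1, c2):
        pair_sum = sum(dist(rows[i], rows[j]) for i in c1 for j in c2)
        return pair_sum / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]  # q > p, so index p is unaffected
    return sorted(clusters)

# Hypothetical numeric toy data: three points in the plane.
points = [(0, 0), (4, 4), (-7, 0)]
metrics = {
    "manhattan": manhattan,
    "euclidean": euclidean,
    "chebyshev": chebyshev,
    "gower": make_gower(points),
}
for name, dist in metrics.items():
    print(name, agglomerative(points, dist, 2))
# manhattan [[0, 2], [1]]
# euclidean [[0, 1], [2]]
# chebyshev [[0, 1], [2]]
# gower [[0, 2], [1]]

# Hypothetical mixed-type data: Gower handles the categorical column directly.
mixed = [(30, "yes"), (40, "no"), (50, "yes")]
gower_mixed = make_gower(mixed)
print(gower_mixed(mixed[0], mixed[2]))  # 0.5
print(gower_mixed(mixed[0], mixed[1]))  # 0.75
```

Even on three points, the partition changes with the metric: Manhattan and Gower group rows 0 and 2, while Euclidean and Chebyshev group rows 0 and 1. This is a small-scale instance of the sensitivity the abstract reports, and it says nothing about the magnitude of the effect on the six UCI datasets.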