The novel hierarchical clustering approach using self-organizing map with optimum dimension selection

Kshitij Tripathi
Health Care Science, vol. 3, no. 2, pp. 88-100. Published 2024-04-11. DOI: 10.1002/hcs2.90
https://onlinelibrary.wiley.com/doi/10.1002/hcs2.90
Citations: 0

Abstract

Introduction

Data clustering is an important area of machine learning with applications in many fields, including business analysis, manufacturing, energy, healthcare, travel, and logistics, and a variety of clustering applications have already been developed. Clustering approaches based on the self-organizing map (SOM) generally use grid dimensions ranging from 2 × 2 to 8 × 8 (4–64 neurons, or microclusters) without giving an explicit reason for the chosen dimension, and therefore do not obtain optimized results. These algorithms then use a secondary approach to map the microclusters onto a lower dimension, the actual number of clusters (for example 2, 3, or 4), based on the optimum number of clusters for the specific data set. In most of these works the secondary approach is not a SOM but another algorithm, such as cut tree.

Methods

The proposed approach shows how to select the optimal higher dimension of the SOM for a given data set; the map of this dimension is then clustered again into the lower, actual dimension. Both the primary and the secondary stage use a SOM to cluster the data, and the work finds that the weight matrix of the SOM is very meaningful. The optimal two-dimensional configuration of the SOM is not the same for every data set, and this work also tries to discover that configuration.
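
The abstract does not give the training details, so the following is only a minimal sketch of the two-stage idea: a primary SOM produces microclusters, and a secondary SOM is trained on the primary weight matrix to merge them into the actual number of clusters. The function names (train_som, best_matching_unit, two_stage_som), the linear decay schedules, the sample-based initialisation, and the 1 × k layout of the secondary map are all assumptions made for illustration, not the author's implementation. The criterion for choosing the optimum primary grid is not described in the abstract and is therefore left out; the grid is simply passed in.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(weights[j], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols SOM and return its weight matrix (one vector per neuron).

    Assumed schedule: learning rate and neighbourhood width decay linearly.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    # Assumption: initialise every neuron from a randomly chosen sample.
    weights = [list(rng.choice(data)) for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    sigma0 = max(rows, cols) / 2.0
    order = list(range(len(data)))
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x = data[i]
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)
            sigma = max(sigma0 * (1.0 - frac), 0.3)
            bmu_r, bmu_c = coords[best_matching_unit(weights, x)]
            for j, (r, c) in enumerate(coords):
                # Gaussian neighbourhood measured on the grid, not in data space.
                grid_d2 = (r - bmu_r) ** 2 + (c - bmu_c) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                w = weights[j]
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def two_stage_som(data, grid, n_clusters, seed=0):
    """Cluster data with a primary SOM, then cluster its weights with a second SOM."""
    rows, cols = grid
    # Primary stage: each neuron of the larger map is one microcluster.
    micro = train_som(data, rows, cols, seed=seed)
    # Secondary stage: a 1 x n_clusters SOM trained on the primary weight matrix.
    macro = train_som(micro, 1, n_clusters, seed=seed)
    micro_to_cluster = [best_matching_unit(macro, w) for w in micro]
    # A sample inherits the final cluster of its microcluster.
    return [micro_to_cluster[best_matching_unit(micro, x)] for x in data]
```

Calling two_stage_som(samples, grid=(3, 3), n_clusters=2) returns one cluster label per sample. Reusing the same SOM routine for both stages mirrors the statement that the primary and secondary stages both use a SOM; a dedicated library could replace train_som without changing the overall structure.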

Results

The adjusted Rand index (ARI) obtained on the Iris, Wine, Wisconsin Diagnostic Breast Cancer, New Thyroid, Seeds, A1, Imbalance, Dermatology, Ecoli, and Ionosphere data sets is 0.7173, 0.9134, 0.7543, 0.8041, 0.7781, 0.8907, 0.8755, 0.7543, 0.5013, and 0.1728, respectively. The author reports that these values outperform all other results available on the web, and that they were obtained without any reduction of attributes.
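
The index reported above is presumably the standard adjusted Rand index (ARI), which compares a clustering with reference labels and corrects the pair-counting agreement for chance: ARI = (index - expected index) / (maximum index - expected index). The sketch below computes it from the contingency counts; the function name adjusted_rand_index and the handling of degenerate inputs are illustrative choices, and nothing here reproduces the paper's data or scores.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two samples: treat the labelings as trivially identical.
        return 1.0

    # Contingency counts: samples sharing a (true label, predicted label) pair.
    cell_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of sample pairs placed together in both, in true, and in predicted.
    index = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings use a single cluster.
        return 1.0
    return (index - expected) / (maximum - expected)


# Cluster names do not matter, only the grouping: this pair scores 1.0.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Splitting one true cluster in two lowers the score to 4/7.
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

A value of 1 means the clustering matches the reference grouping exactly, values near 0 mean agreement no better than chance, and negative values are possible. In practice scikit-learn's sklearn.metrics.adjusted_rand_score computes the same quantity.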

Conclusions

SOM is found to be superior to or on par with other clustering approaches, such as k-means, and could be used successfully to cluster all types of data sets. Ten benchmark data sets from diverse domains, including medical, biological, and chemical data as well as synthetic data sets, are tested in this work.
