NDPD: an improved initial centroid method of partitional clustering for big data mining

K. Pandey, D. Shukla
Journal of Advances in Management Research (JCR Q3, Management; impact factor 2.6). Published 2022-08-23. DOI: 10.1108/jamr-07-2021-0242 (https://doi.org/10.1108/jamr-07-2021-0242)

Abstract

Purpose: The K-means (KM) clustering algorithm is highly sensitive to the selection of initial centroids, because the initial centroids determine its computational effectiveness, its efficiency and its susceptibility to local optima. Numerous initialization strategies have been proposed to overcome these problems through random or deterministic selection of initial centroids. Random initialization is prone to local optima and, in the worst case, poor clustering performance, while deterministic initialization incurs a high computational cost. Big data clustering aims to reduce computation cost and improve cluster efficiency. The objective of this study is to obtain better initial centroids for big data clustering on business management data, without using random or deterministic initialization, so as to avoid local optima and improve clustering efficiency and effectiveness in terms of cluster quality, computation cost, data comparisons and iterations on a single machine.

Design/methodology/approach: This study presents the Normal Distribution Probability Density (NDPD) algorithm for big data clustering on a single machine to solve business management-related clustering issues. The NDPDKM algorithm addresses the KM clustering problem using the probability density of each data point. It first identifies the data points with the highest probability density, computed from the normal probability density function using the mean and standard deviation of the dataset. It then determines the K initial centroids using sorting and linear systematic sampling heuristics.

Findings: The performance of the proposed algorithm is compared with the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms on eight real business datasets, using the Davies-Bouldin score, Silhouette coefficient, SD Validity, S_Dbw Validity, number of iterations and CPU time as validation indices. The experimental evaluation shows that, compared with the other algorithms, the NDPDKM algorithm reduces iterations, local optima and computing cost, and improves cluster performance, effectiveness and efficiency with stable convergence. The NDPDKM algorithm reduces the average computing time by up to 34.83%, 90.28%, 71.83%, 92.67%, 69.53% and 76.03%, and the average number of iterations by up to 40.32%, 44.06%, 32.02%, 62.78%, 19.07% and 36.74%, relative to the KM, KM++, Var-Part, Murat-KM, Mean-KM and Sort-KM algorithms, respectively.

Originality/value: The KM algorithm is the most widely used partitional clustering approach in data mining, where it is used to extract hidden knowledge, patterns and trends from business data to support decision-making. Business analytics is one application of big data clustering: KM clustering is useful across its subcategories, such as customer segmentation analysis, employee salary and performance analysis, document searching, delivery optimization, discount and offer analysis, chaplain management, manufacturing analysis, productivity analysis, specialized employee and investor searching, and other business decision-making strategies.
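
The initialization described under Design/methodology/approach can be illustrated with a minimal Python sketch. It is built only from the abstract and is not the authors' implementation: the function names, the scoring of a point as the sum of per-feature normal log-densities (features treated as independent), and the choice of sampling offset are all assumptions made for illustration.

```python
import math
from statistics import fmean, pstdev


def log_normal_pdf(x, mu, sigma):
    """Log of the N(mu, sigma^2) density at x (log form avoids underflow)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def ndpd_style_centroids(points, k):
    """Pick k initial centroids by density scoring, sorting and
    linear systematic sampling.

    Hypothetical helper: the scoring rule and sampling offset below are
    assumptions, since the abstract does not specify them.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    # Per-feature mean and (population) standard deviation of the dataset.
    columns = list(zip(*points))
    mus = [fmean(col) for col in columns]
    # A constant feature has zero deviation; fall back to 1.0 (assumption).
    sigmas = [pstdev(col) or 1.0 for col in columns]

    def score(p):
        # Assumed score: sum of per-feature normal log-densities.
        return sum(log_normal_pdf(x, m, s) for x, m, s in zip(p, mus, sigmas))

    # Sort point indices by density score.
    order = sorted(range(n), key=lambda i: score(points[i]))

    # Linear systematic sampling: fixed step, starting mid-interval.
    step = n // k
    start = step // 2
    return [points[order[start + i * step]] for i in range(k)]


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1),
            (5.2, 4.9), (9.0, 9.1), (9.1, 8.9)]
    for centroid in ndpd_style_centroids(data, 3):
        print(centroid)
```

Because the selection involves no random draws, repeated calls on the same data return the same centroids, which could then seed any K-means implementation that accepts explicit initial centers.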