Speedup of the k-Means Algorithm for Partitioning Large Datasets of Flat Points by a Preliminary Partition and Selecting Initial Centroids

Applied Computer Systems (IF 0.5, Q4, COMPUTER SCIENCE, THEORY & METHODS) Pub Date: 2023-06-01 DOI: 10.2478/acss-2023-0001
V. Romanuke
Applied Computer Systems, pp. 1-12. https://doi.org/10.2478/acss-2023-0001
Citations: 0

Abstract

This paper considers the problem of partitioning large datasets of flat (planar) points. Known as the centroid-based clustering problem, it is mainly addressed by the k-means algorithm and its modifications. Because k-means performs worse on large datasets, particularly when the dataset shape is stretched, the goal is to study whether centroid-based clustering can be improved in such cases. On non-sparse datasets, the clusters produced by k-means noticeably resemble a honeycomb. This is natural for rectangular-shaped datasets, because hexagonal cells use space efficiently, which brings the sum of the within-cluster squared Euclidean distances to the centroids close to its minimum. Therefore, it is suggested to apply two lattices in succession: a lattice of rectangular clusters (stretched rectangles) followed by a lattice of hexagonal clusters (regular hexagons). The initial centroids are then calculated by averaging the points within the respective hexagons, and these centroids are used as seeds to start the k-means algorithm. This yields faster and more accurate convergence: the expected speedup is at least 1.7 to 2.1 times, with an accuracy gain of 0.7 to 0.9 %. The rectangular lattice, applied first, gives a rough but effective partition that optionally allows further clustering to run on parallel processor cores. The hexagonal lattice, applied to every rectangle, yields initial centroids very quickly. Such centroids are far closer to the solution than the initial centroids in the k-means++ algorithm. An alternative approach, in which initial centroids are selected separately within the hexagons of each rectangle, can be used as well. It is faster than selecting initial centroids across all hexagons but is less accurate: the speedup is 9 to 11 times, with a possible accuracy loss of 0.3 %. Even so, this approach may outperform the k-means algorithm. The speedup increases as both lattices become denser and the dataset becomes larger, reaching 30 to 50 times.
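The seeding idea described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the rectangular pre-partition stage is omitted, the number of clusters is simply the number of non-empty hexagons, and the function names (`hex_cell_ids`, `hex_seed_centroids`, `lloyd`) and the `size` parameter (hexagon center-to-vertex distance) are assumptions made for illustration. Points are binned into a regular hexagonal lattice, averaged within each hexagon, and the averages are used as initial centroids for plain Lloyd iterations.

```python
import numpy as np

def hex_cell_ids(points, size):
    """Axial (q, r) index of the pointy-top hexagon containing each 2-D point."""
    x, y = points[:, 0], points[:, 1]
    q = (np.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    rq, rr, rs = np.rint(q), np.rint(r), np.rint(s)
    dq, dr, ds = np.abs(rq - q), np.abs(rr - r), np.abs(rs - s)
    # Cube rounding: recompute the coordinate with the largest rounding error.
    fix_q = (dq > dr) & (dq > ds)
    fix_r = ~fix_q & (dr > ds)
    rq = np.where(fix_q, -rr - rs, rq)
    rr = np.where(fix_r, -rq - rs, rr)
    return np.stack([rq, rr], axis=1).astype(np.int64)

def hex_seed_centroids(points, size):
    """Average the points inside every non-empty hexagon (hypothetical helper)."""
    cells = hex_cell_ids(points, size)
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    k = inverse.max() + 1
    sums = np.zeros((k, 2))
    np.add.at(sums, inverse, points)          # per-hexagon coordinate sums
    counts = np.bincount(inverse, minlength=k)
    return sums / counts[:, None]

def lloyd(points, centroids, max_iter=100):
    """Plain Lloyd iterations started from the given centroids."""
    centroids = centroids.copy()
    for it in range(1, max_iter + 1):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):                  # keep the old centroid if empty
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = ((points - centroids[labels]) ** 2).sum()
    return centroids, labels, sse, it

# Demo on a stretched rectangular dataset (synthetic, for illustration only).
rng = np.random.default_rng(0)
points = rng.random((2000, 2)) * np.array([4.0, 1.0])
seeds = hex_seed_centroids(points, size=0.25)
centroids, labels, sse, iterations = lloyd(points, seeds)
print(len(seeds), iterations, sse)
```

Because the seeds already sit near the centers of honeycomb-like cells, Lloyd's algorithm is expected to need few iterations on dense, roughly uniform data; the sketch makes no claim about reproducing the speedup figures reported above.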
Source journal
Applied Computer Systems (COMPUTER SCIENCE, THEORY & METHODS)
Self-citation rate: 10.00%
Articles published: 9
Review time: 30 weeks