Big data clustering method based on parallel K-means
Haibo Liu, Yongbin Bai, Zhenhao Chen, Zhenfeng Zhang
2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA), pp. 893-897
Published: 2024-01-26
DOI: 10.1109/ICPECA60615.2024.10470970 (https://doi.org/10.1109/ICPECA60615.2024.10470970)
Citation count: 0
Abstract
In the era of big data, traditional clustering algorithms increasingly fail to meet application requirements, and optimizing data compression and parallelization methods has become a research hotspot. Building on an analysis of the traditional K-means clustering algorithm, this paper improves the parallelized K-means algorithm and proposes the Spark-Kmeans algorithm. The algorithm preserves the distribution information of the sample set by randomly sampling the large dataset, pre-clusters the samples on each node, and then re-clusters the pre-clustering results at the convergence node. The resulting centers serve as the initial cluster centers, which eliminates the convergence instability caused by random initialization of the cluster centers. Finally, single-node clustering and Spark-Kmeans clustering experiments are performed on the kdd_cup99 dataset and on a dataset randomly generated with sklearn, and the effectiveness of the algorithm is verified in terms of running time, purity, squared error, and other indexes.
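The initialization idea described in the abstract can be illustrated with a small single-process sketch. This is not the authors' Spark implementation: the worker nodes are simulated by splitting a random sample into chunks, and the names `sample_based_init` and `purity`, the parameters `n_partitions` and `sample_frac`, and the synthetic `make_blobs` data are all assumptions made for illustration. The purity function follows the common definition (share of points in the majority true class of their cluster), which may differ in detail from the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def sample_based_init(X, k, n_partitions=4, sample_frac=0.1, seed=0):
    """Derive k initial centers by pre-clustering chunks of a random sample."""
    rng = np.random.default_rng(seed)
    # Random sampling shrinks the data while roughly keeping its distribution.
    n_sample = min(len(X), max(k * n_partitions, int(sample_frac * len(X))))
    sample = X[rng.choice(len(X), size=n_sample, replace=False)]
    # Each chunk stands in for the part of the sample held by one worker node.
    local_centers = []
    for part in np.array_split(sample, n_partitions):
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(part)
        local_centers.append(km.cluster_centers_)
    # The "convergence node" step: re-cluster the pooled local centers into k.
    pooled = np.vstack(local_centers)
    merged = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(pooled)
    return merged.cluster_centers_


def purity(labels_true, labels_pred):
    """Fraction of points that fall in the majority true class of their cluster."""
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()
    return total / len(labels_true)


# Synthetic data in place of kdd_cup99 (assumption: 3 well-separated blobs).
X, y = make_blobs(n_samples=2000, centers=3, cluster_std=0.6, random_state=42)
init = sample_based_init(X, k=3)
# Final K-means run starts from the derived centers instead of random ones.
model = KMeans(n_clusters=3, init=init, n_init=1).fit(X)
sse = model.inertia_  # sum of squared distances to the nearest center
pur = purity(y, model.labels_)
print(f"SSE={sse:.1f}, purity={pur:.3f}")
```

Passing the derived centers through `init=` with `n_init=1` makes the final run start from a fixed configuration, which is the property the abstract relies on to avoid unstable convergence. In a real Spark job the per-chunk loop would presumably run on the executors, with only the small set of local centers collected for the merge step.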