{"title":"改进基于质心的聚类的随机质心初始化","authors":"V. Romanuke","doi":"10.31181/dmame622023742","DOIUrl":null,"url":null,"abstract":"A method for improving centroid-based clustering is suggested. The improvement is built on diversification of the k-means++ initialization. The k-means++ algorithm claimed to be a better version of k-means is tested by a computational set-up, where the dataset size, the number of features, and the number of clusters are varied. The statistics obtained on the testing have shown that, in roughly 50 % of instances to cluster, k-means++ outputs worse results than k-means with random centroid initialization. The impact of the random centroid initialization solidifies as both the dataset size and the number of features increase. In order to reduce the possible underperformance of k-means++, the k-means algorithm is run on a separate processor core in parallel to running the k-means++ algorithm, whereupon the better result is selected. The number of k-means++ algorithm runs is set not less than that of k-means. By incorporating the seeding method of random centroid initialization, the k-means++ algorithm gains about 0.05 % accuracy in every second instance to cluster.","PeriodicalId":32695,"journal":{"name":"Decision Making Applications in Management and Engineering","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Random centroid initialization for improving centroid-based clustering\",\"authors\":\"V. Romanuke\",\"doi\":\"10.31181/dmame622023742\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A method for improving centroid-based clustering is suggested. The improvement is built on diversification of the k-means++ initialization. The k-means++ algorithm claimed to be a better version of k-means is tested by a computational set-up, where the dataset size, the number of features, and the number of clusters are varied. The statistics obtained on the testing have shown that, in roughly 50 % of instances to cluster, k-means++ outputs worse results than k-means with random centroid initialization. The impact of the random centroid initialization solidifies as both the dataset size and the number of features increase. In order to reduce the possible underperformance of k-means++, the k-means algorithm is run on a separate processor core in parallel to running the k-means++ algorithm, whereupon the better result is selected. The number of k-means++ algorithm runs is set not less than that of k-means. 
By incorporating the seeding method of random centroid initialization, the k-means++ algorithm gains about 0.05 % accuracy in every second instance to cluster.\",\"PeriodicalId\":32695,\"journal\":{\"name\":\"Decision Making Applications in Management and Engineering\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Decision Making Applications in Management and Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.31181/dmame622023742\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Decision Sciences\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Decision Making Applications in Management and Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.31181/dmame622023742","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Decision Sciences","Score":null,"Total":0}
Random centroid initialization for improving centroid-based clustering
A method for improving centroid-based clustering is suggested. The improvement is built on diversifying the k-means++ initialization. The k-means++ algorithm, claimed to be a better version of k-means, is tested in a computational set-up where the dataset size, the number of features, and the number of clusters are varied. The statistics obtained from this testing show that, in roughly 50 % of clustering instances, k-means++ outputs worse results than k-means with random centroid initialization. The impact of random centroid initialization strengthens as both the dataset size and the number of features increase. To reduce the possible underperformance of k-means++, the k-means algorithm is run on a separate processor core in parallel with the k-means++ algorithm, whereupon the better of the two results is selected. The number of k-means++ runs is set to be no less than the number of k-means runs. By incorporating the seeding method of random centroid initialization, the k-means++ algorithm gains about 0.05 % accuracy in roughly every second clustering instance.
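The selection scheme the abstract describes is straightforward to prototype. Below is a minimal sketch, assuming a scikit-learn-style KMeans API; the synthetic dataset, the cluster count, and the helper name fit_kmeans are illustrative, not taken from the paper. It runs k-means++ seeding and randomly seeded k-means in parallel on separate processes and keeps whichever result has the lower within-cluster sum of squares.

```python
# Sketch of the parallel-selection idea: run k-means++ and randomly
# initialized k-means side by side and keep the better clustering.
# Assumes scikit-learn; dataset, K, and restart counts are illustrative.
from concurrent.futures import ProcessPoolExecutor

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def fit_kmeans(X, n_clusters, init, n_init, seed):
    """Fit k-means with the given initialization scheme and return the model."""
    km = KMeans(n_clusters=n_clusters, init=init, n_init=n_init,
                random_state=seed)
    km.fit(X)
    return km


if __name__ == "__main__":
    # Synthetic stand-in for an instance to cluster.
    X, _ = make_blobs(n_samples=2000, n_features=10, centers=8,
                      random_state=0)

    # Run the two variants on separate processes, mirroring the paper's
    # use of a separate processor core. The number of k-means++ restarts
    # is set no less than the number of random-initialization restarts,
    # as the abstract prescribes (here, equal).
    with ProcessPoolExecutor(max_workers=2) as pool:
        fut_pp = pool.submit(fit_kmeans, X, 8, "k-means++", 10, 1)
        fut_rand = pool.submit(fit_kmeans, X, 8, "random", 10, 1)
        model_pp, model_rand = fut_pp.result(), fut_rand.result()

    # Select the better result: lower inertia (within-cluster sum of
    # squared distances to the nearest centroid) wins.
    best = min((model_pp, model_rand), key=lambda m: m.inertia_)
    print("k-means++ inertia:  ", model_pp.inertia_)
    print("random-init inertia:", model_rand.inertia_)
```

Selecting by inertia_ corresponds to the "better result is selected" step in the abstract; since each variant occupies its own core, the combined run costs little extra wall-clock time over running k-means++ alone.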