K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind
{"title":"大数据快速k均值聚类的数据科学与工程解决方案","authors":"K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind","doi":"10.1109/Trustcom/BigDataSE/ICESS.2017.332","DOIUrl":null,"url":null,"abstract":"With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at a high velocity in the current era of big data. Embedded in these big data are implicit, previously unknown and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, there are risks associated with randomly selecting the k centroids, there is a tendency to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.","PeriodicalId":170253,"journal":{"name":"2017 IEEE Trustcom/BigDataSE/ICESS","volume":"17 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":"{\"title\":\"A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data\",\"authors\":\"K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind\",\"doi\":\"10.1109/Trustcom/BigDataSE/ICESS.2017.332\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at a high velocity in the current era of big data. Embedded in these big data are implicit, previously unknown and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, there are risks associated with randomly selecting the k centroids, there is a tendency to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.\",\"PeriodicalId\":170253,\"journal\":{\"name\":\"2017 IEEE Trustcom/BigDataSE/ICESS\",\"volume\":\"17 2\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"30\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE Trustcom/BigDataSE/ICESS\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.332\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Trustcom/BigDataSE/ICESS","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.332","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data
With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at a high velocity in the current era of big data. Embedded in these big data are implicit, previously unknown and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, there are risks associated with randomly selecting the k centroids, there is a tendency to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.