A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data

2017 IEEE Trustcom/BigDataSE/ICESS. Pub Date: 2017-08-01. DOI: 10.1109/Trustcom/BigDataSE/ICESS.2017.332

K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind

Citations: 30

A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data

With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at high velocity in the current era of big data. Embedded in these big data is implicit, previously unknown, and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, randomly selecting the k centroids carries risks, the algorithms tend to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies a heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.
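For readers unfamiliar with the baseline being criticized, the following is a minimal sketch of standard k-means (Lloyd's algorithm) with random centroid selection. It is not the heuristic prototype-based algorithm of the paper, whose details the abstract does not give. The function names `kmeans` and `inertia`, the `seed` parameter, and the toy data are illustrative assumptions. The sketch makes the three issues named above concrete: the starting centroids are drawn at random, each point is assigned by Euclidean distance (which favors compact, roughly circular clusters), and every iteration costs on the order of n * k * d distance operations for n points in d dimensions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100, seed=0):
    """Baseline Lloyd's k-means; `points` is a list of numeric tuples."""
    rng = random.Random(seed)
    # Random initialization: the result can depend on this draw.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: n * k distance evaluations per iteration.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


def inertia(points, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))


if __name__ == "__main__":
    # Hypothetical toy data: two small, well-separated groups.
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels, inertia(data, centers, labels))
```

On data this small and well separated the random start does not matter, but on larger or overlapping data different seeds can yield different clusterings, which is the initialization risk the abstract refers to.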
