大数据快速k均值聚类的数据科学与工程解决方案

K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind
{"title":"大数据快速k均值聚类的数据科学与工程解决方案","authors":"K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind","doi":"10.1109/Trustcom/BigDataSE/ICESS.2017.332","DOIUrl":null,"url":null,"abstract":"With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at a high velocity in the current era of big data. Embedded in these big data are implicit, previously unknown and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, there are risks associated with randomly selecting the k centroids, there is a tendency to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.","PeriodicalId":170253,"journal":{"name":"2017 IEEE Trustcom/BigDataSE/ICESS","volume":"17 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":"{\"title\":\"A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data\",\"authors\":\"K. Dierckens, Adrian B. Harrison, C. Leung, Adrienne V. Pind\",\"doi\":\"10.1109/Trustcom/BigDataSE/ICESS.2017.332\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at a high velocity in the current era of big data. Embedded in these big data are implicit, previously unknown and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, there are risks associated with randomly selecting the k centroids, there is a tendency to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.\",\"PeriodicalId\":170253,\"journal\":{\"name\":\"2017 IEEE Trustcom/BigDataSE/ICESS\",\"volume\":\"17 2\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"30\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE Trustcom/BigDataSE/ICESS\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.332\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE Trustcom/BigDataSE/ICESS","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.332","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 30

摘要

随着技术的进步,在当前的大数据时代,可以很容易地收集或高速生成大量、种类繁多、不同准确性的有价值数据。在这些大数据中嵌入了隐含的、以前未知的、潜在有用的信息。因此,需要从这些大数据中挖掘和发现知识的快速、可扩展的大数据科学和工程解决方案。一个流行且实用的数据挖掘任务是将相似的数据分组到集群中(即聚类)。为了对非常大的数据或大数据进行聚类,基于k-means的算法已经被广泛使用。虽然许多现有的k-means算法给出了高质量的结果,但它们也存在一些问题。例如,随机选择k个质心存在风险,可能会产生大致相等的圆形簇,并且运行时复杂性非常高。为了解决这些问题,本文提出了一种基于启发式原型算法的大数据科学与工程解决方案。评估结果表明了该方案的有效性和可扩展性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data
With advances in technology, high volumes of a wide variety of valuable data of different veracity can be easily collected or generated at a high velocity in the current era of big data. Embedded in these big data are implicit, previously unknown and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these big data are in demand. A popular and practical data mining task is to group similar data into clusters (i.e., clustering). To cluster very large data or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, there are risks associated with randomly selecting the k centroids, there is a tendency to produce roughly equal circular clusters, and the runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Insider Threat Detection Through Attributed Graph Clustering SEEAD: A Semantic-Based Approach for Automatic Binary Code De-obfuscation A Public Key Encryption Scheme for String Identification Vehicle Incident Hot Spots Identification: An Approach for Big Data Implementing Chain of Custody Requirements in Database Audit Records for Forensic Purposes
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1