Analysis and performance improvement of K-means clustering in big data environment

Purva Rathore, Deepak Shukla
2015 International Conference on Communication Networks (ICCN), published 2015-11-01
DOI: 10.1109/ICCN.2015.9 (https://doi.org/10.1109/ICCN.2015.9)
Citations: 6

Abstract

Big data environments support the processing of very large volumes of data, on the order of gigabytes to terabytes. Online applications that generate heavy data request loads, such as Facebook and Google, are therefore served using big data technologies. This work studies the big data environment, examining how data is consumed in it and how the supporting tools work with Hadoop storage. To investigate further, a cluster analysis technique, specifically the k-means clustering algorithm, is implemented with Hadoop and MapReduce. Clustering is a part of big data analytics in which unlabelled data is processed and organized into groups. The traditional k-means algorithm is observed not to fit Hadoop and MapReduce well, so a small modification is made to the data processing technique. In addition, cluster analysis reveals several issues with traditional k-means: fluctuating accuracy, outliers, and empty clusters. A new clustering algorithm that modifies the traditional k-means approach is therefore proposed and implemented. The approach first improves data quality by removing outlier points from the dataset and then uses the bi-part method to perform the clustering. The proposed technique is implemented using Java, Hadoop, and MapReduce, and its performance is evaluated and compared with the traditional k-means clustering algorithm. The results show more accurate cluster formation and the removal of these deficiencies. The proposed work is thus suitable for the big data environment and improves clustering performance.
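The abstract gives no implementation details, so the following is only a minimal single-machine sketch of the general idea: filter outliers first, then run k-means as repeated map (assign each point to its nearest centroid) and reduce (average each group) phases. The function names, the distance-based outlier threshold `mean + z * std`, and the policy of keeping the old centroid for an empty cluster are all assumptions made for illustration; they are not taken from the paper, and the bi-part method is not reproduced here because the abstract does not define it.

```python
from collections import defaultdict
from math import dist, sqrt


def remove_outliers(points, z=2.0):
    """Drop points whose distance to the global centroid exceeds mean + z * std.

    Assumed rule for illustration; the abstract does not state the criterion.
    """
    dim = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    distances = [dist(p, centroid) for p in points]
    mean = sum(distances) / len(distances)
    std = sqrt(sum((d - mean) ** 2 for d in distances) / len(distances))
    return [p for p, d in zip(points, distances) if d <= mean + z * std]


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        yield min(range(len(centroids)), key=lambda i: dist(p, centroids[i])), p


def reduce_phase(pairs, centroids):
    """Reduce: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = []
    for i, old in enumerate(centroids):
        members = groups.get(i)
        if not members:
            # Empty cluster: keep the previous centroid (assumed policy).
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(p[j] for p in members) / len(members) for j in range(len(old)))
        )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Repeat the map and reduce phases until the centroids stop changing."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical toy data: two tight groups plus one far-away point.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
        (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (100.0, 100.0)]
clean = remove_outliers(data)
centers = kmeans(clean, [(0.0, 0.0), (10.0, 10.0)])
```

In a real Hadoop job the map and reduce phases would run as distributed tasks, with the current centroids shared with every mapper between iterations. The sketch only mirrors that data flow in memory.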