{"title":"大数据环境下K-means聚类分析及性能改进","authors":"Purva Rathore, Deepak Shukla","doi":"10.1109/ICCN.2015.9","DOIUrl":null,"url":null,"abstract":"The big data environment is used to support the huge amount of data processing. In this environment tons (i.e. Giga bytes, Tera bytes) of data is processed. Therefore the various online applications where the huge data request are generated are treated using the big data i.e. facebook, google. In this presented work the big data environment is studied and investigated how the data is consumed using the big data and how the supporting tools are working with the Hadoop storage. Furthermore, for keen understanding and investigation, a cluster analysis technique more specifically the K-mean clustering algorithm is implemented through the Hadoop and MapReduce. The clustering is a part of big data analytics where the unlabelled data is processed and utilized to make groups of the data. In addition of that it is observed the traditional k-mean algorithm is not much suitably works with the Hadoop and MapReduce thus small amount of modification is performed on the data processing technique. In addition of that during cluster analysis various issues are found in traditional k-means i.e. fluctuating accuracy, outliers and empty cluster. Therefore a new clustering algorithm with modification on traditional approach of k-means clustering is proposed and implemented. That approach first enhances the data quality by removing the outlier points in datasets and then the bi-part method is used to perform the clustering. The proposed clustering technique implemented using the JAVA, Hadoop and MapReduce finally the performance of the proposed clustering approach is evaluated and compared with the traditional k-means clustering algorithm. The obtained performance shows the effective results and enhanced accuracy of cluster formation with the removal of the de-efficiency. Thus the proposed work is adoptable for the big data environment with improving the performance of clustering.","PeriodicalId":431743,"journal":{"name":"2015 International Conference on Communication Networks (ICCN)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Analysis and performance improvement of K-means clustering in big data environment\",\"authors\":\"Purva Rathore, Deepak Shukla\",\"doi\":\"10.1109/ICCN.2015.9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The big data environment is used to support the huge amount of data processing. In this environment tons (i.e. Giga bytes, Tera bytes) of data is processed. Therefore the various online applications where the huge data request are generated are treated using the big data i.e. facebook, google. In this presented work the big data environment is studied and investigated how the data is consumed using the big data and how the supporting tools are working with the Hadoop storage. Furthermore, for keen understanding and investigation, a cluster analysis technique more specifically the K-mean clustering algorithm is implemented through the Hadoop and MapReduce. The clustering is a part of big data analytics where the unlabelled data is processed and utilized to make groups of the data. In addition of that it is observed the traditional k-mean algorithm is not much suitably works with the Hadoop and MapReduce thus small amount of modification is performed on the data processing technique. In addition of that during cluster analysis various issues are found in traditional k-means i.e. fluctuating accuracy, outliers and empty cluster. Therefore a new clustering algorithm with modification on traditional approach of k-means clustering is proposed and implemented. That approach first enhances the data quality by removing the outlier points in datasets and then the bi-part method is used to perform the clustering. The proposed clustering technique implemented using the JAVA, Hadoop and MapReduce finally the performance of the proposed clustering approach is evaluated and compared with the traditional k-means clustering algorithm. The obtained performance shows the effective results and enhanced accuracy of cluster formation with the removal of the de-efficiency. Thus the proposed work is adoptable for the big data environment with improving the performance of clustering.\",\"PeriodicalId\":431743,\"journal\":{\"name\":\"2015 International Conference on Communication Networks (ICCN)\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 International Conference on Communication Networks (ICCN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCN.2015.9\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Communication Networks (ICCN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCN.2015.9","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Analysis and performance improvement of K-means clustering in big data environment
The big data environment is used to support the huge amount of data processing. In this environment tons (i.e. Giga bytes, Tera bytes) of data is processed. Therefore the various online applications where the huge data request are generated are treated using the big data i.e. facebook, google. In this presented work the big data environment is studied and investigated how the data is consumed using the big data and how the supporting tools are working with the Hadoop storage. Furthermore, for keen understanding and investigation, a cluster analysis technique more specifically the K-mean clustering algorithm is implemented through the Hadoop and MapReduce. The clustering is a part of big data analytics where the unlabelled data is processed and utilized to make groups of the data. In addition of that it is observed the traditional k-mean algorithm is not much suitably works with the Hadoop and MapReduce thus small amount of modification is performed on the data processing technique. In addition of that during cluster analysis various issues are found in traditional k-means i.e. fluctuating accuracy, outliers and empty cluster. Therefore a new clustering algorithm with modification on traditional approach of k-means clustering is proposed and implemented. That approach first enhances the data quality by removing the outlier points in datasets and then the bi-part method is used to perform the clustering. The proposed clustering technique implemented using the JAVA, Hadoop and MapReduce finally the performance of the proposed clustering approach is evaluated and compared with the traditional k-means clustering algorithm. The obtained performance shows the effective results and enhanced accuracy of cluster formation with the removal of the de-efficiency. Thus the proposed work is adoptable for the big data environment with improving the performance of clustering.