An analysis of MapReduce efficiency in document clustering using parallel K-means algorithm
Authors: Tanvir Habib Sardar, Zahid Ansari
Journal: Future Computing and Informatics Journal, Volume 3, Issue 2, Pages 200-209
Publication date: 2018-12-01
Publication type: Journal Article
DOI: 10.1016/j.fcij.2018.03.003
URL: https://www.sciencedirect.com/science/article/pii/S2314728817300661
Citation count: 46
Abstract
Clustering is one of the significant data mining techniques. With the expansion and digitalization of every field, large datasets are being generated rapidly. Clustering such large datasets is a challenge for traditional sequential clustering algorithms because of the long processing time required. Distributed parallel architectures and algorithms therefore help meet the performance and scalability requirements of clustering large datasets. In this study, we design and evaluate a parallel k-means algorithm using the MapReduce programming model and compare its results with those of sequential k-means when clustering document datasets of varying sizes. The results demonstrate that the proposed parallel k-means outperforms sequential k-means when clustering documents.
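To illustrate how k-means is commonly expressed in the MapReduce model, the following is a minimal single-process sketch in Python. It is not the authors' implementation: the function names (`map_phase`, `reduce_phase`, `kmeans_mapreduce`), the use of squared Euclidean distance, and the stopping rule are all assumptions made for illustration, and documents are assumed to be already converted to numeric feature vectors. In a real cluster the map calls would run in parallel over data splits, and the framework would perform the grouping by key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Vector = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (an assumed metric for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points: Iterable[Vector],
              centroids: List[Vector]) -> Iterator[Tuple[int, Vector]]:
    """Map: emit (index of nearest centroid, point) for every input point."""
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(point, centroids[i]))
        yield nearest, point


def reduce_phase(pairs: Iterable[Tuple[int, Vector]],
                 centroids: List[Vector]) -> List[Vector]:
    """Reduce: average the points grouped under each centroid index."""
    groups: Dict[int, List[Vector]] = defaultdict(list)
    for key, point in pairs:  # stands in for the framework's shuffle step
        groups[key].append(point)

    updated = list(centroids)  # a centroid with no points keeps its position
    for key, members in groups.items():
        dim = len(members[0])
        updated[key] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return updated


def kmeans_mapreduce(points: List[Vector],
                     centroids: List[Vector],
                     max_iter: int = 20) -> List[Vector]:
    """Repeat one map/reduce round per iteration until centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Each iteration corresponds to one MapReduce job: the map step depends only on the current centroids and a single point, which is what makes it straightforward to distribute across data splits, while the reduce step needs only the points that share a key.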