Similarity Measure Approaches Applied in Text Document Clustering for Information Retrieval
Authors: Naveen Kumar, S. Yadav, Divakar Yadav
Published in: 2020 Sixth International Conference on Parallel, Distributed and Grid Computing (PDGC)
Publication date: 2020-11-06
DOI: 10.1109/PDGC50313.2020.9315851 (https://doi.org/10.1109/PDGC50313.2020.9315851)
Citations: 2
Abstract
With the ever-increasing volume of text on the web and in digitized libraries, organizing these documents has become a practical necessity. Document clustering is an important technique that automatically organizes a large number of documents into a small number of coherent groups. It partitions documents into clusters such that documents within the same cluster are highly similar to one another and dissimilar to documents in other clusters. Common applications of document clustering include grouping similar news articles, analysis of customer feedback, text mining, duplicate content detection, finding similar documents, search optimization, and many more. Clustering thus helps users find the required information in these documents efficiently. Document clustering requires a measure for evaluating how different two given documents are. This dissimilarity is often estimated using a distance or similarity measure, for example, cosine similarity or Euclidean distance. In our work, we evaluate and analyze how effective these measures are in partitional clustering of text document datasets. Our experiments use the standard K-means algorithm, and we report detailed results on six text document datasets and five of the most commonly used distance or similarity measures in text clustering.
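As a rough illustration of the two measures the abstract names, the following sketch computes cosine similarity and Euclidean distance between documents and uses the latter in a bare-bones K-means loop. It is not the authors' implementation: the raw term-frequency representation, the toy documents, the function names, and the deterministic choice of the first k vectors as initial centroids are all assumptions made for this example. The abstract does not list the other three measures, so none are shown.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Raw term-frequency vector over a fixed vocabulary (an assumed representation)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=10):
    """Minimal K-means with Euclidean distance.

    Assumption: the first k vectors serve as initial centroids so the
    example is deterministic; real runs usually use random or k-means++ seeding.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (hypothetical, not one of the paper's six datasets).
docs = [
    "stock market prices fall",
    "football team wins match",
    "stock market prices rise",
    "football team loses match",
]
vocabulary = sorted({word for d in docs for word in d.lower().split()})
vectors = [term_vector(d, vocabulary) for d in docs]

sim_finance = cosine_similarity(vectors[0], vectors[2])
dist_finance = euclidean_distance(vectors[0], vectors[2])
sim_mixed = cosine_similarity(vectors[0], vectors[1])
labels = kmeans(vectors, k=2)

print(sim_finance, dist_finance, sim_mixed, labels)
```

The two finance documents share three of their four terms, so they score high on cosine similarity and low on Euclidean distance, while the finance and sports documents share no terms at all. Swapping `euclidean_distance` for another measure inside the assignment step is the kind of comparison the abstract describes, though combining plain K-means with a non-Euclidean measure needs extra care in the centroid update.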