Similarity Measure Approaches Applied in Text Document Clustering for Information Retrieval

Naveen Kumar, S. Yadav, Divakar Yadav
Published in: 2020 Sixth International Conference on Parallel, Distributed and Grid Computing (PDGC)
Publication date: 2020-11-06
DOI: 10.1109/PDGC50313.2020.9315851 (https://doi.org/10.1109/PDGC50313.2020.9315851)
Citations: 2

Abstract

As the volume of text on the web and in digitized libraries keeps growing, organizing these documents has become a practical necessity. Document clustering is an important technique that automatically organizes a large number of documents into a small number of coherent groups. It partitions documents into clusters such that documents within the same cluster are highly similar to one another and dissimilar to documents in other clusters. Common applications include grouping similar news articles, analyzing customer feedback, text mining, duplicate content detection, finding similar documents, and search optimization. Clustering in turn makes it possible to find required information in these documents efficiently. Document clustering requires a measure of how dissimilar two given documents are. This dissimilarity is usually estimated with a distance or similarity measure such as cosine similarity or Euclidean distance. In this work, we evaluate and analyze how effective these measures are in partitional clustering of text document datasets. Our experiments use the standard K-means algorithm, and we report results on six text document datasets for the five distance or similarity measures most commonly used in text clustering.
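To make the two measures named in the abstract concrete, here is a minimal Python sketch of cosine similarity and Euclidean distance over bag-of-words vectors, together with a small K-means loop that accepts either one as its distance. It is an illustration under stated assumptions, not the authors' implementation: the names `term_frequencies`, `cosine_distance`, and `kmeans` are hypothetical, documents are tokenized by plain whitespace splitting, raw term counts stand in for whatever weighting the paper uses, and the initial centroids are simply the first `k` documents so that the result is reproducible.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Map a document to a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Turn the similarity into a dissimilarity so K-means can minimize it."""
    return 1.0 - cosine_similarity(a, b)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse vectors."""
    terms = a.keys() | b.keys()
    return math.sqrt(sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in terms))


def kmeans(vectors, k, distance, iterations=10):
    """K-means with a pluggable distance function.

    Assumption: centroids start as the first k vectors (deterministic),
    and each centroid is updated to the arithmetic mean of its members.
    """
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if not members:
                continue  # keep the old centroid if the cluster is empty
            totals = Counter()
            for m in members:
                totals.update(m)
            centroids[c] = {t: n / len(members) for t, n in totals.items()}
    return labels
```

Calling `kmeans(vectors, 2, cosine_distance)` and `kmeans(vectors, 2, euclidean_distance)` on the same vectors is one simple way to compare how the choice of measure changes the resulting partition. Note that averaging members is the natural centroid for Euclidean distance; using it with cosine distance is a common simplification rather than something the abstract specifies.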