Malay document clustering using complete linkage clustering technique with Cosine Coefficient

Nurazzah Abd Rahman, Zainab Binti Abu Bakar, Nurul Syeilla Syazhween Zulkefli
2015 IEEE Conference on Open Systems (ICOS), published 2015-08-01. DOI: 10.1109/ICOS.2015.7377286 (https://doi.org/10.1109/ICOS.2015.7377286)
Citations: 5

Abstract

Finding useful and relevant information is a challenging task for the user. A retrieval system usually responds with a long list of documents that are not necessarily relevant to the user's need. Document clustering is a technique that groups documents so that documents in the same cluster are similar to each other and documents in different clusters are dissimilar. This paper focuses on document clustering for a Malay test collection consisting of 2028 Malay-translated Hadith documents from the book Sahih Bukhari. It presents the results of applying the Complete Linkage Clustering algorithm with the Cosine Coefficient to these documents. The experiments are evaluated using the Recall (R), Precision (P) and Effectiveness (E) measures, and are conducted with 100 clusters, 50 clusters and 20 clusters. The results show that as the size of the clusters gets smaller, Recall (R) increases but Precision (P) decreases. Comparing the Effectiveness (E) measure against the non-clustered documents shows that applying the clustering algorithm improves the effectiveness of the search process. For this experiment, 20 clusters is the most effective of the three settings.
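The technique named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy documents, the function names, and the raw term-frequency weighting are all assumptions made for illustration, and the E measure is assumed to be van Rijsbergen's E = 1 - (1 + β²)PR / (β²P + R) with β = 1, which the abstract does not confirm.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine coefficient between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def complete_linkage(vectors, k):
    """Agglomerative clustering down to k clusters.

    The distance between two clusters is the LARGEST pairwise
    distance (1 - cosine) between their members (complete linkage).
    """
    clusters = [[i] for i in range(len(vectors))]

    def dist(i, j):
        return 1.0 - cosine(vectors[i], vectors[j])

    while len(clusters) > k:
        # Pick the pair of clusters whose farthest members are closest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: max(
                dist(i, j) for i in clusters[p[0]] for j in clusters[p[1]]
            ),
        )
        clusters[a] += clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters


def evaluate(retrieved, relevant, beta=1.0):
    """Precision, Recall and (assumed) van Rijsbergen E for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    denom = beta ** 2 * p + r
    e = 1.0 - ((1 + beta ** 2) * p * r / denom if denom else 0.0)
    return p, r, e


# Hypothetical toy documents, not taken from the Hadith collection.
docs = ["solat puasa zakat", "solat puasa", "jual beli harga", "jual beli"]
vectors = [Counter(d.split()) for d in docs]

clusters = complete_linkage(vectors, k=2)
print(clusters)  # [[0, 1], [2, 3]]

# Suppose a query retrieves documents 0, 1, 2 and only 0, 1 are relevant.
p, r, e = evaluate(retrieved=[0, 1, 2], relevant=[0, 1])
print(p, r, e)
```

Under this assumed definition a lower E means a more effective search. The pairwise search above recomputes distances at every merge, which is fine for a toy example but would be slow on a collection of 2028 documents; a cached distance matrix or a library routine would be the practical choice.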