Clustering sentence level-text using fuzzy hierarchical algorithm

G. Priya, G. Anupriya
Published in: 2013 International Conference on Human Computer Interactions (ICHCI), 2013-08-01
DOI: 10.1109/ICHCI-IEEE.2013.6887778 (https://doi.org/10.1109/ICHCI-IEEE.2013.6887778)
Citations: 2

Abstract

Clustering is a popular technique for unsupervised text analysis, often used to explore the content of large collections of sentences. It groups sentences according to their similarity. A sentence may contain several interrelated concepts, yet flat clustering algorithms allow a sentence to belong to only one cluster. Moreover, sentences are semantically related to each other, so word co-occurrence is not a valid measure for sentence-level flat clustering. A WordNet-based semantic similarity measure combined with a fuzzy sentence clustering algorithm is therefore proposed. The existing system uses the Fuzzy C-Means algorithm, in which the number of clusters must be specified as an input and whose rigorous convergence criteria make the time complexity much larger. Most NLP documents are hierarchical in nature, so a fuzzy hierarchical sentence clustering algorithm is used here. Each cluster is labeled according to the hierarchy formed. Instead of considering all the terms in a sentence, only the verbs and nouns are used in the similarity computation: agglomerative clustering based on verbs and divisive clustering based on nouns are proposed. The methodology is validated using performance measures such as purity, entropy, and time. Across various datasets, the overall improvement is 36.6% in purity and 31% in entropy. The time complexity of the hierarchical algorithm is much lower than that of the EM algorithm. Thus, the Fuzzy Hierarchical Sentence Clustering Algorithm forms better-quality clusters in comparatively less time.
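
The abstract names purity and entropy as evaluation measures but does not give their formulas. The sketch below uses the common textbook definitions and is an assumption, not necessarily the authors' exact computation: purity is the fraction of cluster members that carry their cluster's majority class, and entropy is the size-weighted average of the per-cluster class entropy (base 2, not normalized). The function names `purity` and `entropy`, the input layout, and the toy data are hypothetical. Because a fuzzy clustering may place one sentence in several clusters, the total is taken as the sum of cluster sizes rather than the number of distinct sentences.

```python
from collections import Counter
from math import log2


def purity(clusters, labels):
    """Fraction of cluster members that belong to their cluster's majority class.

    clusters: list of lists of sentence indices (a sentence may appear in
              more than one cluster, as in a fuzzy clustering).
    labels:   list of reference class labels, indexed by sentence.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(
        max(Counter(labels[i] for i in c).values()) for c in clusters if c
    )
    return majority / total


def entropy(clusters, labels):
    """Size-weighted average of the class entropy of each cluster (lower is better)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue  # empty clusters contribute nothing
        counts = Counter(labels[i] for i in c)
        h = -sum((n / len(c)) * log2(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result


# Toy data (hypothetical): six sentences, two reference classes, two clusters.
labels = ["a", "a", "a", "b", "b", "b"]
clusters = [[0, 1, 2, 3], [4, 5]]

if __name__ == "__main__":
    print("purity :", purity(clusters, labels))
    print("entropy:", entropy(clusters, labels))
```

Higher purity and lower entropy both indicate clusters that agree more closely with the reference classes, which is the sense in which the reported improvements should be read.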