Clustering sentence level-text using fuzzy hierarchical algorithm
G. Priya, G. Anupriya
2013 International Conference on Human Computer Interactions (ICHCI), 2013-08-01
DOI: 10.1109/ICHCI-IEEE.2013.6887778
Citations: 2
Abstract
Clustering is a popular technique for unsupervised text analysis and is often used to explore the content of large collections of sentences. It groups sentences by their similarity. A sentence may contain several interrelated concepts, yet flat clustering algorithms allow each sentence to belong to only one cluster. Moreover, because sentences are semantically related to each other, word co-occurrence is not a valid measure for sentence-level flat clustering. A WordNet-based semantic similarity measure combined with a fuzzy sentence clustering algorithm is therefore proposed. The existing system uses the Fuzzy C-Means algorithm, which requires the number of clusters to be specified as an input, and its rigorous convergence criteria make its time complexity much higher. Because most NLP documents are hierarchical in nature, a fuzzy hierarchical sentence clustering algorithm is used here. Each cluster is labeled according to the hierarchy formed and, instead of all the terms in a sentence, only the verbs and nouns are considered in the similarity computation. Agglomerative clustering based on verbs and divisive clustering based on nouns are proposed. The methodology is validated using purity, entropy, and running time as performance measures. Across the datasets compared, the overall improvement was 36.6% in purity and 31% in entropy. The time complexity of the hierarchical algorithm is much lower than that of the EM algorithm. Thus, the Fuzzy Hierarchical Sentence Clustering Algorithm forms better-quality clusters in comparatively less time.
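The abstract names purity and entropy as quality measures but does not give their formulas. The sketch below uses the standard textbook definitions and may differ from the paper's exact formulation. The function names, the toy sentence ids, and the choice to count a sentence once per cluster it belongs to (to accommodate fuzzy, overlapping clusters) are assumptions made for illustration.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Fraction of cluster members that carry their cluster's majority label.

    clusters: list of lists of sentence ids (a sentence may appear in
              several clusters; it is then counted once per cluster,
              which is an assumption of this sketch).
    labels:   dict mapping sentence id -> reference class label.
    """
    total = sum(len(c) for c in clusters)
    majority = sum(
        max(Counter(labels[s] for s in c).values()) for c in clusters if c
    )
    return majority / total


def entropy(clusters, labels):
    """Size-weighted average of the label entropy (in bits) of each cluster."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        if not c:
            continue
        counts = Counter(labels[s] for s in c)
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result


# Hypothetical toy data: five sentences, two reference classes.
labels = {"s1": "a", "s2": "a", "s3": "b", "s4": "b", "s5": "b"}
clusters = [["s1", "s2", "s3"], ["s4", "s5"]]

print(purity(clusters, labels))
print(entropy(clusters, labels))
```

Higher purity and lower entropy both indicate clusters that agree more closely with the reference classes, which is the sense in which the abstract reports improvements in the two measures.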