Towards Reliable Clustering of English Text Documents Using Correlation Coefficient
Hrishikesh Bhaumik, Biswanath Chakraborty, A. Mukherjee, S. Bhattacharyya, Manojit Chattopadhyay
2014 International Conference on Computational Intelligence and Communication Networks, pp. 530-535. Published 2014-11-14.
DOI: 10.1109/CICN.2014.121 (https://doi.org/10.1109/CICN.2014.121)
Citations: 4
Abstract
This paper proposes a new approach to clustering English text documents, based on the pairwise correlation of the documents in a given set. The correlation coefficient for each pair of documents is calculated from ranks assigned to the words in the documents. The rank of each word in a document is derived from its weight, computed according to the conventional TF-IDF factor. The proposed method clusters a given set of text documents into classes according to their contents, where the number of classes is not known a priori. Experimental results show that the proposed method of text categorization using the correlation coefficient performs better than some other text categorization methods, including methods that use artificial neural networks.
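
The pipeline described in the abstract (TF-IDF weights, word ranks, pairwise correlation, grouping without a preset number of classes) can be illustrated with a minimal Python sketch. Several details are assumptions, since the abstract does not specify them: whitespace tokenization, ranking over the union vocabulary of all documents, a Spearman-style coefficient (Pearson correlation of tie-averaged ranks), and a grouping rule that links any two documents whose correlation exceeds a hypothetical `threshold`. The function names are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one weight vector per document over the shared, sorted vocabulary.

    Assumes every document is non-empty; tokenization is plain whitespace splitting.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([(tf[w] / len(toks)) * math.log(n / df[w]) for w in vocab])
    return vectors


def average_ranks(values):
    """Rank values in descending order (1 = largest); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def correlation_matrix(docs):
    """Pairwise rank correlation between documents (assumed Spearman-style)."""
    ranks = [average_ranks(v) for v in tfidf_weights(docs)]
    return [[pearson(a, b) for b in ranks] for a in ranks]


def cluster_by_threshold(corr, threshold):
    """Group documents linked by correlation > threshold (connected components).

    This grouping rule is an assumption; the number of clusters is not fixed in advance.
    """
    parent = list(range(len(corr)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(corr)):
        for j in range(i + 1, len(corr)):
            if corr[i][j] > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(corr)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


docs = [
    "cat dog cat pet",
    "dog cat pet pet",
    "stock market price stock",
    "market price stock trade",
]
corr = correlation_matrix(docs)
clusters = cluster_by_threshold(corr, threshold=0.3)
print(clusters)
```

In this toy corpus the two pet-related documents and the two finance-related documents end up in separate groups. Note that ranking over the union vocabulary lets shared absent words raise the correlation, so the threshold would need tuning on real data.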