{"title":"An ensemble approach for text document clustering using Wikipedia concepts","authors":"Seyednaser Nourashrafeddin, E. Milios, D. Arnold","doi":"10.1145/2644866.2644868","DOIUrl":null,"url":null,"abstract":"Most text clustering algorithms represent a corpus as a document-term matrix in the bag of words model. The feature values are computed based on term frequencies in documents and no semantic relatedness between terms is considered. Therefore, two semantically similar documents may sit in different clusters if they do not share any terms. One solution to this problem is to enrich the document representation using an external resource like Wikipedia. We propose a new way to integrate Wikipedia concepts in partitional text document clustering in this work. A text corpus is first represented as a document-term matrix and a document-concept matrix. Terms that exist in the corpus are then clustered based on the document-term representation. Given the term clusters, we propose two methods, one based on the document-term representation and the other one based on the document-concept representation, to find two sets of seed documents. The two sets are then used in our text clustering algorithm in an ensemble approach to cluster documents. The experimental results show that even though the document-concept representations do not result in good document clusters per se, integrating them in our ensemble approach improves the quality of document clusters significantly.","PeriodicalId":91385,"journal":{"name":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","volume":"32 1","pages":"107-116"},"PeriodicalIF":0.0000,"publicationDate":"2014-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Symposium on Document Engineering. ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2644866.2644868","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 17
Abstract
Most text clustering algorithms represent a corpus as a document-term matrix in the bag of words model. The feature values are computed based on term frequencies in documents and no semantic relatedness between terms is considered. Therefore, two semantically similar documents may sit in different clusters if they do not share any terms. One solution to this problem is to enrich the document representation using an external resource like Wikipedia. We propose a new way to integrate Wikipedia concepts in partitional text document clustering in this work. A text corpus is first represented as a document-term matrix and a document-concept matrix. Terms that exist in the corpus are then clustered based on the document-term representation. Given the term clusters, we propose two methods, one based on the document-term representation and the other one based on the document-concept representation, to find two sets of seed documents. The two sets are then used in our text clustering algorithm in an ensemble approach to cluster documents. The experimental results show that even though the document-concept representations do not result in good document clusters per se, integrating them in our ensemble approach improves the quality of document clusters significantly.