Muhammad Adnan, M. Rafi, Muhammad Rafi Muhammad Rafi
Journal of Independent Studies and Research Computing. DOI: https://doi.org/10.31645/2014.12.1.8
Document clustering with explicit semantic analysis (ESA)
Document clustering has recently become a vital technique as the number of documents on the web and in proprietary repositories grows at an unprecedented rate. Documents written in human language generally carry some context, and the usage of words depends mainly on that context. Recently, researchers have tried to enrich document representations with an external knowledge base, which can bring contextual information into the clustering process. We propose an enrichment process based on explicit semantic analysis (ESA), using Wikipedia as the knowledge base. Our approach is distinct in that it uses only the conceptual words from a document, together with their frequencies, to embed the contextual information. Hence, our approach does not over-enrich the documents. A vector-based representation, with cosine similarity and agglomerative hierarchical clustering, is used to perform the actual document clustering. We compare our proposed method with existing relevant approaches on the NEWS20 dataset, using clustering evaluation measures such as F-Score, Entropy, and Purity.
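The clustering and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: plain term-frequency vectors stand in for the ESA-enriched representation, average linkage is an assumed merge criterion (the abstract does not name one), and the four toy documents, their labels, and all function names are hypothetical. F-Score is omitted for brevity.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector of a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def agglomerative(vectors, k):
    """Repeatedly merge the two most similar clusters until k remain.

    Assumption: average linkage, i.e. cluster similarity is the mean
    pairwise cosine similarity between their members.
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        x, y = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def purity(clusters, labels):
    """Fraction of documents in the majority class of their cluster."""
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(labels[i] for i in c).values()) for c in clusters) / n

def entropy(clusters, labels):
    """Size-weighted mean of per-cluster class entropy (base 2); lower is better."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        h = -sum((m / len(c)) * math.log2(m / len(c)) for m in counts.values())
        total += (len(c) / n) * h
    return total

# Hypothetical toy corpus with known classes, used only for illustration.
docs = [
    "hockey team won hockey game",
    "hockey players skate on ice",
    "rocket reached orbit in space",
    "nasa launched a space rocket",
]
labels = ["sport", "sport", "space", "space"]

clusters = agglomerative([tf_vector(d) for d in docs], k=2)
print(sorted(sorted(c) for c in clusters))
print(purity(clusters, labels), entropy(clusters, labels))
```

On this toy corpus the two hockey documents and the two space documents end up in separate clusters. In the setting the abstract describes, the vectors passed to `agglomerative` would instead be built from the documents' conceptual words and their frequencies after enrichment against Wikipedia.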