Clustering textual documents by extracting sequence from word-of-graph
Authors: M. M. Fazal, M. Rafi
Journal: Journal of Independent Studies and Research Computing
DOI: https://doi.org/10.31645/2014.12.1.5
Document clustering is an unsupervised machine learning technique that organizes a large collection of documents into smaller, topically homogeneous, meaningful sub-collections (clusters). Traditional document clustering approaches use features extracted from the documents, such as words (terms), phrases, sequences, and topics, as descriptors for the clustering process. These features do not consider the relationships among the words that convey contextual information within a document. The graph-of-word approach, recently introduced in information retrieval research, addresses this term-independence assumption by building a graph from the words that appear in a document, so that the relationships among words are captured in the representation. It is an unweighted directed graph whose vertices represent unique terms and whose edges represent co-occurrences between terms. The representation is simplified by using a sliding window of size 3 over the text of the document. This paper uses a sequence-based representation of a document that is extracted from its graph-of-word. A similarity measure is defined over the sequences common to two documents. The proposed approach is implemented and tested on standard text mining datasets. A series of experiments reveals that the proposed approach outperforms traditional approaches on clustering measures such as Purity, Entropy, and F-Score.
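The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how sequences are extracted from the graph or how the similarity measure is defined, so several choices here are assumptions rather than the authors' method: a window of size 3 is taken to link each term to the next two terms, a "sequence" is taken to be a simple directed path with a fixed number of vertices, and similarity is taken to be the Jaccard overlap of the two sequence sets. The function names `graph_of_word`, `extract_sequences`, and `sequence_similarity` are hypothetical.

```python
from collections import defaultdict


def graph_of_word(tokens, window=3):
    """Build an unweighted directed graph as {term: set of successor terms}.

    Assumption: a window of size 3 links each term to the next
    (window - 1) terms that follow it in the text.
    """
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        graph[term]  # touching the key registers the term as a vertex
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != term:  # skip self-loops
                graph[term].add(tokens[j])
    return dict(graph)


def extract_sequences(graph, length=3):
    """Collect simple directed paths with `length` vertices.

    Assumption: this stands in for the paper's sequence extraction,
    which the abstract does not describe.
    """
    sequences = set()

    def walk(path):
        if len(path) == length:
            sequences.add(tuple(path))
            return
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:  # keep paths simple (no repeated terms)
                walk(path + [nxt])

    for start in graph:
        walk([start])
    return sequences


def sequence_similarity(seqs_a, seqs_b):
    """Jaccard overlap of two sequence sets (an assumed similarity measure)."""
    if not seqs_a and not seqs_b:
        return 0.0
    return len(seqs_a & seqs_b) / len(seqs_a | seqs_b)


if __name__ == "__main__":
    doc_a = "document clustering groups similar documents".split()
    doc_b = "document clustering groups related documents".split()
    seqs_a = extract_sequences(graph_of_word(doc_a))
    seqs_b = extract_sequences(graph_of_word(doc_b))
    print(sequence_similarity(seqs_a, seqs_b))
```

The resulting pairwise similarities could then be passed to any clustering algorithm that accepts a similarity or distance matrix. Enumerating all paths grows quickly with `length`, so short sequences are the practical choice in this sketch.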