Authors: Saeedeh Davoudi, S. Mirzaei
Published in: 2021 26th International Computer Conference, Computer Society of Iran (CSICC), 2021-03-03
DOI: 10.1109/CSICC52343.2021.9420602 (https://doi.org/10.1109/CSICC52343.2021.9420602)
A Semantic-based Feature Extraction Method Using Categorical Clustering for Persian Document Classification
Natural Language Processing (NLP) is one of the promising fields of artificial intelligence. In recent decades, a high volume of text data has been generated through the Internet. This data is a valuable source of information that can be used in fields such as information retrieval, search engines, and recommender systems. One practical task of text mining is document classification. In this paper, we focus on Persian document classification. We introduce a new feature extraction approach that combines K-means clustering and Word2Vec to acquire semantically relevant and discriminant word representations. We call the proposed approach CC-Word2Vec (Categorical Clustering-Word2Vec) because we retrain the Word2Vec model using the word clusters of each category obtained by the K-means algorithm. We use 200 documents from the 5 most frequent categories of the Hamshahri news dataset to evaluate our method. We pass the extracted word vectors to Multi-Layer Perceptron (MLP) and Gradient Boosting (GB) classifiers to compare the performance of the proposed approach with the Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec methods. The new approach improved the accuracy of both the Gradient Boosting and Multi-Layer Perceptron models in comparison with TF-IDF and Word2Vec.
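The categorical clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the word clusters are fed back into Word2Vec, which number of clusters is used, or how words are initially embedded, so everything below is an assumption made for illustration: the function names `kmeans` and `cluster_words_by_category` are hypothetical, the 2-D vectors are made-up toy values standing in for real Word2Vec embeddings, and English placeholder words are used in place of Persian tokens. The sketch only groups the words of each category with a plain K-means and stops there; the Word2Vec retraining step is deliberately left out.

```python
import math
from collections import defaultdict


def kmeans(points, k, max_iters=50):
    """Plain Lloyd's K-means. Centroids start at the first k points,
    so the result is deterministic (an illustrative choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_words_by_category(category_vectors, k):
    """For each category, group its words into at most k clusters."""
    clusters = {}
    for category, word_vecs in category_vectors.items():
        words = sorted(word_vecs)  # fixed order keeps the run reproducible
        labels = kmeans([word_vecs[w] for w in words], k)
        groups = defaultdict(list)
        for word, label in zip(words, labels):
            groups[label].append(word)
        clusters[category] = [groups[c] for c in range(k) if groups[c]]
    return clusters


# Toy stand-ins for per-category word embeddings (hypothetical values).
toy_vectors = {
    "sports": {
        "goal": (0.9, 0.1), "match": (1.0, 0.2),
        "coach": (0.1, 0.9), "team": (0.2, 1.0),
    },
    "economy": {
        "bank": (0.8, 0.2), "loan": (0.9, 0.3),
        "oil": (0.2, 0.8), "export": (0.1, 0.7),
    },
}

clusters = cluster_words_by_category(toy_vectors, k=2)
for category, groups in clusters.items():
    print(category, groups)
```

In a fuller pipeline, one plausible (but unconfirmed) next step would be to treat each word group as extra training context when retraining the embedding model, then average the resulting word vectors per document before passing them to the MLP or Gradient Boosting classifier.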