Research on text categorization model based on LDA — KNN
Authors: Weihua Chen, Xian Zhang
Published in: 2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC)
Publication date: 2017-03-25
DOI: 10.1109/IAEAC.2017.8054520 (https://doi.org/10.1109/IAEAC.2017.8054520)
Citations: 9
Abstract
Text classification requires computing the similarity between texts, but existing classification methods consider only the similarity between feature words and categories and ignore the semantic similarity among feature words. This paper proposes a new classification model, LDA-KNN, which combines Latent Dirichlet Allocation (LDA) with K-Nearest Neighbor (KNN). LDA is used to address the problem of semantic similarity measurement in traditional text categorization: it models the sample space and selects features from it. In the reduced feature space, a KNN classifier is then used to classify the samples. The experiments were run on the Matlab software platform with a data set taken from the Chinese corpus of Fudan University, and the model achieved high classification precision, with an average value of 0.933. The LDA-KNN model is compared with the MI (Mutual Information)-KNN model and the LSI (Latent Semantic Index)-KNN model. The results show that the LDA-KNN model has superior classification performance in automatic text categorization.
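The two-stage idea described in the abstract (LDA to map documents into a low-dimensional topic space, then KNN in that space) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract reports Matlab experiments and gives no algorithmic detail, so everything below is an assumption made for illustration, including the use of Python, the collapsed Gibbs sampler for LDA, the hyperparameters `alpha` and `beta`, cosine similarity as the KNN measure, the function names `fit_lda` and `knn_predict`, and the toy English corpus. For simplicity the sketch fits LDA on labeled and unlabeled documents together instead of inferring topics for unseen text.

```python
import math
import random
from collections import Counter


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative, not the paper's code).

    docs is a list of token lists. Returns one topic distribution per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]             # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                           # total tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed document-topic proportions: the reduced feature space.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def knn_predict(train_vecs, train_labels, query, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: four labeled documents and one unlabeled query.
docs = [
    "match team goal score team".split(),
    "goal player match win".split(),
    "stock market price bank".split(),
    "bank price invest market stock".split(),
    "team player score win".split(),
]
labels = ["sports", "sports", "finance", "finance"]

thetas = fit_lda(docs, n_topics=2)
prediction = knn_predict(thetas[:4], labels, thetas[4], k=3)
print(prediction)
```

The point of the sketch is that KNN never compares raw word counts: two documents with no words in common can still be close if LDA assigns them similar topic proportions, which is the semantic similarity the abstract says word-to-category methods miss.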