A high performance hybrid algorithm for text classification
Authors: Prema Nedungadi, Haripriya Harikumar, M. Ramesh
Venue: The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014)
Published: 2014-05-15
DOI: 10.1109/ICADIWT.2014.6814691 (https://doi.org/10.1109/ICADIWT.2014.6814691)
The high computational complexity of text classification is a significant problem given the growing volume of text data. The k-nearest-neighbor (kNN) algorithm is an effective but computationally expensive classifier. Principal Component Analysis (PCA) is commonly used as a preprocessing phase to reduce dimensionality before kNN. However, even with reduced dimensionality, the algorithm still requires all the vectors in the projected space to perform kNN. We propose a new hybrid algorithm that uses PCA and kNN but performs kNN with a small set of neighbors instead of the complete set of data vectors in the projected space, thus reducing the computational complexity. An added advantage of our method is that it achieves effective classification using a relatively small number of principal components. New text to be classified is projected into the lower-dimensional space, and kNN is performed only with the neighbors along each axis, based on the principle that vectors that are close in the original space are also close in the projected space and along the projected components. Our findings on the standard Reuters benchmark dataset show that the proposed model significantly outperforms kNN and the standard PCA-kNN hybrid algorithms while maintaining similar classification accuracy.
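The idea of restricting kNN to per-axis neighbors can be illustrated with a minimal Python sketch. It is one possible reading of the abstract, not the authors' implementation: the names `fit_pca`, `project`, `AxisIndex`, and `hybrid_knn_predict`, the window parameter `m`, and the choice to take a window of up to `2 * m` training points around the query position on each sorted axis are all assumptions made for illustration.

```python
import numpy as np
from collections import Counter


def fit_pca(X, n_components):
    """Return the mean and the top principal axes of X (rows are documents)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]


def project(X, mean, axes):
    """Project the rows of X onto the principal axes."""
    return (X - mean) @ axes.T


class AxisIndex:
    """Training projections pre-sorted along every principal axis."""

    def __init__(self, Z_train):
        self.Z = Z_train
        # order[:, j] lists the training indices sorted by coordinate j.
        self.order = np.argsort(Z_train, axis=0)
        self.sorted_vals = np.take_along_axis(Z_train, self.order, axis=0)

    def candidates(self, z, m):
        """Union over axes of up to 2*m points around the query position.

        Assumption: "neighbors in each axis" means nearest by a single
        coordinate, found with a binary search on the pre-sorted axis.
        """
        n, d = self.Z.shape
        chosen = set()
        for j in range(d):
            pos = np.searchsorted(self.sorted_vals[:, j], z[j])
            lo, hi = max(0, pos - m), min(n, pos + m)
            chosen.update(self.order[lo:hi, j].tolist())
        return np.array(sorted(chosen))


def hybrid_knn_predict(index, y_train, z, k=3, m=10):
    """Majority vote among the k nearest points of the candidate set only."""
    idx = index.candidates(z, m)
    # Distances are computed for the candidates, not for all training vectors.
    dist = np.linalg.norm(index.Z[idx] - z, axis=1)
    nearest = idx[np.argsort(dist)[:k]]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```

In this sketch a query computes distances to at most `2 * m * d` candidates (for `d` retained components) rather than to every training vector, which is where a saving over plain PCA-kNN would come from. The hypothetical parameter `m` trades candidate-set size against the risk of missing a true neighbor.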