A high performance hybrid algorithm for text classification

The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014). Published: 2014-05-15. DOI: 10.1109/ICADIWT.2014.6814691

Prema Nedungadi, Haripriya Harikumar, M. Ramesh


Citations: 9

Abstract


The high computational complexity of text classification is a significant problem given the growing surge in text data. The k-nearest-neighbor (kNN) algorithm is an effective but computationally expensive classifier. Principal Component Analysis (PCA) has commonly been used as a preprocessing phase to reduce the dimensionality before kNN is applied. However, although the dimensionality is reduced, the algorithm still requires all the vectors in the projected space to perform kNN. We propose a new hybrid algorithm that uses PCA and kNN but performs kNN with a small set of neighbors instead of the complete set of data vectors in the projected space, thus reducing the computational complexity. An added advantage of our method is that it achieves effective classification using a relatively small number of principal components. New text for classification is projected into the lower-dimensional space, and kNN is performed only with the neighbors along each axis, based on the principle that vectors that are close in the original space are also close in the projected space and along the projected components. Our findings with the standard benchmark dataset Reuters show that the proposed model significantly outperforms kNN and the standard PCA-kNN hybrid algorithm in computational efficiency while maintaining similar classification accuracy.
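The abstract does not give implementation details, so the following is only a minimal sketch of the idea it describes, not the authors' implementation. It assumes that "neighbors in each axis" can be approximated by taking, for every principal component, the training points that sit next to the query in sorted order along that component, and then running ordinary kNN on the union of those candidates. The function names and the window parameter `m` are hypothetical, introduced here for illustration.

```python
import numpy as np
from collections import Counter

def fit_pca(X, n_components):
    """Return the mean, the top principal axes, and the projected training data."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]
    return mean, components, (X - mean) @ components.T

def build_axis_index(Z):
    """For each principal axis, store training indices sorted by coordinate."""
    order = np.argsort(Z, axis=0)                      # (n_samples, n_components)
    sorted_vals = np.take_along_axis(Z, order, axis=0)
    return order, sorted_vals

def candidate_neighbors(z, order, sorted_vals, m):
    """Union, over axes, of up to m training points on each side of z in sorted order.

    Assumption: this positional window stands in for 'neighbors in each axis'.
    """
    n = order.shape[0]
    candidates = set()
    for j in range(order.shape[1]):
        pos = np.searchsorted(sorted_vals[:, j], z[j])
        lo, hi = max(0, pos - m), min(n, pos + m)
        candidates.update(order[lo:hi, j].tolist())
    return np.fromiter(candidates, dtype=int)

def classify(x, mean, components, Z, y, order, sorted_vals, k=3, m=10):
    """Project x, gather per-axis candidates, and vote among the k nearest of them."""
    z = (x - mean) @ components.T
    cand = candidate_neighbors(z, order, sorted_vals, m)
    dists = np.linalg.norm(Z[cand] - z, axis=1)        # distances to candidates only
    nearest = cand[np.argsort(dists)[:k]]
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

# Toy usage with synthetic vectors standing in for document features.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.1, (50, 5)), rng.normal(5.0, 0.1, (50, 5))])
y_train = np.array([0] * 50 + [1] * 50)
mean, components, Z = fit_pca(X_train, n_components=2)
order, sorted_vals = build_axis_index(Z)
print(classify(np.full(5, 5.0), mean, components, Z, y_train, order, sorted_vals))
```

Under these assumptions, each query computes distances to at most `2 * m * n_components` candidates rather than to every projected training vector, which is where the saving over a standard PCA-kNN pipeline would come from.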
