A high performance hybrid algorithm for text classification

Prema Nedungadi, Haripriya Harikumar, M. Ramesh
Published in: The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014)
Publication date: 2014-05-15
DOI: 10.1109/ICADIWT.2014.6814691 (https://doi.org/10.1109/ICADIWT.2014.6814691)
Citations: 9

Abstract

The high computational complexity of text classification is a significant problem given the rapid growth of text data. The k-nearest-neighbor (kNN) algorithm is an effective but computationally expensive classifier. Principal Component Analysis (PCA) is commonly used as a preprocessing step to reduce dimensionality before kNN is applied. However, although the dimensionality is reduced, the algorithm still needs all the vectors in the projected space to perform kNN. We propose a new hybrid algorithm that uses PCA and kNN but performs kNN on a small set of neighbors instead of the complete set of data vectors in the projected space, thus reducing the computational complexity. An added advantage of our method is that it achieves effective classification with a relatively small number of principal components. New text to be classified is projected into the lower-dimensional space, and kNN is performed only with the neighbors along each axis, based on the principle that vectors that are close in the original space are also close in the projected space and along the projected components. Our findings on the standard Reuters benchmark dataset show that the proposed model significantly outperforms kNN and the standard PCA-kNN hybrid algorithm in computational performance while maintaining similar classification accuracy.
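The idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the "neighbors in each axis" are chosen, so the rule used here is an assumption and not necessarily the authors' method: training points are sorted along each principal axis, the points within a fixed window of the query's position on each axis are taken, and the union of those points is the kNN candidate set. The names `fit_pca`, `build_axis_index`, `axis_candidates`, `classify`, and the `window` parameter are hypothetical, and the synthetic vectors only stand in for document features; they are not the Reuters data.

```python
import numpy as np
from collections import Counter


def fit_pca(X, n_components):
    """Return (mean, components), with principal directions as columns."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def build_axis_index(Z):
    """Sort the training indices by coordinate, separately for each axis."""
    order = np.argsort(Z, axis=0)                    # (n_samples, n_components)
    sorted_vals = np.take_along_axis(Z, order, axis=0)
    return order, sorted_vals


def axis_candidates(z, order, sorted_vals, window):
    """Assumed rule: up to `window` points on each side of z on every axis."""
    n = order.shape[0]
    cand = set()
    for j in range(order.shape[1]):
        pos = np.searchsorted(sorted_vals[:, j], z[j])
        lo, hi = max(0, pos - window), min(n, pos + window)
        cand.update(order[lo:hi, j].tolist())
    return np.fromiter(cand, dtype=int)


def classify(x, mean, components, Z, y, order, sorted_vals, k=3, window=5):
    """Project x, then run kNN over the candidate set only."""
    z = (x - mean) @ components
    cand = axis_candidates(z, order, sorted_vals, window)
    dists = np.linalg.norm(Z[cand] - z, axis=1)
    nearest = cand[np.argsort(dists)[:k]]
    return Counter(y[nearest].tolist()).most_common(1)[0][0]


# Synthetic stand-in for document vectors: two well-separated classes.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.5, (50, 10)),
                     rng.normal(5.0, 0.5, (50, 10))])
y_train = np.array([0] * 50 + [1] * 50)

mean, components = fit_pca(X_train, n_components=2)
Z_train = (X_train - mean) @ components
order, sorted_vals = build_axis_index(Z_train)

pred_high = classify(np.full(10, 5.0), mean, components,
                     Z_train, y_train, order, sorted_vals)
pred_low = classify(np.zeros(10), mean, components,
                    Z_train, y_train, order, sorted_vals)
print(pred_high, pred_low)
```

With this rule, each query computes distances to at most 2 * window * n_components candidates rather than to every projected training vector, which is the kind of saving the abstract describes. How well such a rule preserves accuracy depends on the data and on the window size.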