Reducing High-Dimensional Data by Principal Component Analysis vs. Random Projection for Nearest Neighbor Classification

Sampath Deegalla, Henrik Boström
{"title":"Reducing High-Dimensional Data by Principal Component Analysis vs. Random Projection for Nearest Neighbor Classification","authors":"Sampath Deegalla, Henrik Boström","doi":"10.1109/ICMLA.2006.43","DOIUrl":null,"url":null,"abstract":"The computational cost of using nearest neighbor classification often prevents the method from being applied in practice when dealing with high-dimensional data, such as images and micro arrays. One possible solution to this problem is to reduce the dimensionality of the data, ideally without loosing predictive performance. Two different dimensionality reduction methods, principle component analysis (PCA) and random projection (RP), are investigated for this purpose and compared w.r.t. the performance of the resulting nearest neighbor classifier on five image data sets and five micro array data sets. The experiment results demonstrate that PCA outperforms RP for all data sets used in this study. However, the experiments also show that PCA is more sensitive to the choice of the number of reduced dimensions. After reaching a peak, the accuracy degrades with the number of dimensions for PCA, while the accuracy for RP increases with the number of dimensions. The experiments also show that the use of PCA and RP may even outperform using the non-reduced feature set (in 9 respectively 6 cases out of 10), hence not only resulting in more efficient, but also more effective, nearest neighbor classification","PeriodicalId":297071,"journal":{"name":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"107","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 5th International Conference on Machine Learning and Applications (ICMLA'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2006.43","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 107

Abstract

The computational cost of nearest neighbor classification often prevents the method from being applied in practice to high-dimensional data, such as images and microarrays. One possible solution to this problem is to reduce the dimensionality of the data, ideally without losing predictive performance. Two different dimensionality reduction methods, principal component analysis (PCA) and random projection (RP), are investigated for this purpose and compared with respect to the performance of the resulting nearest neighbor classifier on five image data sets and five microarray data sets. The experimental results demonstrate that PCA outperforms RP on all data sets used in this study. However, the experiments also show that PCA is more sensitive to the choice of the number of reduced dimensions: accuracy for PCA peaks and then degrades as the number of dimensions grows, while accuracy for RP increases with the number of dimensions. The experiments also show that PCA and RP may even outperform using the non-reduced feature set (in 9 and 6 cases out of 10, respectively), yielding nearest neighbor classification that is not only more efficient but also more effective.
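The comparison described in the abstract can be reproduced in outline with standard library components. Below is a minimal sketch, not the authors' code: it assumes scikit-learn, the built-in digits image data set, a 1-nearest-neighbor classifier, a Gaussian random projection as the RP variant, and an arbitrary target of 20 dimensions; none of these specifics come from the paper itself.

```python
# Minimal sketch: PCA vs. random projection as preprocessing for a
# nearest neighbor classifier. Data set, RP variant, and the choice of
# 20 components are illustrative assumptions, not taken from the paper.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.random_projection import GaussianRandomProjection

X, y = load_digits(return_X_y=True)  # 64-dimensional image data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for name, reducer in [
    ("PCA", PCA(n_components=20)),
    ("RP", GaussianRandomProjection(n_components=20, random_state=0)),
]:
    # Fit the reduction on the training data only, then project both sets.
    Z_train = reducer.fit_transform(X_train)
    Z_test = reducer.transform(X_test)
    knn = KNeighborsClassifier(n_neighbors=1).fit(Z_train, y_train)
    print(name, "accuracy:", knn.score(Z_test, y_test))
```

Sweeping n_components rather than fixing it at 20 would be the natural way to illustrate the paper's central observation: PCA accuracy peaks at a relatively small number of components and then declines, while RP accuracy keeps improving as the number of dimensions grows.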