Study of Various Dimensionality Reduction and Classification Algorithms on High Dimensional Dataset
Smit Shah, S. Joshi
2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA), published 2021-09-02
DOI: https://doi.org/10.1109/ICIRCA51532.2021.9544602
Very large datasets are hard to analyze, and the analysis can become computationally infeasible. Health care, finance, retail, and education are a few of the data mining application areas that involve very high-dimensional data. A large number of dimensions gives rise to the well-known "curse of dimensionality", which makes classification difficult and lowers the accuracy of machine learning classifiers. This paper computes a threshold value (35%) such that reducing the data to this threshold yields the best accuracy. It then takes an image dataset of very high dimensionality and applies several dimensionality reduction techniques (PCA, LDA, and SVD) to find the dimensionality that best fits an image dataset. Several ML classification algorithms (Logistic Regression, Random Forest, Naive Bayes, and SVM) are also applied to find the best classifier for the dimensionally reduced dataset. The comparative study concludes that, of all the combinations tested, PCA+SVM, LDA+Random Forest, and SVD+SVM produced the best results.
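The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code. It assumes scikit-learn, uses the small built-in `load_digits` image set as a stand-in for the paper's unnamed dataset, and reads the 35% threshold as "keep 35% of the original feature count", which is one possible interpretation. The function name `compare_combinations` and all hyperparameters are hypothetical choices made for this example. The accuracies it prints say nothing about the paper's reported results.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_combinations(X, y, fraction=0.35, seed=0):
    """Return test accuracy for every reducer + classifier pair.

    `fraction` is the share of original features to keep
    (assumed reading of the paper's 35% threshold).
    """
    n_features = X.shape[1]
    n_classes = len(np.unique(y))
    k = max(1, int(fraction * n_features))

    reducers = {
        "PCA": PCA(n_components=k),
        # LDA can produce at most (n_classes - 1) components.
        "LDA": LinearDiscriminantAnalysis(n_components=min(k, n_classes - 1)),
        "SVD": TruncatedSVD(n_components=k, random_state=seed),
    }
    classifiers = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "RandomForest": RandomForestClassifier(random_state=seed),
        "NaiveBayes": GaussianNB(),
        "SVM": SVC(),
    }

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    results = {}
    for r_name, reducer in reducers.items():
        for c_name, clf in classifiers.items():
            # Clone so each pipeline fits fresh, independent estimators.
            pipe = make_pipeline(StandardScaler(), clone(reducer), clone(clf))
            pipe.fit(X_train, y_train)
            results[f"{r_name}+{c_name}"] = pipe.score(X_test, y_test)
    return results


X, y = load_digits(return_X_y=True)  # 8x8 images flattened to 64 features
results = compare_combinations(X, y)
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

Fitting the scaler and the reducer inside the pipeline keeps the test split out of the reduction step, so the reported accuracies are not inflated by leakage. Note that LDA is supervised and capped at one fewer component than the number of classes, so it cannot honor a fixed percentage threshold the way PCA and SVD can.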