{"title":"Granular-ball-matrix-based incremental semi-supervised feature selection approach to high-dimensional variation using neighbourhood discernibility degree for ordered partially labelled dataset","authors":"Weihua Xu, Jinlong Li","doi":"10.1007/s10489-024-06134-1","DOIUrl":null,"url":null,"abstract":"<div><p>In numerous real-world applications, data tends to be ordered and partially labelled, predominantly due to the constraints of labeling costs. The current methodologies for managing such data are inadequate, especially when confronted with the challenge of high-dimensional datasets, which often require reprocessing from the start, resulting in significant inefficiencies. To tackle this, we introduce an incremental semi-supervised feature selection algorithm that is grounded in neighborhood discernibility, and incorporates pseudolabel granular balls and matrix updating techniques. This novel approach evaluates the significance of features for both labelled and unlabelled data independently, using the power of neighborhood distinguishability to identify the most optimal subset of features. In a bid to enhance computational efficiency, especially with large datasets, we adopt a pseudolabel granular balls technique, which effectively segments the dataset into more manageable samples prior to feature selection. For high-dimensional data, we employ matrices to store neighborhood information, with distance functions and matrix structures that are tailored for both low and high-dimensional contexts. Furthermore, we present an innovative matrix updating method designed to accommodate fluctuations in the number of features. Our experimental results conducted across 12 datasets-including 4 with over 2000 features-demonstrate that our algorithm not only outperforms existing methods in handling large samples and high-dimensional datasets but also achieves an average time reduction of over six fold compared to similar semi-supervised algorithms. Moreover, we observe an average improvement in accuracy of 1.4%, 0.6%, and 0.2% per dataset for SVM, KNN, and Random Forest classifiers, respectively, when compared to the best-performing algorithm among the compared algorithms.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 4","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2025-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-024-06134-1","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
In numerous real-world applications, data tends to be ordered and partially labelled, predominantly due to the constraints of labeling costs. The current methodologies for managing such data are inadequate, especially when confronted with the challenge of high-dimensional datasets, which often require reprocessing from the start, resulting in significant inefficiencies. To tackle this, we introduce an incremental semi-supervised feature selection algorithm that is grounded in neighborhood discernibility, and incorporates pseudolabel granular balls and matrix updating techniques. This novel approach evaluates the significance of features for both labelled and unlabelled data independently, using the power of neighborhood distinguishability to identify the most optimal subset of features. In a bid to enhance computational efficiency, especially with large datasets, we adopt a pseudolabel granular balls technique, which effectively segments the dataset into more manageable samples prior to feature selection. For high-dimensional data, we employ matrices to store neighborhood information, with distance functions and matrix structures that are tailored for both low and high-dimensional contexts. Furthermore, we present an innovative matrix updating method designed to accommodate fluctuations in the number of features. Our experimental results conducted across 12 datasets-including 4 with over 2000 features-demonstrate that our algorithm not only outperforms existing methods in handling large samples and high-dimensional datasets but also achieves an average time reduction of over six fold compared to similar semi-supervised algorithms. Moreover, we observe an average improvement in accuracy of 1.4%, 0.6%, and 0.2% per dataset for SVM, KNN, and Random Forest classifiers, respectively, when compared to the best-performing algorithm among the compared algorithms.
期刊介绍:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.