Genetic Programming for Feature Selection Based on Feature Removal Impact in High-Dimensional Symbolic Regression
Authors: Baligh Al-Helali; Qi Chen; Bing Xue; Mengjie Zhang
Journal: IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 8, no. 3, pp. 2269-2282
Published: 2024-03-11
DOI: 10.1109/TETCI.2024.3369407
URL: https://ieeexplore.ieee.org/document/10466603/
Impact factor: 5.3; JCR: Q1 (Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Genetic Programming for Feature Selection Based on Feature Removal Impact in High-Dimensional Symbolic Regression
Symbolic regression (SR) is increasingly important for discovering mathematical models for prediction tasks. It searches for the arithmetic expression that best represents a target variable in terms of a set of input features. As the number of features increases, however, the search process becomes more complex. To address high-dimensional SR, this work proposes a genetic programming method for feature selection that scores each feature by the impact its removal has on the performance of SR models. Unlike existing Shapley value methods, which simulate feature absence at the data level, the proposed approach removes features at the model level. This avoids producing unrealistic data instances, a major limitation of Shapley value and permutation-based methods. After the feature importances are calculated, a cut-off strategy selects the important features: it injects a number of random features and uses their importance to set a threshold automatically. Experimental results on artificial and real-world high-dimensional data sets show that, compared with state-of-the-art feature selection methods based on permutation importance and the Shapley value, the proposed method not only improves SR accuracy but also selects smaller feature sets.
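The two ideas in the abstract, removing a feature from the model rather than from the data and deriving a cut-off from injected random features, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the hand-written `MODEL` stands in for an evolved SR expression, "removal" is approximated by dropping every additive term that reads the feature, importance is the resulting increase in mean squared error, and the threshold is taken as the largest importance among the random features.

```python
import random

REAL = ["x0", "x1", "x2"]   # original features; x2 does not affect the target
RANDOM = ["r0", "r1"]       # injected random features (hypothetical names)

# Hypothetical stand-in for an evolved SR model: a sum of terms,
# each tagged with the set of features it reads.
MODEL = [
    ({"x0"}, lambda row: 3.0 * row["x0"]),
    ({"x1"}, lambda row: row["x1"] ** 2),
    ({"x2"}, lambda row: 0.1 * row["x2"]),  # spurious term on an irrelevant feature
    ({"r0"}, lambda row: 0.1 * row["r0"]),  # spurious term on a random feature
]


def make_data(n=500, seed=0):
    """Synthetic data: the target depends only on x0 and x1."""
    rng = random.Random(seed)
    rows = [{name: rng.uniform(-1.0, 1.0) for name in REAL + RANDOM}
            for _ in range(n)]
    targets = [3.0 * row["x0"] + row["x1"] ** 2 for row in rows]
    return rows, targets


def mse(model, rows, targets):
    """Mean squared error of the summed terms against the targets."""
    errors = [(sum(f(row) for _, f in model) - y) ** 2
              for row, y in zip(rows, targets)]
    return sum(errors) / len(errors)


def remove_feature(model, feature):
    """Model-level removal: drop every term that reads `feature`.

    The data rows are left untouched, so no artificial instances are created.
    """
    return [(used, f) for used, f in model if feature not in used]


def importances(model, rows, targets):
    """Importance of a feature = increase in MSE after removing it from the model."""
    base = mse(model, rows, targets)
    return {name: mse(remove_feature(model, name), rows, targets) - base
            for name in REAL + RANDOM}


def select_features(scores):
    """Keep real features that score above the best random feature (assumed rule)."""
    threshold = max(scores[name] for name in RANDOM)
    return [name for name in REAL if scores[name] > threshold]


rows, targets = make_data()
scores = importances(MODEL, rows, targets)
selected = select_features(scores)
print(selected)
```

In this toy setting the random feature `r1` never appears in the model, so removing it changes nothing and its importance is exactly zero; any real feature whose removal does not increase the error falls at or below that threshold and is discarded. How the paper actually removes features from a GP tree and aggregates the random-feature importances into a threshold is not stated in the abstract.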
About the journal:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication and publishes six issues per year.
Authors are encouraged to submit manuscripts on any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. Illustrative examples include glial cell networks, computational neuroscience, brain-computer interfaces, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, and computational intelligence for the IoT and Smart-X technologies.