Genetic Programming-based Feature Selection for Symbolic Regression on Incomplete Data
Authors: Baligh Al-Helali, Qi Chen, Bing Xue, Mengjie Zhang
Journal: Evolutionary Computation, pages 1-27 (Journal Article; JCR Q2, Computer Science, Artificial Intelligence; impact factor 4.6)
Published: 2024-11-21
DOI: https://doi.org/10.1162/evco_a_00362
High dimensionality is a serious real-world data challenge in symbolic regression, and it is even more challenging when the data are incomplete. Genetic programming has been successfully used for high-dimensional tasks because of its natural feature selection ability, but it is not directly applicable to incomplete data. Commonly, the missing values are imputed first, and genetic programming is then performed on the imputed complete data. However, when many irrelevant features are incomplete, it is intuitively unnecessary to perform costly imputation on such features. For this purpose, this work proposes a genetic programming-based approach that selects features directly from incomplete high-dimensional data to improve symbolic regression performance. We extend the concept of identity/neutral elements from mathematics to the function operators of genetic programming so that they can handle the missing values in incomplete data. Experiments have been conducted on a number of data sets, considering different missingness ratios in high-dimensional symbolic regression tasks. The results show that the proposed method leads to better symbolic regression results than state-of-the-art methods that can select features directly from incomplete data. Further results show that our approach not only leads to better symbolic regression accuracy but also selects a smaller number of relevant features, and consequently improves both the effectiveness and the efficiency of the learning process.
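The abstract does not give the operator definitions, so the following is only a minimal sketch of how the identity/neutral-element idea might look in practice. It assumes that missing values are encoded as NaN, that a missing operand of addition is treated as 0 and a missing operand of multiplication as 1, and that a feature counts as "selected" when it appears in an evolved expression. All names here (`neutral_add`, `neutral_mul`, `evaluate`, `selected_features`) and the tuple-based tree encoding are hypothetical and not taken from the paper.

```python
import math

MISSING = float("nan")  # assumption: missing values are encoded as NaN


def is_missing(x):
    """Return True if x is the NaN placeholder for a missing value."""
    return math.isnan(x)


def neutral_add(a, b):
    """Addition where a missing operand acts as the identity element 0."""
    a = 0.0 if is_missing(a) else a
    b = 0.0 if is_missing(b) else b
    return a + b


def neutral_mul(a, b):
    """Multiplication where a missing operand acts as the identity element 1."""
    a = 1.0 if is_missing(a) else a
    b = 1.0 if is_missing(b) else b
    return a * b


OPS = {"add": neutral_add, "mul": neutral_mul}


def evaluate(tree, row):
    """Evaluate an expression tree on one data row (a dict of feature values).

    A tree is either a feature name (str), a numeric constant, or a tuple
    (operator_name, left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return row[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return OPS[op](evaluate(left, row), evaluate(right, row))


def selected_features(tree):
    """Collect the feature names that appear as terminals in the tree."""
    if isinstance(tree, str):
        return {tree}
    if isinstance(tree, (int, float)):
        return set()
    _, left, right = tree
    return selected_features(left) | selected_features(right)


# Hypothetical evolved model: x0 * x1 + x2, evaluated on a row where x1 is missing.
model = ("add", ("mul", "x0", "x1"), "x2")
row = {"x0": 2.0, "x1": MISSING, "x2": 3.0, "x3": MISSING}
prediction = evaluate(model, row)
features = selected_features(model)
```

In this sketch the incomplete feature `x3` never appears in the model, so it is never imputed or even read, which mirrors the motivation of avoiding costly imputation on irrelevant features. How the actual method treats operators without a two-sided identity (for example subtraction or division), or operations where both operands are missing, is not stated in the abstract.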
About the journal:
Evolutionary Computation is a leading journal in its field. It provides an international forum for facilitating and enhancing the exchange of information among researchers involved in both the theoretical and practical aspects of computational systems drawing their inspiration from nature, with particular emphasis on evolutionary models of computation such as genetic algorithms, evolutionary strategies, classifier systems, evolutionary programming, and genetic programming. It welcomes articles from related fields such as swarm intelligence (e.g. Ant Colony Optimization and Particle Swarm Optimization), and other nature-inspired computation paradigms (e.g. Artificial Immune Systems). As well as publishing articles describing theoretical and/or experimental work, the journal also welcomes application-focused papers describing breakthrough results in an application domain or methodological papers where the specificities of the real-world problem led to significant algorithmic improvements that could possibly be generalized to other areas.