Hossein Nematzadeh, Joseph Mani, Zahra Nematzadeh, Ebrahim Akbari, Radziah Mohamad
arXiv - CS - Neural and Evolutionary Computing, published 2024-07-22. DOI: arxiv-2407.15611 (https://doi.org/arxiv-2407.15611)
Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets
Feature selection poses a challenge in small-sample high-dimensional
datasets, where the number of features exceeds the number of observations, as
seen in microarray, gene expression, and medical datasets. No feature selection
method is universally optimal for every data distribution, so new methods
continue to be proposed. One recent approach is termed frequency-based
feature selection. However, existing methods in this domain tend to overlook
feature values, focusing solely on the distribution in the response variable.
In response, this paper introduces the Distance-based Mutual Congestion (DMC)
as a filter method that considers both the feature values and the distribution
of observations in the response variable. DMC ranks the features of a dataset;
the top 5% are retained and clustered by KMeans, and one feature is selected at
random from each cluster to mitigate multicollinearity. The selected features
form the feature space, from which the search space for the Genetic Algorithm
with Adaptive Rates (GAwAR) is approximated. GAwAR approximates the
combination of the top 10 features
that maximizes prediction accuracy within a wrapper scheme. To prevent
premature convergence, GAwAR adaptively updates the crossover and mutation
rates. The hybrid DMC-GAwAR is applicable to binary classification datasets,
and experimental results show that it outperforms some recent methods.
The implementation and corresponding data are available at
https://github.com/hnematzadeh/DMC-GAwAR
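
The filter stage described above (keep the top 5% of ranked features, cluster them, draw one feature per cluster) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the DMC formula, so `scores` is a hypothetical stand-in for per-feature DMC values, and "higher is better" is an assumption. The small Lloyd-style `kmeans` helper stands in for a library KMeans so the example needs only the standard library; the function names and the cluster count `k` are likewise illustrative.

```python
import random

def top_fraction(scores, fraction=0.05):
    """Indices of the highest-scoring features (assumes higher is better)."""
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]

def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])))

def kmeans(points, k, rng, iterations=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def select_features(X, scores, k=10, fraction=0.05, seed=0):
    """Keep the top fraction of features, cluster them, pick one per cluster.

    X is a list of observations (rows); scores holds one value per feature.
    """
    rng = random.Random(seed)
    kept = top_fraction(scores, fraction)
    # Each retained feature is represented by its column of values.
    columns = [[row[j] for row in X] for j in kept]
    labels = kmeans(columns, min(k, len(kept)), rng)
    clusters = {}
    for j, label in zip(kept, labels):
        clusters.setdefault(label, []).append(j)
    # One randomly chosen representative per non-empty cluster.
    return sorted(rng.choice(members) for members in clusters.values())
```

Because features in the same cluster have similar value profiles across observations, keeping a single representative from each cluster is one simple way to reduce redundancy among the retained features.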
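
The abstract states that GAwAR adapts its crossover and mutation rates to avoid premature convergence, but it does not give the update rule. The sketch below shows one possible rule, purely as an assumption: when the best fitness stalls, the mutation rate rises and the crossover rate falls, and the reverse happens after an improvement. The names `adapt_rates` and `run_ga`, the rate bounds, the step size, and the fixed subset size are all hypothetical; `fitness` is a placeholder for the wrapper's prediction accuracy.

```python
import random

def adapt_rates(crossover, mutation, improved, step=0.05):
    """Hypothetical rule: explore more when the best fitness stalls."""
    if improved:
        crossover = min(0.95, crossover + step)
        mutation = max(0.01, mutation - step)
    else:
        crossover = max(0.50, crossover - step)
        mutation = min(0.50, mutation + step)
    return crossover, mutation

def run_ga(features, fitness, size=10, pop_size=20, generations=30, seed=0):
    """Search for a feature subset of the given size that maximizes fitness."""
    rng = random.Random(seed)
    size = min(size, len(features))
    population = [frozenset(rng.sample(features, size)) for _ in range(pop_size)]
    crossover, mutation = 0.9, 0.05
    best = max(population, key=fitness)
    best_fit = fitness(best)
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            # Binary tournament selection for each parent.
            a = max(rng.sample(population, 2), key=fitness)
            b = max(rng.sample(population, 2), key=fitness)
            # Crossover: draw the child from the union of both parents.
            pool = sorted(a | b) if rng.random() < crossover else sorted(a)
            child = set(rng.sample(pool, size))
            # Mutation: swap one selected feature for an unselected one.
            if rng.random() < mutation:
                outside = [f for f in features if f not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(frozenset(child))
        population = children
        gen_best = max(population, key=fitness)
        improved = fitness(gen_best) > best_fit
        if improved:
            best, best_fit = gen_best, fitness(gen_best)
        crossover, mutation = adapt_rates(crossover, mutation, improved)
    return sorted(best), best_fit
```

In a wrapper scheme, `fitness` would train and evaluate a classifier on the columns named by each subset; any callable that maps a set of feature indices to a number can be passed in here.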