Distance-based mutual congestion feature selection with genetic algorithm for high-dimensional medical datasets

Hossein Nematzadeh, Joseph Mani, Zahra Nematzadeh, Ebrahim Akbari, Radziah Mohamad
Source: arXiv - CS - Neural and Evolutionary Computing. Published: 2024-07-22. DOI: arxiv-2407.15611
Citations: 0

Abstract

Feature selection is challenging in small-sample, high-dimensional datasets, where the number of features exceeds the number of observations, as in microarray, gene expression, and medical datasets. No feature selection method is universally optimal for every data distribution, so the literature continues to address this issue. One recent approach is frequency-based feature selection. However, existing frequency-based methods tend to overlook feature values and focus solely on the distribution of observations in the response variable. In response, this paper introduces Distance-based Mutual Congestion (DMC), a filter method that considers both the feature values and the distribution of observations in the response variable. DMC ranks the features of a dataset; the top 5% are retained and clustered with KMeans, and one feature is selected at random from each cluster to mitigate multicollinearity. The selected features form the feature space, which is used to approximate the search space for the Genetic Algorithm with Adaptive Rates (GAwAR). Within a wrapper scheme, GAwAR approximates the combination of the top 10 features that maximizes prediction accuracy. To prevent premature convergence, GAwAR adaptively updates its crossover and mutation rates. The hybrid DMC-GAwAR is applicable to binary classification datasets, and experimental results demonstrate its superiority over some recent works. The implementation and corresponding data are available at https://github.com/hnematzadeh/DMC-GAwAR
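
The filter stage described above (rank features, keep the top 5%, cluster them, draw one feature at random per cluster) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the DMC score, so the sketch takes precomputed scores as input and assumes higher is better; the function names, the cluster count, and the seed are hypothetical; and a small hand-written Lloyd's k-means stands in for whichever KMeans implementation the repository uses.

```python
import math
import random


def top_fraction(scores, fraction=0.05):
    """Indices of the highest-scoring features (at least one is kept)."""
    keep = max(1, math.ceil(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:keep]


def kmeans(points, k, rng, iterations=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def reduce_features(columns, scores, n_clusters, fraction=0.05, seed=0):
    """Keep the top-scoring fraction, cluster it, pick one feature per cluster.

    columns[j] holds the values of feature j across all observations;
    scores[j] is a precomputed filter score (assumed: higher is better).
    """
    rng = random.Random(seed)
    kept = top_fraction(scores, fraction)
    k = min(n_clusters, len(kept))
    labels = kmeans([columns[j] for j in kept], k, rng)
    chosen = []
    for c in range(k):
        members = [j for j, lab in zip(kept, labels) if lab == c]
        if members:  # a cluster can end up empty
            chosen.append(rng.choice(members))
    return sorted(chosen)


# Toy data: 100 features measured on 6 observations, with made-up scores.
columns = [[float((j * 7 + i * 3) % 11) for i in range(6)] for j in range(100)]
scores = [float(j) for j in range(100)]
print(reduce_features(columns, scores, n_clusters=3))
```

The returned indices would then define the reduced feature space that a wrapper search such as GAwAR explores. Clustering the feature columns, rather than the observations, is what groups correlated features together so that only one representative per group survives.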