Cross-validated permutation feature importance considering correlation between features

Analytical Science Advances | IF 3.0 | JCR Q2 (Chemistry, Analytical) | Pub Date: 2022-09-07 | DOI: 10.1002/ansa.202200018
Hiromasa Kaneko
Citations: 13

Abstract

In molecular design, material design, process design, and process control, it is important not only to construct a model with high predictive ability between explanatory features x and objective features y from a dataset, but also to interpret the constructed model. One index of feature importance in x is permutation feature importance (PFI), which can be combined with any regressor or classifier. However, PFI becomes unstable when the number of samples is low, because calculating it requires dividing the dataset into training and validation data. Additionally, when x contains strongly correlated features, the PFI of these features is estimated to be low. Hence, a cross-validated PFI (CVPFI) method is proposed. CVPFI can be calculated stably, even with a small number of samples, because model construction and feature evaluation are repeated based on cross-validation. Furthermore, by considering the absolute correlation coefficients between the features, the feature importance can be evaluated appropriately even when there are strongly correlated features in x. Case studies using numerical simulation data and actual compound data showed that CVPFI evaluates feature importance more appropriately than PFI. This holds when the number of samples is low, when linear and nonlinear relationships between x and y are mixed, when there are strong correlations between features in x, and when quantised and biased features exist in x. Python code for CVPFI is available at https://github.com/hkaneko1985/dcekit.
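The two ideas in the abstract, repeating the permutation test inside cross-validation and taking absolute correlation coefficients into account, can be illustrated with a minimal sketch. This is not the dcekit implementation and does not claim to reproduce the paper's exact procedure. The function name `cv_permutation_importance`, the use of scikit-learn, the R2 score as the metric, and in particular the rule "when feature i is shuffled, also shuffle feature j in a random fraction |r_ij| of the validation rows" are all assumptions made here for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def cv_permutation_importance(model, X, y, n_splits=5, n_repeats=5, seed=0):
    """Hypothetical sketch: mean drop in validation R2 over CV folds and repeats."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]

    # Absolute correlation coefficients between features (assumed weighting).
    abs_corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(abs_corr, 1.0)  # feature i itself is always fully shuffled

    importances = np.zeros(n_features)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kfold.split(X):
        # Refit a fresh copy of the model in every fold.
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        X_valid, y_valid = X[valid_idx], y[valid_idx]
        baseline = r2_score(y_valid, fitted.predict(X_valid))

        for i in range(n_features):
            for _ in range(n_repeats):
                X_perm = X_valid.copy()
                for j in range(n_features):
                    # Assumption: shuffle feature j in a fraction |r_ij| of rows.
                    mask = rng.random(len(valid_idx)) < abs_corr[i, j]
                    rows = np.flatnonzero(mask)
                    X_perm[rows, j] = X_valid[rng.permutation(rows), j]
                permuted = r2_score(y_valid, fitted.predict(X_perm))
                importances[i] += baseline - permuted

    return importances / (n_splits * n_repeats)


# Synthetic demo data (illustrative only, not from the paper).
rng = np.random.default_rng(42)
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)               # unrelated to y
x2 = x0 + 0.05 * rng.normal(size=200)   # near-duplicate of x0
X_demo = np.column_stack([x0, x1, x2])
y_demo = 3.0 * x0 + 0.1 * rng.normal(size=200)

scores = cv_permutation_importance(LinearRegression(), X_demo, y_demo)
print(scores)
```

In this toy setting the near-duplicate feature x2 receives an importance comparable to x0, because shuffling x2 also disturbs x0 in most rows, whereas a plain permutation test on a single train/validation split would tend to credit only one of the pair. For the plain PFI baseline, scikit-learn provides `sklearn.inspection.permutation_importance`.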

Source journal: CiteScore 4.60; self-citation rate 0.00%; articles published 0