基于偏差-方差分析的可解释线性降维

IF 2.8 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Data Mining and Knowledge Discovery Pub Date : 2024-03-25 DOI:10.1007/s10618-024-01015-0

{"title":"基于偏差-方差分析的可解释线性降维","authors":"","doi":"10.1007/s10618-024-01015-0","DOIUrl":null,"url":null,"abstract":"<h3>Abstract</h3> <p>One of the central issues of several machine learning applications on real data is the choice of the input features. Ideally, the designer should select a small number of the relevant, nonredundant features to preserve the complete information contained in the original dataset, with little collinearity among features. This procedure helps mitigate problems like overfitting and the curse of dimensionality, which arise when dealing with high-dimensional problems. On the other hand, it is not desirable to simply discard some features, since they may still contain information that can be exploited to improve results. Instead, <em>dimensionality reduction</em> techniques are designed to limit the number of features in a dataset by projecting them into a lower dimensional space, possibly considering all the original features. However, the projected features resulting from the application of dimensionality reduction techniques are usually difficult to interpret. In this paper, we seek to design a principled dimensionality reduction approach that maintains the interpretability of the resulting features. Specifically, we propose a bias-variance analysis for linear models and we leverage these theoretical results to design an algorithm, <em>Linear Correlated Features Aggregation</em> (LinCFA), which aggregates groups of continuous features with their average if their correlation is “sufficiently large”. In this way, all features are considered, the dimensionality is reduced and the interpretability is preserved. Finally, we provide numerical validations of the proposed algorithm both on synthetic datasets to confirm the theoretical results and on real datasets to show some promising applications.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"86 1","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Interpretable linear dimensionality reduction based on bias-variance analysis\",\"authors\":\"\",\"doi\":\"10.1007/s10618-024-01015-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<h3>Abstract</h3> <p>One of the central issues of several machine learning applications on real data is the choice of the input features. Ideally, the designer should select a small number of the relevant, nonredundant features to preserve the complete information contained in the original dataset, with little collinearity among features. This procedure helps mitigate problems like overfitting and the curse of dimensionality, which arise when dealing with high-dimensional problems. On the other hand, it is not desirable to simply discard some features, since they may still contain information that can be exploited to improve results. Instead, <em>dimensionality reduction</em> techniques are designed to limit the number of features in a dataset by projecting them into a lower dimensional space, possibly considering all the original features. However, the projected features resulting from the application of dimensionality reduction techniques are usually difficult to interpret. In this paper, we seek to design a principled dimensionality reduction approach that maintains the interpretability of the resulting features. Specifically, we propose a bias-variance analysis for linear models and we leverage these theoretical results to design an algorithm, <em>Linear Correlated Features Aggregation</em> (LinCFA), which aggregates groups of continuous features with their average if their correlation is “sufficiently large”. In this way, all features are considered, the dimensionality is reduced and the interpretability is preserved. Finally, we provide numerical validations of the proposed algorithm both on synthetic datasets to confirm the theoretical results and on real datasets to show some promising applications.</p>\",\"PeriodicalId\":55183,\"journal\":{\"name\":\"Data Mining and Knowledge Discovery\",\"volume\":\"86 1\",\"pages\":\"\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-03-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data Mining and Knowledge Discovery\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10618-024-01015-0\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Mining and Knowledge Discovery","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10618-024-01015-0","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

摘要对真实数据进行机器学习的若干应用的核心问题之一是输入特征的选择。理想情况下，设计者应选择少量相关的、非冗余的特征，以保留原始数据集所包含的完整信息，且特征之间的共线性很小。这一程序有助于缓解处理高维问题时出现的过拟合和维度诅咒等问题。另一方面，简单地丢弃某些特征并不可取，因为这些特征可能仍然包含可用于改进结果的信息。相反，降维技术旨在将数据集中的特征投影到一个更低的维度空间，从而限制特征的数量，并可能考虑到所有原始特征。然而，应用降维技术得到的投影特征通常很难解释。在本文中，我们试图设计一种有原则的降维方法，以保持所得特征的可解释性。具体来说，我们提出了线性模型的偏差-方差分析，并利用这些理论结果设计了一种算法--线性相关特征聚合（Linear Correlated Features Aggregation，LinCFA）。通过这种方法，所有特征都被考虑在内，维度得以降低，可解释性得以保持。最后，我们在合成数据集上对所提出的算法进行了数值验证，以确认理论结果，并在真实数据集上展示了一些有前景的应用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Interpretable linear dimensionality reduction based on bias-variance analysis

Abstract

One of the central issues of several machine learning applications on real data is the choice of the input features. Ideally, the designer should select a small number of the relevant, nonredundant features to preserve the complete information contained in the original dataset, with little collinearity among features. This procedure helps mitigate problems like overfitting and the curse of dimensionality, which arise when dealing with high-dimensional problems. On the other hand, it is not desirable to simply discard some features, since they may still contain information that can be exploited to improve results. Instead, dimensionality reduction techniques are designed to limit the number of features in a dataset by projecting them into a lower dimensional space, possibly considering all the original features. However, the projected features resulting from the application of dimensionality reduction techniques are usually difficult to interpret. In this paper, we seek to design a principled dimensionality reduction approach that maintains the interpretability of the resulting features. Specifically, we propose a bias-variance analysis for linear models and we leverage these theoretical results to design an algorithm, Linear Correlated Features Aggregation (LinCFA), which aggregates groups of continuous features with their average if their correlation is “sufficiently large”. In this way, all features are considered, the dimensionality is reduced and the interpretability is preserved. Finally, we provide numerical validations of the proposed algorithm both on synthetic datasets to confirm the theoretical results and on real datasets to show some promising applications.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Data Mining and Knowledge Discovery 工程技术-计算机：人工智能

CiteScore

10.40

自引率

4.20%

发文量

审稿时长

10 months

期刊介绍： Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.

期刊最新文献

Missing value replacement in strings and applications. FRUITS: feature extraction using iterated sums for time series classification Bounding the family-wise error rate in local causal discovery using Rademacher averages Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack Efficient learning with projected histograms