Variable selection for both outcomes and predictors: sparse multivariate principal covariates regression

Impact factor: 2.9 · Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Journal: Machine Learning · Pub Date: 2024-07-17 · DOI: 10.1007/s10994-024-06520-3

Soogeun Park, Eva Ceulemans, Katrijn Van Deun



Abstract

Datasets comprising large sets of both predictor and outcome variables are becoming more widely used in research. In addition to the well-known problems of model complexity and predictor variable selection, predictive modelling with such large data also presents a relatively novel and under-studied challenge: outcome variable selection. Certain outcome variables in the data may not be adequately predicted by the given set of predictors. In this paper, we propose Sparse Multivariate Principal Covariates Regression, a method that addresses these issues together by extending the Principal Covariates Regression model to incorporate sparsity penalties on both predictor and outcome variables. Our method is one of the first to perform variable selection for predictors and outcomes simultaneously. Moreover, by relying on summary variables that explain the variance in both predictor and outcome variables, the method offers a sparse and succinct model representation of the data. In a simulation study, the method outperformed methods with similar aims, such as sparse Partial Least Squares, in predicting the outcome variables and recovering the population parameters. Lastly, we applied the method to an empirical dataset to illustrate its use in practice.
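
To make the idea of penalizing both sides of the model concrete, the following is a minimal sketch of a penalized loss in the spirit of Principal Covariates Regression. It is not the authors' estimation procedure: the abstract only states that sparsity penalties are placed on predictor and outcome variables, so the lasso-type penalties, the weighting parameter `alpha`, the penalty strengths `lam_w` and `lam_py`, and all function names are assumptions made for illustration. Summary variables are formed as `T = X @ W`; a predictor whose row in `W` is entirely zero is treated as deselected, and likewise an outcome whose row in `P_y` is entirely zero.

```python
import numpy as np

def penalized_pcovr_loss(X, Y, W, P_x, P_y, alpha=0.5, lam_w=0.1, lam_py=0.1):
    """Illustrative loss (assumed form): weighted fit to Y and X plus L1 penalties.

    X: (n, J) predictors, Y: (n, K) outcomes,
    W: (J, R) predictor weights, P_x: (J, R) and P_y: (K, R) loadings.
    """
    T = X @ W                                   # summary variables (components)
    fit_y = np.sum((Y - T @ P_y.T) ** 2)        # how well components predict outcomes
    fit_x = np.sum((X - T @ P_x.T) ** 2)        # how well components reconstruct predictors
    penalty = lam_w * np.abs(W).sum() + lam_py * np.abs(P_y).sum()
    return alpha * fit_y + (1.0 - alpha) * fit_x + penalty

def selected_rows(M, tol=1e-8):
    """Indices of variables whose row in M has at least one non-zero coefficient."""
    return np.flatnonzero(np.abs(M).max(axis=1) > tol)

# Toy example (hypothetical numbers): 4 predictors, 3 outcomes, 2 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
W = np.array([[0.7, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]])  # predictors 2, 3 unused
P_x = rng.normal(size=(4, 2))
P_y = np.array([[1.0, 0.0], [0.0, 0.0], [0.5, -0.3]])           # outcome 1 unused
Y = X @ W @ P_y.T + 0.1 * rng.normal(size=(20, 3))

print("selected predictors:", selected_rows(W))
print("selected outcomes:", selected_rows(P_y))
print("loss:", penalized_pcovr_loss(X, Y, W, P_x, P_y))
```

In this sketch the sparse matrices are fixed by hand; an actual fitting routine would search for `W`, `P_x` and `P_y` that minimize such a loss, which is beyond what the abstract specifies.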




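Source journal

Machine Learning (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 11.00

Self-citation rate: 2.70%

Articles published: 162

Review time: 3 months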

Journal description: Machine Learning serves as a global platform dedicated to computational approaches in learning. The journal reports substantial findings on diverse learning methods applied to various problems, offering support through empirical studies, theoretical analysis, or connections to psychological phenomena. It demonstrates the application of learning methods to solve significant problems and aims to enhance the conduct of machine learning research with a focus on verifiable and replicable evidence in published papers.

Latest articles from this journal

Linear Causal Discovery with Interventional Constraints.

Interpretable optimisation-based approach for hyper-box classification.

Deep latent force models: ODE-based process convolutions for Bayesian deep learning.

Offline reinforcement learning for learning to dispatch for job shop scheduling.

Computing the distance between unbalanced distributions: the flat metric.