{"title":"Model selection procedure for high-dimensional data.","authors":"Yongli Zhang, Xiaotong Shen","doi":"10.1002/sam.10088","DOIUrl":null,"url":null,"abstract":"<p><p>For high-dimensional regression, the number of predictors may greatly exceed the sample size but only a small fraction of them are related to the response. Therefore, variable selection is inevitable, where consistent model selection is the primary concern. However, conventional consistent model selection criteria like BIC may be inadequate due to their nonadaptivity to the model space and infeasibility of exhaustive search. To address these two issues, we establish a probability lower bound of selecting the smallest true model by an information criterion, based on which we propose a model selection criterion, what we call RIC(c), which adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with of RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one if the smallest true model is selected by LAR. The proposed method is applied to real data from the power market and outperforms the backward variable selection in terms of price forecasting accuracy.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"3 5","pages":"350-358"},"PeriodicalIF":2.1000,"publicationDate":"2010-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.10088","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Analysis and Data Mining","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1002/sam.10088","RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 29
Abstract
For high-dimensional regression, the number of predictors may greatly exceed the sample size, but only a small fraction of them are related to the response. Variable selection is therefore inevitable, and consistent model selection is the primary concern. However, conventional consistent model selection criteria such as BIC may be inadequate because they do not adapt to the model space and an exhaustive search is infeasible. To address these two issues, we establish a lower bound on the probability of selecting the smallest true model by an information criterion, and based on it we propose a model selection criterion, which we call RIC(c), that adapts to the model space. Furthermore, we develop a computationally feasible method combining the computational power of least angle regression (LAR) with that of RIC(c). Both theoretical and simulation studies show that this method identifies the smallest true model with probability converging to one, provided the smallest true model lies on the LAR path. The proposed method is applied to real data from the power market and outperforms backward variable selection in terms of price forecasting accuracy.
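A minimal sketch of the general idea of pairing a LAR solution path with an information criterion follows. It uses scikit-learn's lars_path and a generic log(p)-type penalty as a stand-in; the actual RIC(c) penalty, its data-adaptive constant c, and the theoretical guarantees are defined in the paper and are not reproduced here, so penalty_scale and the criterion form below are illustrative assumptions only.

```python
# Illustrative sketch: trace the LAR path, then pick the step minimizing
# an information criterion. The penalty here (k * log(p) terms) is a
# placeholder for RIC(c), whose exact form is given in the paper.
import numpy as np
from sklearn.linear_model import lars_path, LinearRegression

def select_model_on_lar_path(X, y, penalty_scale=2.0):
    """Select the support along the LAR path minimizing a placeholder
    criterion: log(RSS/n) + penalty_scale * k * log(p) / n."""
    n, p = X.shape
    # lars_path returns the coefficients at each step of the LAR path
    alphas, active, coefs = lars_path(X, y, method="lar")
    best_score, best_support = np.inf, np.array([], dtype=int)
    for step in range(coefs.shape[1]):
        support = np.flatnonzero(coefs[:, step])
        k = support.size
        if k == 0:
            rss = np.sum((y - y.mean()) ** 2)
        else:
            # Refit ordinary least squares on the selected predictors
            fit = LinearRegression().fit(X[:, support], y)
            rss = np.sum((y - fit.predict(X[:, support])) ** 2)
        score = np.log(rss / n) + penalty_scale * k * np.log(p) / n
        if score < best_score:
            best_score, best_support = score, support
    return best_support

# Synthetic example with p >> n and a small true model
rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.standard_normal(n)
print(select_model_on_lar_path(X, y))
```

The key point the paper makes is that the penalty must grow with the size of the model space (hence the log(p) factor in this placeholder), unlike BIC's fixed log(n) penalty; the refit-then-score loop above mirrors scoring candidate models produced by LAR rather than searching the model space exhaustively.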
Journal Introduction:
Statistical Analysis and Data Mining addresses the broad area of data analysis, including statistical approaches, machine learning, data mining, and applications. Topics include statistical and computational approaches for analyzing massive and complex datasets, novel statistical and/or machine learning methods and theory, and state-of-the-art applications with high impact. Of special interest are articles that describe innovative analytical techniques, and discuss their application to real problems, in such a way that they are accessible and beneficial to domain experts across science, engineering, and commerce.
The focus of the journal is on papers which satisfy one or more of the following criteria:
Solve data analysis problems associated with massive, complex datasets
Develop innovative statistical approaches, machine learning algorithms, or methods integrating ideas across disciplines, e.g., statistics, computer science, electrical engineering, and operations research.
Formulate and solve high-impact real-world problems which challenge existing paradigms via new statistical and/or computational models
Provide surveys of prominent research topics.