{"title":"Ranked sparsity: a cogent regularization framework for selecting and estimating feature interactions and polynomials","authors":"Ryan A. Peterson, Joseph E. Cavanaugh","doi":"10.1007/s10182-021-00431-7","DOIUrl":null,"url":null,"abstract":"<div><p>We explore and illustrate the concept of ranked sparsity, a phenomenon that often occurs naturally in modeling applications when an expected disparity exists in the quality of information between different feature sets. Its presence can cause traditional and modern model selection methods to fail because such procedures commonly presume that each potential parameter is equally worthy of entering into the final model—we call this presumption “covariate equipoise.” However, this presumption does not always hold, especially in the presence of derived variables. For instance, when all possible interactions are considered as candidate predictors, the premise of covariate equipoise will often produce over-specified and opaque models. The sheer number of additional candidate variables grossly inflates the number of false discoveries in the interactions, resulting in unnecessarily complex and difficult-to-interpret models with many (truly spurious) interactions. We suggest a modeling strategy that requires a stronger level of evidence in order to allow certain variables (e.g., interactions) to be selected in the final model. This ranked sparsity paradigm can be implemented with the sparsity-ranked lasso (SRL). We compare the performance of SRL relative to competing methods in a series of simulation studies, showing that the SRL is a very attractive method because it is fast and accurate and produces more transparent models (with fewer false interactions). We illustrate its utility in an application to predict the survival of lung cancer patients using a set of gene expression measurements and clinical covariates, searching in particular for gene–environment interactions.</p></div>","PeriodicalId":1,"journal":{"name":"Accounts of Chemical Research","volume":null,"pages":null},"PeriodicalIF":16.4000,"publicationDate":"2022-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s10182-021-00431-7.pdf","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Accounts of Chemical Research","FirstCategoryId":"100","ListUrlMain":"https://link.springer.com/article/10.1007/s10182-021-00431-7","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 1
Abstract
We explore and illustrate the concept of ranked sparsity, a phenomenon that often occurs naturally in modeling applications when an expected disparity exists in the quality of information between different feature sets. Its presence can cause traditional and modern model selection methods to fail because such procedures commonly presume that each potential parameter is equally worthy of entering into the final model—we call this presumption “covariate equipoise.” However, this presumption does not always hold, especially in the presence of derived variables. For instance, when all possible interactions are considered as candidate predictors, the premise of covariate equipoise will often produce over-specified and opaque models. The sheer number of additional candidate variables grossly inflates the number of false discoveries in the interactions, resulting in unnecessarily complex and difficult-to-interpret models with many (truly spurious) interactions. We suggest a modeling strategy that requires a stronger level of evidence in order to allow certain variables (e.g., interactions) to be selected in the final model. This ranked sparsity paradigm can be implemented with the sparsity-ranked lasso (SRL). We compare the performance of SRL relative to competing methods in a series of simulation studies, showing that the SRL is a very attractive method because it is fast and accurate and produces more transparent models (with fewer false interactions). We illustrate its utility in an application to predict the survival of lung cancer patients using a set of gene expression measurements and clinical covariates, searching in particular for gene–environment interactions.
期刊介绍:
Accounts of Chemical Research presents short, concise and critical articles offering easy-to-read overviews of basic research and applications in all areas of chemistry and biochemistry. These short reviews focus on research from the author’s own laboratory and are designed to teach the reader about a research project. In addition, Accounts of Chemical Research publishes commentaries that give an informed opinion on a current research problem. Special Issues online are devoted to a single topic of unusual activity and significance.
Accounts of Chemical Research replaces the traditional article abstract with an article "Conspectus." These entries synopsize the research affording the reader a closer look at the content and significance of an article. Through this provision of a more detailed description of the article contents, the Conspectus enhances the article's discoverability by search engines and the exposure for the research.