Exponential calibration for correlation coefficient with additive distortion measurement errors
Jun Zhang, Zhuoer Xu. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-04-20. DOI: 10.1002/sam.11509 (https://doi.org/10.1002/sam.11509).
This paper studies the estimation of the correlation coefficient between unobserved variables of interest. These unobservable variables are distorted additively by an observed confounding variable. We propose a new identifiability condition that uses exponential calibration to obtain calibrated variables, together with a direct plug-in estimator of the correlation coefficient, and show that the plug-in estimator is asymptotically efficient. Next, we construct confidence intervals based on an asymptotic normal approximation and on an empirical-likelihood statistic. Last, we propose several test statistics for testing whether the true correlation coefficient is zero and examine their asymptotic properties. Monte Carlo simulation experiments assess the performance of the proposed estimators and test statistics, and the methods are applied to a temperature forecast data set as an illustration.
Supervised compression of big data
V. R. Joseph, Simon Mak. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-04-08. DOI: 10.1002/sam.11508 (https://doi.org/10.1002/sam.11508).
The phenomenon of big data has become ubiquitous in nearly all disciplines, from science to engineering. A key challenge is using such data to fit statistical and machine learning models, which can incur high computational and storage costs. One solution is to perform model fitting on a carefully selected subset of the data. Various data reduction methods have been proposed in the literature, ranging from random subsampling to methods based on optimal experimental design. However, when the goal is to learn the underlying input–output relationship, such reduction methods may not be ideal, since they do not make use of the information contained in the output. To this end, we propose a supervised data compression method called supercompress, which integrates output information by sampling data from the regions most important for modeling the desired input–output relationship. An advantage of supercompress is that it is nonparametric: the compression does not rely on parametric modeling assumptions between inputs and output. As a result, the proposed method is robust to a wide range of modeling choices. We demonstrate the usefulness of supercompress over existing data reduction methods in both simulations and a taxicab predictive modeling application.
Emulated order identification for models of big time series data
Brian Wu, Dorin Drignei. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-04-01. DOI: 10.1002/sam.11504 (https://doi.org/10.1002/sam.11504).
This interdisciplinary research combines elements of computing, optimization, and statistics for big data. Specifically, it addresses model order identification for big time series data. Computing and minimizing information criteria, such as BIC, on a grid of integer orders becomes prohibitive for time series recorded at a large number of time points. We propose to compute information criteria only for a sample of integer orders and to use kriging-based methods to emulate the information criteria on the rest of the grid. An efficient global optimization (EGO) algorithm is then used to identify the orders. The method is applied to both ARMA and ARMA-GARCH models. We simulated time series of prespecified orders from each type of model and applied the method to recover those orders. We also illustrate the method on real big time series with tens of thousands of time points: sentiment scores for news headlines on the economy for ARMA models, and NASDAQ daily returns, from the index's beginning in 1971 to mid-April 2020 in the early stages of the COVID-19 pandemic, for ARMA-GARCH models. The proposed method identifies the orders of models for big time series data efficiently and accurately.
Trees, forests, chickens, and eggs: when and why to prune trees in a random forest
Siyu Zhou, L. Mentch. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-03-30. DOI: 10.1002/sam.11594 (https://doi.org/10.1002/sam.11594).
Owing to their long-standing reputation as excellent off-the-shelf predictors, random forests (RFs) remain a go-to model for applied statisticians and data scientists. Despite their widespread use, however, until recently little was known about their inner workings or about which aspects of the procedure drive their success. Very recently, two competing hypotheses have emerged: one based on interpolation and the other on regularization. This work argues in favor of the latter by using the regularization framework to reexamine the decades-old question of whether individual trees in an ensemble ought to be pruned. Although default constructions of RFs in most popular software packages use near-full-depth trees, we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that RFs with shallow trees are advantageous when the signal-to-noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of "double descent" in RFs by drawing parallels to U-statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.
A comparison of Gaussian processes and neural networks for computer model emulation and calibration
Samuel Myren, E. Lawrence. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-03-30. DOI: 10.1002/sam.11507 (https://doi.org/10.1002/sam.11507).
The Department of Energy relies on complex physics simulations for prediction in domains such as cosmology, nuclear theory, and materials science. These simulations are often extremely computationally intensive, with some requiring days or weeks for a single run. To ensure their accuracy, the models are calibrated against observational data to estimate inputs and systematic biases. Because of the great computational cost, this process typically requires the construction of an emulator, a fast approximation to the simulation. In this paper, two emulator approaches are compared: Gaussian process regression and neural networks. Their emulation accuracy and calibration performance are assessed on three real problems of interest to the Department of Energy. On these problems, the Gaussian process emulator tends to be more accurate, with narrower but still well-calibrated uncertainty estimates. The neural network emulator is accurate but tends to have large uncertainty on its predictions. As a result, calibration with the Gaussian process emulator produces more constrained posteriors that still perform well in prediction.
Approximation error of Fourier neural networks
Abylay Zhumekenov, Rustem Takhanov, Alejandro J. Castro, Z. Assylbekov. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-03-23. DOI: 10.1002/sam.11506 (https://doi.org/10.1002/sam.11506).
The paper investigates the approximation error of two-layer feedforward Fourier Neural Networks (FNNs). Such networks are motivated by the approximation properties of Fourier series. Several implementations of FNNs have been proposed since the 1980s, by Gallant and White, Silvescu, Tan, Zuo and Cai, and Liu. The main focus of our work is Silvescu's FNN, because its activation function does not fit into the class of networks, described extensively by Hornik, in which an activation is applied to a linearly transformed input. For this nontrivial case, the convergence rate of Silvescu's FNN is proven to be of order O(1/n). The paper then investigates the classes of functions approximated by Silvescu's FNN, which turn out to lie in the Schwartz space and the space of positive definite functions.
Feature selection for imbalanced data with deep sparse autoencoders ensemble
M. Massi, F. Ieva, Francesca Gasperoni, A. Paganoni. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-03-22. DOI: 10.1002/sam.11567 (https://doi.org/10.1002/sam.11567).
Class imbalance is a common issue in many domains where learning algorithms are applied, and in those same domains it is often far more important to correctly classify and profile minority-class observations. This need can be addressed by feature selection (FS), which offers several further advantages, such as decreasing computational costs and aiding inference and interpretability. However, traditional FS techniques may become suboptimal in the presence of strongly imbalanced data. To retain the advantages of FS in this setting, we propose a filtering FS algorithm that ranks feature importance on the basis of the reconstruction error of a deep sparse autoencoders ensemble (DSAEE). Each DSAE is trained only on the majority class and used to reconstruct both classes. From the aggregated reconstruction error, we identify the features for which the minority class has a different distribution of values with respect to the overrepresented class, and hence the features most relevant for discriminating between the two. We empirically demonstrate the efficacy of our algorithm in several experiments, both simulated and on high-dimensional datasets of varying sample size, showing its ability to select relevant and generalizable features for profiling and classifying the minority class and to outperform other benchmark FS methods. We also briefly present a real application in radiogenomics where the methodology was applied successfully.
Generalized mixed-effects random forest: A flexible approach to predict university student dropout
Massimo Pellagatti, Chiara Masci, F. Ieva, A. Paganoni. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-03-09. DOI: 10.1002/sam.11505 (https://doi.org/10.1002/sam.11505).
We propose a new statistical method, called generalized mixed-effects random forest (GMERF), that extends random forests to the analysis of hierarchical data for any type of response variable in the exponential family. The method retains the flexibility and the ability to model complex patterns within the data that are typical of tree-based ensemble methods, and it can handle both continuous and discrete covariates. At the same time, GMERF accounts for the nested structure of hierarchical data, modeling the dependence structure at the highest level of the hierarchy and allowing statistical inference on this structure. In the case study, we apply GMERF to higher education data to analyze the phenomenon of university student dropout. We predict engineering students' dropout probability using student-level information, with the degree program in which students are enrolled as the grouping factor.
Model-based clustering of time-dependent categorical sequences with application to the analysis of major life event patterns
Yingying Zhang, Volodymyr Melnykov, Xuwen Zhu. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-03-08. DOI: 10.1002/sam.11502 (https://doi.org/10.1002/sam.11502).
Clustering categorical sequences is a problem that arises in many fields. A few techniques are available in this framework, but none of them take into account the possible temporal character of transitions from one state to another. A mixture of Markov models is proposed in which transition probabilities are represented as functions of time. The corresponding expectation-maximization algorithm is discussed along with related computational challenges. The effectiveness of the proposed procedure is illustrated in a set of simulation studies, in which it outperforms four alternative approaches. The method is applied to major life event sequences from the British Household Panel Survey. As reflected by the Bayesian Information Criterion, the proposed model performs substantially better than its competitors. Analysis of the results and the associated transition probability plots reveals two groups of individuals: people with a conventional life course and those encountering some challenges.
Subsampling from features in large regression to find "winning features"
Yiying Fan, Jiayang Sun. Statistical Analysis and Data Mining: The ASA Data Science Journal, published 2021-02-27. DOI: 10.1002/sam.11499 (https://doi.org/10.1002/sam.11499).
Feature (or variable) selection from a large number p of features continually challenges data science, especially with ever-growing data and the need to discover scientifically important features in a regression setting. For example, to develop valid drug targets for ovarian cancer, we must control the false discovery rate (FDR) of the selection procedure. The popular approach to feature selection in large-p regression uses a penalized likelihood or shrinkage estimation, such as the LASSO, SCAD, Elastic Net, or MCP procedures. We present a different approach, called the Subsampling Winner algorithm (SWA), which subsamples from the p features. The idea of SWA is analogous to the selection of US National Merit Scholars: semifinalists are chosen based on students' performance in tests taken at local schools (a.k.a. subsample analyses), and the finalists (a.k.a. winning features) are then determined from the semifinalists. Because of its subsampling nature, SWA can scale to data of any dimension. SWA also has the best-controlled FDR compared with the penalized and random forest procedures, while maintaining a competitive true-feature discovery rate. Our application of SWA to an ovarian cancer dataset revealed functionally important genes and pathways.