Reweighted nonparametric likelihood inference for linear functionals
Karun Adusumilli, Taisuke Otsu, Chen Qiu
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2168

This paper is concerned with inference on finite-dimensional parameters in semiparametric moment condition models in which the moment functionals are linear with respect to unknown nuisance functions. By exploiting this linearity, we reformulate the inference problem via the Riesz representer and develop a general inference procedure based on nonparametric likelihood. For treatment effect or missing data analysis, the Riesz representer is typically associated with the inverse propensity score, although the scope of our framework is much wider. In particular, we propose a two-step procedure: the first step computes projection weights to approximate the Riesz representer, and the second step reweights the moment conditions so that the likelihood increment admits an asymptotically pivotal chi-square calibration. Our reweighting method extends naturally to inference on missing data, treatment effects, data combination models, and other semiparametric problems. Simulation and real data examples illustrate the usefulness of the proposed method. We note that our reweighting method and theoretical results are limited to linear functionals.
Besov-Laplace priors in density estimation: optimal posterior contraction rates and adaptation
Matteo Giordano
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2161

Besov priors are nonparametric priors that can model spatially inhomogeneous functions. They are routinely used in inverse problems and imaging, where they exhibit attractive sparsity-promoting and edge-preserving features. A recent line of work has initiated the study of their asymptotic frequentist convergence properties. In the present paper, we consider the theoretical recovery performance of the posterior distributions associated with Besov-Laplace priors in the density estimation model, under the assumption that the observations are generated by a possibly spatially inhomogeneous true density belonging to a Besov space. We improve on existing results and show that carefully tuned Besov-Laplace priors attain optimal posterior contraction rates. Furthermore, we show that hierarchical procedures involving a hyper-prior on the regularity parameter lead to adaptation to any smoothness level.
Asymptotic normality of Gini correlation in high dimension with applications to the K-sample problem
Yongli Sang, Xin Dang
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2165

The categorical Gini correlation proposed by Dang et al. [7] is a dependence measure that characterizes independence between categorical and numerical variables. The asymptotic distributions of the sample correlation under dependence and under independence have been established when the dimension of the numerical variable is fixed. However, its asymptotic behavior for high-dimensional data has not been explored. In this paper, we develop a central limit theorem for the Gini correlation in the more realistic setting where the dimensionality of the numerical variable is diverging. We then construct a powerful and consistent test for the K-sample problem based on this asymptotic normality. The proposed test not only avoids the computational burden of the permutation procedure but also gains power over it. Simulation studies and real data illustrations show that the proposed test compares favorably with existing methods across a broad range of realistic situations, especially in unbalanced cases.
Pretest estimation in combining probability and non-probability samples
Chenyin Gao, Shu Yang
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2137

Multiple heterogeneous data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we develop a unified framework of the test-and-pool approach to general parameter estimation by combining gold-standard probability and non-probability samples. We focus on the case where the study variable used to estimate the target parameters is observed in both datasets, each of which also contains auxiliary variables. Utilizing the probability design, we conduct a pretest procedure to determine the comparability of the non-probability data with the probability data and to decide whether to leverage the non-probability data in a pooled analysis. When the probability and non-probability data are comparable, our approach combines both datasets for efficient estimation; otherwise, we retain only the probability data. We also characterize the asymptotic distribution of the proposed test-and-pool estimator under a local alternative and provide a data-adaptive procedure to select the critical tuning parameters that minimize the mean squared error of the test-and-pool estimator. Lastly, to deal with the non-regularity of the test-and-pool estimator, we construct a robust confidence interval with good finite-sample coverage.
Selective inference for clustering with unknown variance
Y. Yun, R. Barber
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2143

In many modern statistical problems, the limited available data must be used both to develop the hypotheses to test and to test these hypotheses; that is, the data serve both exploratory and confirmatory analysis. Reusing the same dataset for exploration and testing can introduce massive selection bias, leading to many false discoveries. Selective inference is a framework that allows for valid inference even when the same data are reused for exploration and testing. In this work, we are interested in selective inference for data clustering, where a clustering procedure is used to hypothesize a separation of the data points into a collection of subgroups, and we then wish to test whether these data-dependent clusters in fact represent meaningful differences within the data. Recent work by Gao et al. [2022] provides a framework for selective inference in this setting, with cluster assignments produced by a hierarchical clustering algorithm; Chen and Witten [2022] extended it to k-means clustering. Both works assume a known covariance structure for the data, but in practice the noise level must be estimated, which is particularly challenging when the true cluster structure is unknown. We extend this framework to the setting of noise with unknown variance and provide a selective inference method for this more general setting. Empirical results show that our new method better maintains high power while controlling Type I error when the true noise level is unknown.
Online inference in high-dimensional generalized linear models with streaming data
Lan Luo, Ruijian Han, Yuanyuan Lin, Jian Huang
Electronic Journal of Statistics 17(2), 3443-3471 (2023). DOI: 10.1214/23-ejs2182

In this paper we develop an online statistical inference approach for high-dimensional generalized linear models with streaming data, for real-time estimation and inference. We propose an online debiased lasso method that aligns with the data collection scheme of streaming data. Online debiased lasso differs from offline debiased lasso in two important aspects. First, it updates component-wise confidence intervals of regression coefficients using only summary statistics of the historical data. Second, it adds a term to correct the approximation errors that accumulate throughout the online updating procedure. We show that our proposed online debiased estimators in generalized linear models are asymptotically normal. This result provides a theoretical basis for carrying out real-time interim statistical inference with streaming data. Extensive numerical experiments are conducted to evaluate the performance of the proposed method; they demonstrate the effectiveness of our algorithm and support the theoretical results. Furthermore, we illustrate the application of our method with a high-dimensional text dataset.
Design and analysis of bipartite experiments under a linear exposure-response model
Christopher Harshaw, Fredrik Sävje, David Eisenstat, Vahab Mirrokni, Jean Pouget-Abadie
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2111

A bipartite experiment consists of one set of units being assigned treatments and another set of units for which we measure outcomes. The two sets of units are connected by a bipartite graph, governing how the treated units can affect the outcome units. In this paper, we consider estimation of the average total treatment effect in the bipartite experimental framework under a linear exposure-response model. We introduce the Exposure Reweighted Linear (ERL) estimator, and show that the estimator is unbiased, consistent and asymptotically normal, provided that the bipartite graph is sufficiently sparse. To facilitate inference, we introduce an unbiased and consistent estimator of the variance of the ERL point estimator. Finally, we introduce a cluster-based design, Exposure-Design, that uses heuristics to increase the precision of the ERL estimator by realizing a desirable exposure distribution.
Semi-parametric inference for large-scale data with temporally dependent noise
Chunming Zhang, Xiao Guo, Min Chen, Xinze Du
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2171

Temporal dependence is frequently encountered in large-scale structured noisy data, arising from scientific studies in neuroscience and meteorology, among others. This challenging characteristic may not align with existing theoretical frameworks or data analysis tools. Motivated by multi-session fMRI time series data, this paper introduces a novel semi-parametric inference procedure suitable for a broad class of "non-stationary, non-Gaussian, temporally dependent" noise processes in time-course data. It develops a new test statistic based on a tapering-type estimator of the large-dimensional noise auto-covariance matrix and establishes its asymptotic chi-squared distribution. Our method not only relaxes the consistency requirement for the noise covariance matrix estimator but also avoids direct matrix inversion without sacrificing detection power. It adapts well to both stationary and a wider range of temporal noise processes, making it particularly effective for handling challenging scenarios involving very large scales of data and large dimensions of noise covariance matrices. We demonstrate the efficacy of the proposed procedure through simulation evaluations and real fMRI data analysis.
Deep learning for inverse problems with unknown operator
Miguel del Álamo
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2114

We consider ill-posed inverse problems where the forward operator $T$ is unknown, and instead we have access to training data consisting of functions $f_i$ and their noisy images $Tf_i$. This is a practically relevant and challenging problem which current methods can solve only under strong assumptions on the training set. Here we propose a new method that requires minimal assumptions on the data, and we prove reconstruction rates that depend on the number of training points and the noise level. We show that, in the regime of "many" training data, the method is minimax optimal. The proposed method employs a class of convolutional neural networks (U-nets) and empirical risk minimization in order to "fit" the unknown operator. In a nutshell, our approach is based on two ideas: the first is to relate U-nets to multiscale decompositions such as wavelets, thereby linking them to the existing theory, and the second is to use the hierarchical structure of U-nets and the low number of parameters of convolutional neural nets to prove entropy bounds that are practically useful. A significant difference from existing work on neural networks in nonparametric statistics is that we use them to approximate operators rather than functions, which we argue is mathematically more natural and technically more convenient.
Sufficient variable screening with high-dimensional controls
Chenlu Ke
Electronic Journal of Statistics (2023). DOI: 10.1214/23-ejs2150

Variable screening for ultrahigh-dimensional data has attracted extensive attention in the past decade. In many applications, researchers learn from previous studies about certain important predictors or control variables related to the response of interest. Such knowledge should be taken into account in the screening procedure. The development of variable screening conditional on prior information, however, has been less fruitful compared to the vast literature on generic unconditional screening. In this paper, we propose a model-free variable screening paradigm that allows for high-dimensional controls and applies to either continuous or categorical responses. The contribution of each individual predictor is quantified marginally and conditionally, in the presence of the control variables as well as the other candidates, by reproducing-kernel-based R2 and partial R2 statistics. As a result, the proposed method enjoys the sure screening property and the rank consistency property in the notion of sufficiency, which establish its superiority over existing methods. The advantages of the proposed method are demonstrated by simulation studies encompassing a variety of regression and classification models, and by an application to high-throughput gene expression data.