Precision-based designs for sequential randomized experiments
Mattias Nordin, Mårten Schultzberg
arXiv:2405.03487 (arXiv - MATH - Statistics Theory, 2024-05-06)

In this paper, we consider an experimental setting where units enter the experiment sequentially. Our goal is to form stopping rules which lead to estimators of treatment effects with a given precision. We propose a fixed-width confidence interval design (FWCID) in which the experiment terminates once a pre-specified confidence interval width is achieved. We show that under this design, the difference-in-means estimator is a consistent estimator of the average treatment effect and that standard confidence intervals have asymptotic guarantees of coverage and efficiency for several versions of the design. We also propose a variant that we call the fixed power design (FPD), in which a given power is asymptotically guaranteed for a given treatment effect without the need to specify the variances of the outcomes under treatment or control; this design, too, yields a consistent difference-in-means estimator with correct coverage of the corresponding standard confidence interval. We complement our theoretical findings with Monte Carlo simulations comparing our proposed designs with standard designs in the sequential-experiments literature, showing that ours outperform them in several important respects. We believe our results are relevant for many experimental settings where units enter sequentially, such as clinical trials and the online A/B tests used by the tech and e-commerce industries.
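The FWCID stopping rule lends itself to a compact simulation. The sketch below is a hypothetical minimal version, not the paper's specification: alternating assignment, a normal outcome model, and the `n_min` burn-in are my simplifications. Recruitment continues until the standard confidence interval for the difference in means is narrower than a target width.

```python
import numpy as np

def fwcid_simulation(target_width, mu_t=1.0, mu_c=0.0, sd=1.0,
                     z=1.96, n_min=10, n_max=100_000, rng=None):
    """Illustrative fixed-width confidence interval design (FWCID):
    units arrive sequentially, are alternated between treatment and
    control, and the experiment stops once the two-sided CI for the
    difference in means is narrower than `target_width`."""
    rng = np.random.default_rng(rng)
    treat, ctrl = [], []
    while True:
        treat.append(rng.normal(mu_t, sd))  # next treated unit
        ctrl.append(rng.normal(mu_c, sd))   # next control unit
        n = len(treat)
        if n < n_min:                       # burn-in before variance estimates stabilize
            continue
        se = np.sqrt(np.var(treat, ddof=1) / n + np.var(ctrl, ddof=1) / n)
        width = 2 * z * se
        if width <= target_width or n >= n_max:
            diff = np.mean(treat) - np.mean(ctrl)
            return diff, (diff - z * se, diff + z * se), n
```

With unit variances and a 0.2 target width, the rule stops at roughly the sample size a fixed design would need if the variances were known in advance, which is the efficiency property the paper formalizes.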
Strang Splitting for Parametric Inference in Second-order Stochastic Differential Equations
Predrag Pilipovic, Adeline Samson, Susanne Ditlevsen
arXiv:2405.03606 (arXiv - MATH - Statistics Theory, 2024-05-06)

We address parameter estimation in second-order stochastic differential equations (SDEs), prevalent in physics, biology, and ecology. The second-order SDE is converted to a first-order system by introducing an auxiliary velocity variable, which raises two main challenges. First, the system is hypoelliptic, since the noise affects only the velocity, making the Euler-Maruyama estimator ill-conditioned. To overcome this, we propose an estimator based on the Strang splitting scheme. Second, since the velocity is rarely observed, we adjust the estimator for partial observations. We present four estimators for complete and partial observations, using either the full likelihood or only the velocity marginal likelihood. These estimators are intuitive, easy to implement, and computationally fast, and we prove their consistency and asymptotic normality. Our analysis demonstrates that using the full likelihood with complete observations reduces the asymptotic variance of the diffusion estimator. With partial observations, the asymptotic variance increases due to information loss but remains unaffected by the choice of likelihood. However, a numerical study on the Kramers oscillator reveals that using the marginal likelihood for partial observations yields less biased estimators. We apply our approach to paleoclimate data from the Greenland ice core, fitting the Kramers oscillator model and capturing transitions between metastable states that reflect observed climatic conditions during glacial eras.
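The hypoelliptic structure described above (noise entering only the velocity equation) is easy to see in a simulation of the Kramers oscillator, the paper's running example. This Euler-Maruyama sketch is for intuition only; the paper's point is precisely that inference based on this scheme is ill-conditioned, and the parameter values here are illustrative guesses, not fitted values.

```python
import numpy as np

def simulate_kramers(x0=1.0, v0=0.0, gamma=0.5, sigma=0.7,
                     dt=1e-3, n_steps=100_000, rng=None):
    """Euler-Maruyama simulation of the Kramers oscillator written as a
    first-order system: dX = V dt, dV = (X - X^3 - gamma*V) dt + sigma dW,
    with a double-well potential U(x) = x^4/4 - x^2/2. The Brownian
    increment enters only the velocity update: that is the hypoelliptic
    structure discussed in the abstract."""
    rng = np.random.default_rng(rng)
    x = np.empty(n_steps + 1)
    v = np.empty(n_steps + 1)
    x[0], v[0] = x0, v0
    dw = rng.normal(0.0, np.sqrt(dt), n_steps)
    for k in range(n_steps):
        x[k + 1] = x[k] + v[k] * dt                                   # position: no noise
        v[k + 1] = v[k] + (x[k] - x[k]**3 - gamma * v[k]) * dt + sigma * dw[k]
    return x, v
```

With enough noise relative to the well depth, the path hops between the two metastable wells near x = ±1, the behavior the paper matches to transitions in the Greenland ice-core record.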
Stability of a Generalized Debiased Lasso with Applications to Resampling-Based Variable Selection
Jingbo Liu
arXiv:2405.03063 (arXiv - MATH - Statistics Theory, 2024-05-05)

Suppose that we first apply the Lasso to a design matrix, and then update one of its columns. In general, the signs of the Lasso coefficients may change, and there is no closed-form expression for updating the Lasso solution exactly. In this work, we propose an approximate formula for updating a debiased Lasso coefficient. We provide general nonasymptotic error bounds in terms of the norms and correlations of a given design matrix's columns, and then prove asymptotic convergence results for the case of a random design matrix with i.i.d. sub-Gaussian row vectors and i.i.d. Gaussian noise. Notably, the approximate formula is asymptotically correct for most coordinates in the proportional growth regime, under the mild assumption that each row of the design matrix is sub-Gaussian with a covariance matrix having a bounded condition number. Our proof only requires certain concentration and anti-concentration properties to control various error terms and the number of sign changes. In contrast, rigorously establishing distributional limit properties (e.g., Gaussian limits for the debiased Lasso) under similarly general assumptions has been considered an open problem in universality theory. As applications, we show that the approximate formula allows us to reduce the computational complexity of variable selection algorithms that require solving multiple Lasso problems, such as the conditional randomization test and a variant of the knockoff filter.
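A useful sanity check on the debiased Lasso construction this literature builds on (the standard one-step correction, not the paper's approximate update formula): when p < n and the correction matrix M is the exact inverse sample covariance, the algebra b + M X^T(y - Xb)/n = (X^T X)^{-1} X^T y shows that debiasing returns OLS exactly, for any starting b. In high dimensions M is only an approximate inverse, which is where the interesting analysis lives. The sketch below verifies the low-dimensional identity with a small ISTA Lasso solver.

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding operator, the proximal map of the L1 norm."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by ISTA
    (proximal gradient with step 1/L)."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft(b - grad / L, lam / L)
    return b

def debiased_lasso(X, y, lam):
    """One-step debiasing b_d = b + M X^T(y - Xb)/n, with M the inverse
    sample covariance (well-defined here because p < n)."""
    n, p = X.shape
    b = lasso_ista(X, y, lam)
    M = np.linalg.inv(X.T @ X / n)
    return b + M @ X.T @ (y - X @ b) / n, b
```

Because the identity holds for any b, the debiased estimate coincides with least squares to floating-point precision even though the Lasso pilot is biased.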
Limiting Behavior of Maxima under Dependence
Klaus Herrmann, Marius Hofert, Johanna G. Neslehova
arXiv:2405.02833 (arXiv - MATH - Statistics Theory, 2024-05-05)

Weak convergence of maxima of dependent sequences of identically distributed continuous random variables is studied under normalizing sequences arising as subsequences of the normalizing sequences from an associated iid sequence. This general framework allows one to derive several generalizations of the well-known Fisher-Tippett-Gnedenko theorem under conditions on the univariate marginal distribution and the dependence structure of the sequence. The limiting distributions are shown to be compositions of a generalized extreme value distribution and a distortion function which reflects the limiting behavior of the diagonal of the underlying copula. Uniform convergence rates for the weak convergence to the limiting distribution are also derived. Examples covering well-known dependence structures are provided. Several existing results, e.g., for exchangeable sequences or stationary time series, are embedded in the proposed framework.
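The Fisher-Tippett-Gnedenko baseline that the paper generalizes can be checked numerically: for iid Exp(1) variables, the maximum minus log n converges in distribution to the Gumbel law exp(-e^{-x}). A minimal sketch (the exponential marginal and the evaluation point are illustrative choices):

```python
import numpy as np

def maxima_cdf_at(x, n, n_rep, rng=None):
    """Empirical P(M_n - log n <= x) for maxima M_n of n iid Exp(1)
    variables. The Fisher-Tippett-Gnedenko theorem gives the Gumbel
    limit exp(-e^{-x}); for Exp(1) the finite-n value is (1 - e^{-x}/n)^n,
    so convergence is fast."""
    rng = np.random.default_rng(rng)
    m = rng.exponential(size=(n_rep, n)).max(axis=1) - np.log(n)
    return np.mean(m <= x)
```

At x = 0 the limit is e^{-1} ≈ 0.368; the paper's dependent-sequence limits compose such a GEV distribution with a copula-driven distortion.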
Probabilistic cellular automata with local transition matrices: synchronization, ergodicity, and inference
Erhan Bayraktar, Fei Lu, Mauro Maggioni, Ruoyu Wu, Sichen Yang
arXiv:2405.02928 (arXiv - MATH - Statistics Theory, 2024-05-05)

We introduce a new class of probabilistic cellular automata that are capable of exhibiting rich dynamics such as synchronization and ergodicity and can be easily inferred from data. The system is a finite-state locally interacting Markov chain on a circular graph. Each site's subsequent state is random, with a distribution determined by its neighborhood's empirical distribution multiplied by a local transition matrix. We establish necessary and sufficient conditions on the local transition matrix for synchronization and ergodicity. We also introduce novel least squares estimators for inferring the local transition matrix from various types of data, which may consist of multiple trajectories, a long trajectory, or ensemble sequences without trajectory information. Under suitable identifiability conditions, we show the asymptotic normality of these estimators and provide non-asymptotic bounds for their accuracy.
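The update rule can be sketched directly from the description above: each site on the circular graph draws its next state from its neighborhood's empirical distribution multiplied by a row-stochastic local transition matrix T. The three-site neighborhood {i-1, i, i+1} used here is an assumption for illustration; the paper may define neighborhoods differently.

```python
import numpy as np

def step(states, T, n_states, rng):
    """One synchronous update of the probabilistic cellular automaton:
    site i computes the empirical distribution p_i of states in its
    neighborhood on the circular graph, then samples its next state from
    q_i = p_i @ T. Because p_i is a probability vector and T is
    row-stochastic, q_i is again a valid distribution."""
    N = len(states)
    new = np.empty(N, dtype=int)
    for i in range(N):
        nbrs = [states[(i - 1) % N], states[i], states[(i + 1) % N]]
        p = np.bincount(nbrs, minlength=n_states) / 3.0  # empirical distribution
        q = p @ T                                        # mix through the local matrix
        new[i] = rng.choice(n_states, p=q)
    return new
```

Iterating `step` from a random configuration lets one observe the synchronization versus ergodicity regimes that the paper characterizes through conditions on T.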
Tuning parameter selection in econometrics
Denis Chetverikov
arXiv:2405.03021 (arXiv - MATH - Statistics Theory, 2024-05-05)

I review some of the main methods for selecting tuning parameters in nonparametric and $\ell_1$-penalized estimation. For nonparametric estimation, I consider the methods of Mallows, Stein, Lepski, cross-validation, penalization, and aggregation in the context of series estimation. For $\ell_1$-penalized estimation, I consider methods based on the theory of self-normalized moderate deviations, the bootstrap, Stein's unbiased risk estimation, and cross-validation in the context of Lasso estimation. I explain the intuition behind each of the methods and discuss their comparative advantages. I also give some extensions.
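Among the reviewed methods, cross-validation is the easiest to sketch. The snippet below selects a penalty level by k-fold CV for ridge regression, chosen here only because it has a closed-form fit; the same loop applies unchanged to Lasso or series estimators.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimator (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_select(X, y, grid, k=5, rng=None):
    """k-fold cross-validation over a grid of penalty levels: for each
    lambda, average the held-out MSE across folds and return the
    minimizer together with the CV error curve."""
    rng = np.random.default_rng(rng)
    n = len(y)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for lam in grid:
        fold_err = []
        for f in folds:
            train = np.setdiff1d(idx, f)            # everything outside the fold
            b = ridge_fit(X[train], y[train], lam)
            fold_err.append(np.mean((y[f] - X[f] @ b) ** 2))
        errs.append(np.mean(fold_err))
    return grid[int(np.argmin(errs))], errs
```

Inspecting the returned error curve, rather than just its minimizer, is often informative: a flat curve signals that the fit is insensitive to the tuning parameter over that range.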
Negative Probability
Nick Polson, Vadim Sokolov
arXiv:2405.03043 (arXiv - MATH - Statistics Theory, 2024-05-05)

Negative probabilities arise primarily in quantum theory and computing. Bartlett provides a definition based on characteristic functions and extraordinary random variables. As Bartlett observes, negative probabilities must always be combined with positive probabilities to yield a valid probability distribution before any physical interpretation is admissible. Negative probabilities arise as mixing distributions of unobserved latent variables in Bayesian modeling. Our goal is to provide a link with dual densities and the class of scale mixtures of normal distributions. We provide an analysis of the classic half-coin distribution and Feynman's negative probability examples. A number of examples of dual densities with negative mixing measures, including the Linnik distribution, the Wigner distribution, and the stable distribution, are provided. Finally, we conclude with directions for future research.
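The half-coin makes a concrete worked example of the combination requirement. Its signed "probabilities" are the coefficients of the generating function sqrt((1+x)/2), individually negative beyond the first terms, yet the two-fold convolution of the sequence with itself is an ordinary fair coin: the negative masses cancel exactly once combined.

```python
import numpy as np

def half_coin_coeffs(K):
    """First K coefficients of sqrt((1+x)/2), the 'half-coin' signed
    distribution. Uses the generalized binomial series: coefficient k is
    C(1/2, k) / sqrt(2), built up by the recursion
    C(1/2, k+1) = C(1/2, k) * (1/2 - k) / (k + 1)."""
    c = np.empty(K)
    binom = 1.0                        # C(1/2, 0)
    for k in range(K):
        c[k] = binom / np.sqrt(2.0)
        binom *= (0.5 - k) / (k + 1)   # advance to C(1/2, k+1)
    return c

c = half_coin_coeffs(12)
fair = np.convolve(c, c)               # squaring the generating function gives (1+x)/2
```

The low-order entries of `fair` are exactly 1/2, 1/2, 0, 0, ... — two half-coins flipped together behave as one fair coin, even though no single half-coin admits a physical interpretation on its own.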
Unscented Trajectory Optimization
I. M. Ross, R. J. Proulx, M. Karpenko
arXiv:2405.02753 (arXiv - MATH - Statistics Theory, 2024-05-04)

In a nutshell, unscented trajectory optimization is the generation of optimal trajectories through the use of an unscented transform. Although unscented trajectory optimization was introduced by the authors about a decade ago, it is reintroduced in this paper as a special instantiation of tychastic optimal control theory. Tychastic optimal control theory (from Tyche, the Greek goddess of chance) avoids the use of a Brownian motion and the resulting Itô calculus even though it uses random variables across the entire spectrum of a problem formulation. This approach circumvents the enormous technical and numerical challenges associated with stochastic trajectory optimization. Furthermore, it is shown how a tychastic optimal control problem that involves nonlinear transformations of the expectation operator can be quickly instantiated using an unscented transform. These nonlinear transformations are particularly useful in managing trajectory dispersions, whether associated with path constraints or with targeted values of final-time conditions. This paper also presents a systematic and rapid process for formulating and computing the most desirable tychastic trajectory using an unscented transform. Numerical examples are used to illustrate how unscented trajectory optimization may be used for risk reduction and mission recovery caused by uncertainties and failures.
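The unscented transform at the core of the method can be sketched in a few lines: propagate 2n+1 sigma points through the nonlinearity and recombine with standard weights. For a linear map the transform is exact, which gives a convenient correctness check. The parameter choices alpha=1, beta=2, kappa=0 are conventional defaults, not the paper's.

```python
import numpy as np

def unscented_transform(f, mean, cov, alpha=1.0, beta=2.0, kappa=0.0):
    """Propagate a mean and covariance through a nonlinear map f using
    the standard 2n+1 sigma points: the mean, plus/minus the scaled
    columns of a Cholesky factor of the covariance."""
    n = len(mean)
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * cov)          # S @ S.T = (n+lam) * cov
    pts = ([mean]
           + [mean + S[:, i] for i in range(n)]
           + [mean - S[:, i] for i in range(n)])
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))   # mean weights
    wc = wm.copy()                                   # covariance weights
    wm[0] = lam / (n + lam)
    wc[0] = lam / (n + lam) + (1 - alpha**2 + beta)
    ys = np.array([f(p) for p in pts])
    m = wm @ ys
    d = ys - m
    P = (wc[:, None] * d).T @ d
    return m, P
```

Within a trajectory optimizer, each sigma point carries a copy of the dynamics, so dispersion constraints on the state become ordinary constraints on the recombined mean and covariance — no Itô calculus required, which is the tychastic simplification the paper exploits.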
Power-Enhanced Two-Sample Mean Tests for High-Dimensional Compositional Data with Application to Microbiome Data Analysis
Danning Li, Lingzhou Xue, Haoyi Yang, Xiufan Yu
arXiv:2405.02551 (arXiv - MATH - Statistics Theory, 2024-05-04)

Testing differences in mean vectors is a fundamental task in the analysis of high-dimensional compositional data. Existing methods may suffer from low power if the underlying signal pattern does not favor the deployed test. In this work, we develop two-sample power-enhanced mean tests for high-dimensional compositional data based on the combination of $p$-values, which integrates strengths from two popular types of tests: the maximum-type test and the quadratic-type test. We provide rigorous theoretical guarantees on the proposed tests, showing accurate Type-I error rate control and enhanced testing power. Our method boosts the testing power towards a broader alternative space, which yields robust performance across a wide range of signal pattern settings. Our theory also contributes to the literature on power enhancement and Gaussian approximation for high-dimensional hypothesis testing. We demonstrate the performance of our method on both simulated data and real-world microbiome data, showing that our proposed approach improves the testing power substantially compared to existing methods.
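The p-value combination idea can be illustrated with a generic sketch: compute a quadratic-type and a max-type statistic, obtain a permutation p-value for each, and merge them with the Cauchy combination test. This ignores the compositional structure that is the paper's actual focus and is not the authors' procedure; it only shows why combining the two test types hedges against unknown signal patterns (dense signals favor the quadratic statistic, sparse signals the maximum).

```python
import numpy as np

def cauchy_combine(pvals):
    """Cauchy combination test: map each p-value to a standard Cauchy
    quantile, average, and map back. The result is a valid p-value even
    under dependence between the component tests."""
    pvals = np.asarray(pvals, dtype=float)
    t = np.mean(np.tan((0.5 - pvals) * np.pi))
    return 0.5 - np.arctan(t) / np.pi

def perm_pvalues(x, y, n_perm=2000, rng=None):
    """Permutation p-values for a quadratic-type statistic (sum of squared
    coordinate-wise t-like statistics) and a max-type statistic (their
    maximum), combined via the Cauchy combination test."""
    rng = np.random.default_rng(rng)

    def stats(a, b):
        d = a.mean(0) - b.mean(0)
        se = np.sqrt(a.var(0, ddof=1) / len(a) + b.var(0, ddof=1) / len(b))
        t2 = (d / se) ** 2
        return t2.sum(), t2.max()

    q_obs, m_obs = stats(x, y)
    z = np.vstack([x, y])
    nx = len(x)
    q_null = np.empty(n_perm)
    m_null = np.empty(n_perm)
    for i in range(n_perm):                     # relabel groups at random
        idx = rng.permutation(len(z))
        q_null[i], m_null[i] = stats(z[idx[:nx]], z[idx[nx:]])
    pq = (1 + np.sum(q_null >= q_obs)) / (n_perm + 1)
    pm = (1 + np.sum(m_null >= m_obs)) / (n_perm + 1)
    return cauchy_combine([pq, pm])
```

Whichever component test matches the true alternative drives the combined p-value down, which is the power-enhancement behavior the paper establishes with analytic (rather than permutation) calibration.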
Grouping predictors via network-wide metrics
Brandon Woosuk Park, Anand N. Vidyashankar, Tucker S. McElroy
arXiv:2405.02715 (arXiv - MATH - Statistics Theory, 2024-05-04)

When multitudes of features can plausibly be associated with a response, both privacy considerations and model parsimony suggest grouping them to increase the predictive power of a regression model. Specifically, the identification of groups of predictors significantly associated with the response variable eases further downstream analysis and decision-making. This paper proposes a new data analysis methodology that utilizes the high-dimensional predictor space to construct an implicit network with weighted edges to identify significant associations between the response and the predictors. Using a population model for groups of predictors defined via network-wide metrics, a new supervised grouping algorithm is proposed that determines the correct group with probability tending to one as the sample size diverges to infinity. To this end, we establish several theoretical properties of the estimates of network-wide metrics. A novel model-assisted bootstrap procedure that substantially decreases computational complexity is developed, facilitating the assessment of uncertainty in the estimates of network-wide metrics. The proposed methods account for several challenges that arise in the high-dimensional data setting, including (i) a large number of predictors, (ii) uncertainty regarding the true statistical model, and (iii) model selection variability. The performance of the proposed methods is demonstrated through numerical experiments, data from sports analytics, and breast cancer data.
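As a loose illustration of grouping predictors through an implicit network, the sketch below connects columns whose absolute correlation exceeds a threshold and returns the connected components. This is a crude stand-in: the paper's groups are defined via network-wide metrics and a supervised algorithm, not a simple unsupervised correlation cut.

```python
import numpy as np

def correlation_groups(X, tau=0.5):
    """Build a predictor network with an edge between columns j and k
    whenever |corr(X_j, X_k)| > tau, then return its connected
    components as sorted index lists (depth-first search)."""
    p = X.shape[1]
    C = np.abs(np.corrcoef(X, rowvar=False))
    adj = (C > tau) & ~np.eye(p, dtype=bool)     # drop self-loops
    groups, seen = [], set()
    for s in range(p):
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:                             # DFS over the component
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(int(u) for u in np.nonzero(adj[v])[0])
        seen |= comp
        groups.append(sorted(int(i) for i in comp))
    return groups
```

On data with two latent factors, each driving three predictors, the components recover the two blocks; the paper's contribution is doing this with statistical guarantees and bootstrap uncertainty assessment rather than a fixed threshold.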