Simulation-calibration testing for inference in Lasso regressions
Matthieu Pluntz, Cyril Dalmasso, Pascale Tubert-Bitter, Ismail Ahmed
arXiv - STAT - Statistics Theory, 2024-09-03. DOI: arxiv-2409.02269 (https://doi.org/arxiv-2409.02269)
Abstract
We propose a test of the significance of a variable appearing on the Lasso
path and use it in a procedure for selecting one of the models of the Lasso
path, controlling the Family-Wise Error Rate. Our null hypothesis depends on a
set A of already selected variables and states that it contains all the active
variables. We focus on the regularization parameter value from which a first
variable outside A is selected. As the test statistic, we use this quantity's
conditional p-value, which we define conditional on the non-penalized estimated
coefficients of the model restricted to A. We estimate this by simulating
outcome vectors and then calibrating them on the observed outcome's estimated
coefficients. We adapt the calibration heuristically to the case of generalized
linear models in which it turns into an iterative stochastic procedure. We
prove that the test controls the risk of selecting a false positive in linear
models, both under the null hypothesis and, under a correlation condition, when
A does not contain all active variables. We assess the performance of our
procedure through extensive simulation studies. We also illustrate it in the
detection of exposures associated with drug-induced liver injuries in the
French pharmacovigilance database.
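
The simulate-then-calibrate idea described above can be sketched for the Gaussian linear model as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`entry_lambda`, `calibrate`, `sim_cal_pvalue`) are hypothetical. Several choices are not taken from the abstract: the model has no intercept, A is a non-empty strict subset of the columns, the entry value of the regularization parameter is approximated on a grid, the simulated residual is rescaled to the observed residual norm, and the p-value uses the usual (1 + count) / (N + 1) Monte Carlo estimate.

```python
import numpy as np

def entry_lambda(X, y, A, n_grid=100, sweeps=200, tol=1e-10):
    """Grid approximation of the largest lambda at which the Lasso
    (objective ||y - Xb||^2 / (2n) + lambda * ||b||_1) selects a variable
    outside A. Assumption of this sketch: a geometric grid is fine enough."""
    n, p = X.shape
    A = list(A)
    out = [j for j in range(p) if j not in A]
    XA = X[:, A]
    G = XA.T @ XA / n                      # Gram matrix of the columns in A
    c = XA.T @ y / n
    lam_max = np.max(np.abs(X.T @ y)) / n  # above this value the Lasso fit is empty
    b = np.zeros(len(A))
    for lam in np.geomspace(lam_max, lam_max * 1e-3, n_grid):
        # Coordinate descent for the Lasso restricted to A, warm-started.
        for _ in range(sweeps):
            b_old = b.copy()
            for k in range(len(A)):
                rho = c[k] - G[k] @ b + G[k, k] * b[k]
                b[k] = np.sign(rho) * max(abs(rho) - lam, 0.0) / G[k, k]
            if np.max(np.abs(b - b_old), initial=0.0) < tol:
                break
        # If an outside variable breaks the optimality condition, the full
        # Lasso solution at this lambda must include a variable outside A.
        grad = X[:, out].T @ (y - XA @ b) / n
        if np.max(np.abs(grad)) >= lam * (1 - 1e-9):
            return lam
    return 0.0  # no entry detected on the grid

def calibrate(X_A, y_obs, y_sim):
    """Shift and rescale y_sim so that its unpenalized least-squares fit on
    X_A has the same coefficients (and, by assumption of this sketch, the
    same residual norm) as the fit of y_obs."""
    coef_obs = np.linalg.lstsq(X_A, y_obs, rcond=None)[0]
    coef_sim = np.linalg.lstsq(X_A, y_sim, rcond=None)[0]
    res_obs = y_obs - X_A @ coef_obs
    res_sim = y_sim - X_A @ coef_sim
    scale = np.linalg.norm(res_obs) / np.linalg.norm(res_sim)
    return X_A @ coef_obs + scale * res_sim

def sim_cal_pvalue(X, y, A, n_sim=200, seed=0):
    """Monte Carlo estimate of the conditional p-value of the entry lambda."""
    rng = np.random.default_rng(seed)
    A = list(A)
    X_A = X[:, A]
    lam_obs = entry_lambda(X, y, A)
    lam_sim = np.array([
        entry_lambda(X, calibrate(X_A, y, rng.standard_normal(len(y))), A)
        for _ in range(n_sim)
    ])
    return (1 + np.sum(lam_sim >= lam_obs)) / (n_sim + 1)

# Hypothetical usage: synthetic data in which A = {0, 1} holds every active variable.
rng = np.random.default_rng(42)
X = rng.standard_normal((60, 8))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.standard_normal(60)
print(sim_cal_pvalue(X, y, A=[0, 1], n_sim=99))
```

Because the calibrated vectors only borrow the direction of the simulated residual, the noise variance used for simulation drops out in this sketch. The iterative stochastic calibration for generalized linear models mentioned in the abstract is not covered here.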