Rheanna M. Mainzer, Cattram D. Nguyen, John B. Carlin, Margarita Moreno-Betancur, Ian R. White, Katherine J. Lee
{"title":"A comparison of strategies for selecting auxiliary variables for multiple imputation","authors":"Rheanna M. Mainzer, Cattram D. Nguyen, John B. Carlin, Margarita Moreno-Betancur, Ian R. White, Katherine J. Lee","doi":"10.1002/bimj.202200291","DOIUrl":null,"url":null,"abstract":"<p>Multiple imputation (MI) is a popular method for handling missing data. Auxiliary variables can be added to the imputation model(s) to improve MI estimates. However, the choice of which auxiliary variables to include is not always straightforward. Several data-driven auxiliary variable selection strategies have been proposed, but there has been limited evaluation of their performance. Using a simulation study we evaluated the performance of eight auxiliary variable selection strategies: (1, 2) two versions of selection based on correlations in the observed data; (3) selection using hypothesis tests of the “missing completely at random” assumption; (4) replacing auxiliary variables with their principal components; (5, 6) forward and forward stepwise selection; (7) forward selection based on the estimated fraction of missing information; and (8) selection via the least absolute shrinkage and selection operator (LASSO). A complete case analysis and an MI analysis using all auxiliary variables (the “full model”) were included for comparison. We also applied all strategies to a motivating case study. The full model outperformed all auxiliary variable selection strategies in the simulation study, with the LASSO strategy the best performing auxiliary variable selection strategy overall. All MI analysis strategies that we were able to apply to the case study led to similar estimates, although computational time was substantially reduced when variable selection was employed. This study provides further support for adopting an inclusive auxiliary variable strategy where possible. Auxiliary variable selection using the LASSO may be a promising alternative when the full model fails or is too burdensome.</p>","PeriodicalId":55360,"journal":{"name":"Biometrical Journal","volume":null,"pages":null},"PeriodicalIF":1.3000,"publicationDate":"2024-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/bimj.202200291","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biometrical Journal","FirstCategoryId":"99","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/bimj.202200291","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Multiple imputation (MI) is a popular method for handling missing data. Auxiliary variables can be added to the imputation model(s) to improve MI estimates. However, the choice of which auxiliary variables to include is not always straightforward. Several data-driven auxiliary variable selection strategies have been proposed, but there has been limited evaluation of their performance. Using a simulation study we evaluated the performance of eight auxiliary variable selection strategies: (1, 2) two versions of selection based on correlations in the observed data; (3) selection using hypothesis tests of the “missing completely at random” assumption; (4) replacing auxiliary variables with their principal components; (5, 6) forward and forward stepwise selection; (7) forward selection based on the estimated fraction of missing information; and (8) selection via the least absolute shrinkage and selection operator (LASSO). A complete case analysis and an MI analysis using all auxiliary variables (the “full model”) were included for comparison. We also applied all strategies to a motivating case study. The full model outperformed all auxiliary variable selection strategies in the simulation study, with the LASSO strategy the best performing auxiliary variable selection strategy overall. All MI analysis strategies that we were able to apply to the case study led to similar estimates, although computational time was substantially reduced when variable selection was employed. This study provides further support for adopting an inclusive auxiliary variable strategy where possible. Auxiliary variable selection using the LASSO may be a promising alternative when the full model fails or is too burdensome.
期刊介绍:
Biometrical Journal publishes papers on statistical methods and their applications in life sciences including medicine, environmental sciences and agriculture. Methodological developments should be motivated by an interesting and relevant problem from these areas. Ideally the manuscript should include a description of the problem and a section detailing the application of the new methodology to the problem. Case studies, review articles and letters to the editors are also welcome. Papers containing only extensive mathematical theory are not suitable for publication in Biometrical Journal.