Pegah Golchian, Jan Kapar, David S Watson, Marvin N Wright
Handling missing values is a common challenge in biostatistical analyses, typically addressed by imputation methods. We propose a novel, fast, and easy-to-use imputation method called missing value imputation with adversarial random forests (MissARF), based on generative machine learning, that provides both single and multiple imputation. MissARF employs adversarial random forests (ARF) for density estimation and data synthesis. To impute a missing value of an observation, we condition on the non-missing values and sample from the conditional distribution estimated by ARF. Our experiments demonstrate that MissARF performs comparably to state-of-the-art single and multiple imputation methods in terms of imputation quality, while offering fast runtimes and no additional cost for multiple imputation.
Missing Value Imputation With Adversarial Random Forests-MissARF. Statistics in Medicine. 2026;45(3-5):e70379. doi:10.1002/sim.70379. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12871009/pdf/
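A minimal sketch of the conditional-sampling idea behind MissARF described above, assuming the interface of the CRAN arf package (adversarial_rf(), forde(), and forge() with an evidence argument for conditioning); the published method's handling of missingness during training and its multiple-imputation workflow may differ.

```r
# Sketch only: assumes the CRAN 'arf' package exposes adversarial_rf()/forde()/forge()
# with an 'evidence' argument; this is not the authors' reference implementation.
library(arf)

dat <- iris
dat$Sepal.Length[1] <- NA                       # introduce one missing value

arf_fit <- adversarial_rf(na.omit(dat))         # 1. fit an adversarial random forest
psi     <- forde(arf_fit, na.omit(dat))         # 2. estimate the density (FORDE)

# 3. condition on the observed values of the incomplete row and draw from the
#    estimated conditional distribution; several draws give multiple imputations
evidence <- dat[1, c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]
draws    <- forge(psi, n_synth = 5, evidence = evidence)
draws$Sepal.Length                              # 5 candidate imputations for the NA
```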
Confounding bias and selection bias are two major challenges in causal inference with observational data. While numerous methods have been developed to mitigate confounding bias, they often assume that the data are representative of the study population and ignore the potential selection bias introduced during data collection. In this paper, we propose a unified weighting framework, survey-weighted propensity score weighting, to simultaneously address both confounding and selection biases when the observational dataset is a probability survey sample from a finite population, which is itself viewed as a random sample from the target superpopulation. The proposed method yields a doubly robust inferential procedure for a class of population weighted average treatment effects. We further extend our results to non-probability observational data when the sampling mechanism is unknown but auxiliary information on the confounding variables is available from an external probability sample. We focus on practically important scenarios where the confounders are only partially observed in the external data. Our analysis reveals that the key variables in the external data are those related to both treatment effect heterogeneity and the selection mechanism. We also discuss how to combine auxiliary information from multiple reference probability samples. Monte Carlo simulations and an application to a real-world non-probability observational dataset demonstrate the superiority of our proposed methods over standard propensity score weighting approaches.
Wei Liang, Changbao Wu. Causal Inference With Survey Data: A Robust Framework for Propensity Score Weighting in Probability and Non-Probability Samples. Statistics in Medicine. 2026;45(3-5):e70420. doi:10.1002/sim.70420. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873465/pdf/
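As a rough illustration of the core weighting idea (a sketch, not the authors' doubly robust estimator), a Hájek-type survey-weighted inverse probability weighting estimator of a population average treatment effect combines the design weights d_i of the probability sample s with estimated propensity scores ê(X_i):

$$ \hat{\tau} \;=\; \frac{\sum_{i \in s} d_i A_i Y_i / \hat{e}(X_i)}{\sum_{i \in s} d_i A_i / \hat{e}(X_i)} \;-\; \frac{\sum_{i \in s} d_i (1 - A_i) Y_i / \{1 - \hat{e}(X_i)\}}{\sum_{i \in s} d_i (1 - A_i) / \{1 - \hat{e}(X_i)\}}, $$

where A_i is the treatment indicator and Y_i the outcome; the design weights correct for selection into the sample while the propensity weights correct for confounding.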
We have studied 21 435 unique randomized controlled trials (RCTs) from the Cochrane Database of Systematic Reviews (CDSR). Of these trials, 7224 (34%) have a continuous (numerical) outcome and 14 211 (66%) have a binary outcome. We find that trials with a binary outcome have larger sample sizes on average, but also larger standard errors and fewer statistically significant results. We conclude that researchers tend to increase the sample size to compensate for the low information content of binary outcomes, but not sufficiently. In many cases, the binary outcome is the result of dichotomization of a continuous outcome, which is sometimes referred to as "responder analysis". In those cases, the loss of information is avoidable. Burdening more participants than necessary is wasteful, costly, and unethical. We provide a method to convert a sample size calculation for the comparison of two proportions into one for the comparison of the means of the underlying continuous outcomes. This demonstrates how much the sample size could be reduced if the outcome were not dichotomized. We also provide a method to calculate the loss of information after dichotomization. We apply this method to all the trials from the CDSR with a binary outcome and estimate that, on average, only about 60% of the information is retained after dichotomization. We provide R code and a shiny app (https://vanzwet.shinyapps.io/info_loss/) to perform these calculations. We hope that quantifying the loss of information will discourage researchers from dichotomizing continuous outcomes. Instead, we recommend they "model continuously but interpret dichotomously". For example, they might present the "percentage achieving clinically meaningful improvement" derived from a continuous analysis rather than by dichotomizing the raw data.
Erik W van Zwet, Frank E Harrell, Stephen J Senn. An Empirical Assessment of the Cost of Dichotomization of the Outcome of Clinical Trials. Statistics in Medicine. 2026;45(3-5):e70402. doi:10.1002/sim.70402. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12875020/pdf/
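The sample size comparison can be illustrated in base R under a simple latent-normal (probit) assumption; this is a hypothetical worked example, not the authors' exact conversion formula or shiny app.

```r
# If a continuous outcome is dichotomized at a cutoff, "response" proportions p1 and p2
# correspond (under a latent-normal model) to a standardized mean difference
# delta = qnorm(p2) - qnorm(p1). Compare the per-group sample sizes needed either way.
p1 <- 0.30; p2 <- 0.45                          # assumed response proportions
delta <- qnorm(p2) - qnorm(p1)                  # implied shift on the latent scale (in SDs)

n_binary     <- power.prop.test(p1 = p1, p2 = p2, power = 0.9)$n
n_continuous <- power.t.test(delta = delta, sd = 1, power = 0.9)$n

c(binary     = ceiling(n_binary),               # per group, dichotomized outcome
  continuous = ceiling(n_continuous),           # per group, continuous outcome
  inflation  = n_binary / n_continuous)         # cost of dichotomizing
```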
Lillian Rountree, Lauren Zimmermann, Lucy Teed, Daniel M Weinberger, Bhramar Mukherjee
Excess death estimation, defined as the difference between the observed and expected death counts, is a popular technique for assessing the overall death toll of a public health crisis. The expected death count is defined as the expected number of deaths in the counterfactual scenario where prevailing conditions continued and the public health crisis did not occur. While excess death is frequently obtained by estimating the expected number of deaths and subtracting it from the observed number, some methods calculate this difference directly, based on historic mortality data and direct predictors of excess deaths. This tutorial provides guidance to researchers on the application of four popular methods for estimating excess death: the World Health Organization's Bayesian model; The Economist's gradient boosting algorithm; Acosta and Irizarry's quasi-Poisson model; and the Institute for Health Metrics and Evaluation's ensemble model. We begin with explanations of the mathematical formulation of each method and then demonstrate how to code each method in R, applying the code in a case study estimating excess death in the United States for the post-pandemic period of 2022-2024. An additional simulation study estimating excess death for three different scenarios and three different extrapolation periods further demonstrates general trends in performance across methods; together, these two studies show how the methods' estimates and their accuracy vary widely depending on the choice of input covariates, reference period, extrapolation period, and tuning parameters. Caution should be exercised when extrapolating to estimate excess death, particularly when the reference period of pre-event conditions is temporally distant (> 5 years) from the period of interest. Rather than committing to a single method under a single setting, we advocate using multiple excess death methods in tandem, comparing and synthesizing their results, and conducting thorough sensitivity analyses as best practice for estimating excess death for a period of interest. We also call for more detailed simulation studies and benchmark datasets to better understand the accuracy and comparative performance of methods for estimating excess death.
A Tutorial on Implementing Statistical Methods for Estimating Excess Death With a Case Study and Simulations on Estimating Excess Death in the Post-COVID-19 United States. Statistics in Medicine. 2026;45(3-5):e70396. doi:10.1002/sim.70396.
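A minimal sketch of the quasi-Poisson approach (in the spirit of the Acosta-Irizarry model discussed in the tutorial): fit expected deaths on a pre-event reference period with a trend and seasonal terms, extrapolate to the period of interest, and take excess as observed minus expected. The weekly data below are simulated purely for illustration.

```r
# Sketch of a quasi-Poisson expected-deaths model; simulated weekly data for illustration.
set.seed(1)
week   <- 1:(52 * 6)                                   # 5 reference years + 1 target year
mu     <- exp(7 + 0.001 * week + 0.10 * cos(2 * pi * week / 52))
dat    <- data.frame(week   = week,
                     deaths = rpois(length(week), mu),
                     sin52  = sin(2 * pi * week / 52),
                     cos52  = cos(2 * pi * week / 52))

ref    <- subset(dat, week <= 52 * 5)                  # reference (pre-event) period
target <- subset(dat, week >  52 * 5)                  # extrapolation period of interest

fit      <- glm(deaths ~ week + sin52 + cos52, family = quasipoisson, data = ref)
expected <- predict(fit, newdata = target, type = "response")
sum(target$deaths - expected)                          # point estimate of excess deaths
```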
Arce Domingo-Relloso, Yuchen Zhang, Ziqing Wang, Astrid M Suchy-Dicey, Dedra S Buchwald, Ana Navas-Acien, Joel Schwartz, Kiros Berhane, Brent A Coull, Linda Valeri
Not accounting for competing events in survival analysis can lead to biased estimates, as individuals who die from other causes do not have the opportunity to develop the event of interest. Formal definitions and considerations for causal effects in the presence of competing risks have been published, but not for the mediation analysis setting in which the exposure is not separable and both the outcome and the mediator are nonterminal events. We propose, for the first time, an approach based on the path-specific effects framework to account for competing risks in longitudinal mediation analysis with time-to-event outcomes. We do so by treating the pathway through the competing event as another mediator, which is nested within our longitudinal mediator of interest. We provide a theoretical formulation and related definitions of the effects of interest based on the mediational g-formula, as well as a detailed description of the algorithm. We also present a simulation study and an application of our algorithm to data from the Strong Heart Study, a prospective cohort of American Indian adults. In this application, we evaluated the mediating role of the blood pressure trajectory (measured at three visits) in the association of arsenic and cadmium with time to cardiovascular disease, accounting for the competing risk of death. Identifying the effects through different paths enables us to evaluate more transparently the impact of metals on the outcome of interest, including the part that operates through the competing event.
A Path-Specific Effect Approach to Mediation Analysis With Time-Varying Mediators and Time-to-Event Outcomes Accounting for Competing Risks. Statistics in Medicine. 2026;45(3-5):e70425. doi:10.1002/sim.70425. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12873459/pdf/
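For orientation, the single-time-point mediation (g-)formula that the mediational g-formula generalizes is shown below; the article's extension to time-varying mediators, a nested competing-event path, and time-to-event outcomes is not reproduced here. With exposure A, mediator M, baseline confounders C, and counterfactual outcome Y_{a,m},

$$ E\left[Y_{a, M_{a'}}\right] \;=\; \sum_{c}\sum_{m} E\left[Y \mid A = a, M = m, C = c\right]\, P\left(M = m \mid A = a', C = c\right)\, P\left(C = c\right), $$

so that, for a binary exposure, the total effect E[Y_{1,M_1}] - E[Y_{0,M_0}] decomposes into path-specific components such as E[Y_{1,M_0}] - E[Y_{0,M_0}] (not through M) and E[Y_{1,M_1}] - E[Y_{1,M_0}] (through M).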
Modeling prognosis has unique significance in cancer research. For this purpose, omics data have been routinely used. In a series of recent studies, pathological imaging data derived from biopsies have also been shown to be informative. Motivated by the complementary information contained in omics and pathological imaging data, we examine integrating them under a Cox modeling framework. The two types of data have distinct properties: for omics variables, which are more actionable and demand stronger interpretability, we model their effects in a parametric way; whereas for pathological imaging features, which are not actionable and do not have lucid interpretations, we model their effects in a nonparametric way for better flexibility and prediction performance. Specifically, we adopt deep neural networks (DNNs) for nonparametric estimation, considering their advantages over regression models in accommodating nonlinearity and providing better prediction. As both omics and pathological imaging data are high-dimensional and are expected to contain noise, we propose applying penalization for selecting relevant variables and regularizing estimation. Different from some existing studies, we pay particular attention to the overlapping information contained in the two types of data. Numerical investigations are carefully carried out. In the analysis of TCGA data, sensible selection and superior prediction performance are observed, demonstrating the practical utility of the proposed analysis.
Jingmao Li, Shuangge Ma. Integrating Omics and Pathological Imaging Data for Cancer Prognosis via a Deep Neural Network-Based Cox Model. Statistics in Medicine. 2026;45(3-5):e70435. doi:10.1002/sim.70435.
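One natural way to write the model described above, as a sketch in our own notation (the paper's exact penalties and network architecture are not specified here): with omics covariates x_i entering parametrically and pathological imaging features z_i entering through a deep neural network g,

$$ \lambda(t \mid x_i, z_i) \;=\; \lambda_0(t)\, \exp\left\{ x_i^{\top} \beta + g(z_i) \right\}, $$

with β kept sparse and interpretable via a penalty and g estimated nonparametrically; both are obtained by maximizing a penalized log partial likelihood.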
In a weighted logrank test, such as the Harrington-Fleming test and the Tarone-Ware test, predetermined weights are used to emphasize early, middle, or late differences in survival distributions to maximize the test's power. The optimal weight function under an alternative, which depends on the true hazard functions of the groups being compared, has been derived. However, that optimal weight function cannot be directly used to construct an optimal test since the resulting test does not properly control the type I error rate. We further show that the power of a weighted logrank test with proper type I error control has an upper bound that cannot be achieved. Based on the theory, we propose a weighted logrank test that self-adaptively determines an "optimal" weight function. The new test is more powerful than existing standard and weighted logrank tests while maintaining proper type I error rates by tuning a parameter. We demonstrate through extensive simulation studies that the proposed test is both powerful and highly robust in a wide range of scenarios. The method is illustrated with data from several clinical trials in lung cancer.
Zhiguo Li, Xiaofei Wang. A Powerful and Self-Adaptive Weighted Logrank Test. Statistics in Medicine. 2026;45(3-5):e70390. doi:10.1002/sim.70390.
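For context, a self-contained base R implementation of a standard weighted logrank statistic with Fleming-Harrington G(rho, gamma) weights w(t) = S(t-)^rho {1 - S(t-)}^gamma, where rho > 0 emphasizes early differences and gamma > 0 late differences. This is the classical fixed-weight test, not the authors' self-adaptive procedure.

```r
# Two-sample weighted logrank test with Fleming-Harrington G(rho, gamma) weights.
weighted_logrank <- function(time, status, group, rho = 0, gamma = 0) {
  o <- order(time); time <- time[o]; status <- status[o]; group <- group[o]
  U <- 0; V <- 0
  surv <- 1                                  # pooled Kaplan-Meier, used as S(t-)
  for (t in unique(time[status == 1])) {     # distinct event times, ascending
    at_risk <- time >= t
    n_all <- sum(at_risk)
    n1    <- sum(at_risk & group == 1)
    d     <- sum(time == t & status == 1)
    d1    <- sum(time == t & status == 1 & group == 1)
    w     <- surv^rho * (1 - surv)^gamma     # weight evaluated just before t
    U     <- U + w * (d1 - d * n1 / n_all)   # observed minus expected events in group 1
    if (n_all > 1)                           # hypergeometric variance contribution
      V <- V + w^2 * d * (n1 / n_all) * (1 - n1 / n_all) * (n_all - d) / (n_all - 1)
    surv <- surv * (1 - d / n_all)           # update pooled KM after time t
  }
  z <- U / sqrt(V)
  c(z = z, p.value = 2 * pnorm(-abs(z)))
}

# Example with simulated data, weighted toward late differences:
set.seed(2)
grp <- rep(0:1, each = 100)
tt  <- rexp(200, rate = ifelse(grp == 1, 0.08, 0.12))
weighted_logrank(pmin(tt, 15), as.numeric(tt <= 15), grp, rho = 0, gamma = 1)
```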
Sparse regression problems, where the goal is to identify a small set of relevant predictors, often require modeling not only main effects but also meaningful interactions with other modifying variables. While the pliable lasso has emerged as a powerful frequentist tool for modeling such interactions under strong heredity constraints, it lacks a natural framework for uncertainty quantification and the incorporation of prior knowledge. In this paper, we propose a Bayesian pliable lasso that extends this approach by placing sparsity-inducing priors, such as the horseshoe, on both main and interaction effects. The hierarchical prior structure enforces heredity constraints while adaptively shrinking irrelevant coefficients and allowing important effects to persist. We extend this framework to generalized linear models and develop a tailored approach to handle missing responses. To facilitate posterior inference, we develop an efficient Gibbs sampling algorithm based on a reparameterization of the horseshoe prior. Our Bayesian framework yields sparse, interpretable interaction structures and principled measures of uncertainty. Through simulations and real-data studies, we demonstrate its advantages over existing methods in recovering complex interaction patterns under both complete and incomplete data. Our method is implemented in the package hspliable, available on GitHub: https://github.com/tienmt/hspliable.
The Tien Mai. Bayesian Pliable Lasso With Horseshoe Prior for Interaction Effects in GLMs With Missing Responses. Statistics in Medicine. 2026;45(3-5):e70406. doi:10.1002/sim.70406.
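For reference, the pliable lasso mean model of Tibshirani and Friedman on which the abstract builds, together with the standard horseshoe form; the paper's specific heredity-enforcing hierarchy and its GLM and missing-response extensions are not reproduced here. With main-effect covariates X_1, ..., X_p and modifying variables Z,

$$ \hat{y} \;=\; \beta_0 \mathbf{1} + Z\theta_0 + \sum_{j=1}^{p} X_j \left( \beta_j \mathbf{1} + Z \theta_j \right), \qquad \beta_j \mid \lambda_j, \tau \sim \mathcal{N}\!\left(0, \lambda_j^2 \tau^2\right), \quad \lambda_j \sim \mathcal{C}^{+}(0, 1), \quad \tau \sim \mathcal{C}^{+}(0, 1), $$

with strong heredity meaning that an interaction vector θ_j may be nonzero only when the corresponding main effect β_j is.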
Misclassification Simulation-Extrapolation (MC-SIMEX) is an established method to correct for misclassification in binary covariates in a model. It involves a simulation component, which generates pseudo-datasets with added degrees of misclassification in the binary covariate, and an extrapolation component, which models the covariate's regression coefficients obtained at each level of misclassification using a quadratic function. This quadratic function is then used to extrapolate the covariate's regression coefficients to a point of "no error" in the classification of the binary covariate in question. However, the true extrapolation function is usually not known beforehand, so the quadratic serves only as an approximation. In this article, we propose an innovative method that uses the exact (rather than approximated) extrapolation function, obtained through a derived relationship between the naïve regression coefficient estimates and the true coefficients in generalized linear models. Simulation studies are conducted to study and compare the numerical properties of the resulting estimator with those of the original MC-SIMEX estimator. A real data analysis using colon cancer data from the MSKCC cancer registry is also provided.
Varadan Sevilimedu, Lili Yu. An Improved Misclassification Simulation Extrapolation (MC-SIMEX) Algorithm. Statistics in Medicine. 2026;45(3-5):e70418. doi:10.1002/sim.70418.
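A minimal sketch of the original MC-SIMEX algorithm with the usual quadratic extrapolant, against which the article's exact-extrapolation estimator is compared; the data, misclassification probability, and tuning choices below are hypothetical.

```r
# Original MC-SIMEX for a misclassified binary covariate in logistic regression:
# add extra misclassification Pi^lambda to the observed covariate, refit the model,
# and extrapolate the coefficient back to lambda = -1 with a quadratic function.
set.seed(3)
n   <- 2000
x   <- rbinom(n, 1, 0.4)                          # true binary covariate (unobserved)
y   <- rbinom(n, 1, plogis(-1 + 1.2 * x))         # binary outcome
pi0 <- 0.15                                       # assumed symmetric misclassification prob.
w   <- ifelse(rbinom(n, 1, pi0) == 1, 1 - x, x)   # observed, misclassified covariate

Pi <- matrix(c(1 - pi0, pi0, pi0, 1 - pi0), 2, 2) # misclassification matrix
Pi_power <- function(P, lambda) {                 # real matrix power via eigendecomposition
  e <- eigen(P)
  e$vectors %*% diag(e$values^lambda) %*% solve(e$vectors)
}

lambdas <- seq(0, 2, by = 0.5)
coefs <- sapply(lambdas, function(l) {
  P <- Pi_power(Pi, l)
  mean(replicate(20, {                            # B = 20 pseudo-datasets per lambda
    flip <- ifelse(w == 1, 1 - P[2, 2], 1 - P[1, 1])
    w_l  <- ifelse(rbinom(n, 1, flip) == 1, 1 - w, w)
    coef(glm(y ~ w_l, family = binomial))["w_l"]
  }))
})

extrap <- lm(coefs ~ lambdas + I(lambdas^2))      # quadratic extrapolation function
predict(extrap, newdata = data.frame(lambdas = -1))  # MC-SIMEX estimate at lambda = -1
```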
A popular approach to growth reference centile estimation is the LMS (Lambda-Mu-Sigma) method, which assumes a parametric distribution for the response variable Y and fits the location, scale and shape parameters of the distribution of Y as smooth functions of an explanatory variable X. This article provides two methods, transformation and adaptive smoothing, for improving the centile estimation when there is high curvature (i.e., rapid change in slope) with respect to X in one or more of the Y distribution parameters. In general, high curvature is reduced (i.e., attenuated or dampened) by smoothing. In the first method, X is transformed to a variable T to reduce this high curvature, and the Y distribution parameters are fitted as smooth functions of T. Three different transformations of X are described. In the second method, the Y distribution parameters are adaptively smoothed against X by allowing the smoothing parameter itself to vary continuously with X. Simulations are used to compare the performance of the two methods. Three examples show how the process can lead to substantially smoother and better fitting centiles.
Improved Centile Estimation by Transformation And/Or Adaptive Smoothing of the Explanatory Variable. Statistics in Medicine. 2026;45(3-5):e70414. doi:10.1002/sim.70414. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12874224/pdf/
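A minimal sketch of the transformation idea, assuming the gamlss package interface (gamlss() with pb() P-spline smoothers, the BCCG "LMS" family, GAIC(), and centiles()); the power transform used below is a hypothetical choice, and the article's three specific transformations and its adaptive-smoothing method are not reproduced.

```r
# Sketch only: fit LMS-type centiles on the original x and on a transformed t = x^0.35,
# then compare fits; assumes the gamlss package API described in the lead-in.
library(gamlss)
set.seed(4)
x <- runif(1000, 0.1, 20)                          # e.g., age, with rapid early change
y <- rlnorm(1000, meanlog = log(50 + 30 * log1p(x)), sdlog = 0.08)
d <- data.frame(y = y, x = x, tx = x^0.35)         # hypothetical curvature-reducing transform

m_raw <- gamlss(y ~ pb(x),  sigma.formula = ~ pb(x),  nu.formula = ~ pb(x),
                family = BCCG, data = d)
m_trn <- gamlss(y ~ pb(tx), sigma.formula = ~ pb(tx), nu.formula = ~ pb(tx),
                family = BCCG, data = d)

GAIC(m_raw, m_trn)                                 # compare global fit
centiles(m_trn, xvar = d$x)                        # centiles plotted against the original x
```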