Closed-form confidence intervals for saved time using summary statistics in Alzheimer's disease studies.
Guogen Shan, Yahui Zhang, Guoqiao Wang, Samuel S Wu, Aidong A Ding
Pub Date: 2025-08-01 | Epub Date: 2025-07-04 | DOI: 10.1177/09622802251348796 | Statistical Methods in Medical Research, pp. 1605-1616

Saved time is used in Alzheimer's disease (AD) trials as an easily interpretable measure of treatment benefit for communicating with patients, family members, and caregivers. The projection approach is frequently applied to estimate saved time and its confidence interval (CI) from the placebo or treatment disease progression curves. However, the standard error of saved time estimated by existing methods does not account for the correlation between outcomes, and no closed-form CI has been available for researchers to use in practice. To fill this critical gap, we derive closed-form CIs for saved time estimated from the placebo or treatment disease progression curves. We compare them with regard to coverage probability and interval width under various disease progression patterns commonly observed in AD symptomatic therapy and disease-modifying therapy trials. Data from the phase 3 donanemab trials are used to illustrate the application of the new CI methods.
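The projection approach can be made concrete in its simplest (linear) special case: fit the placebo progression curve, project the treatment arm's end-of-trial mean back onto it, and take the time difference. The sketch below is a generic illustration of that idea, not the paper's method; the function name and data are hypothetical, and a linear placebo curve is assumed.

```python
import numpy as np

def saved_time_linear(t_points, placebo_means, trt_end_mean, t_end):
    """Projection estimate of saved time: find the earlier time at which
    the placebo curve reached the treatment arm's end-of-trial mean,
    assuming linear placebo progression (illustrative sketch only)."""
    slope, intercept = np.polyfit(t_points, placebo_means, 1)
    t_star = (trt_end_mean - intercept) / slope  # placebo time with same score
    return t_end - t_star

# Placebo worsens by 2 points/year; at 1.5 years the treated arm's mean
# score (12) matches the placebo curve at 1.0 year -> 0.5 years saved.
t = np.array([0.0, 0.5, 1.0, 1.5])
placebo_means = 10 + 2 * t
print(saved_time_linear(t, placebo_means, trt_end_mean=12.0, t_end=1.5))
```

The paper's contribution is the closed-form CI around this point estimate, which the sketch does not attempt.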
Multiple imputation for systematically missing effect modifiers in individual participant data meta-analysis.
Robert Thiesmeier, Scott M Hofer, Nicola Orsini
Pub Date: 2025-08-01 | Epub Date: 2025-06-20 | DOI: 10.1177/09622802251348800 | Statistical Methods in Medical Research, pp. 1590-1604 | Open access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365359/pdf/

Individual participant data (IPD) meta-analysis of randomised trials is a crucial method for detecting and investigating effect modifications in medical research. However, few studies have explored scenarios involving systematically missing data on discrete effect modifiers (EMs) in IPD meta-analyses with a limited number of trials. This simulation study examines the impact of systematic missing values in IPD meta-analysis using a two-stage imputation method. We simulated IPD meta-analyses of randomised trials with multiple studies that had systematically missing data on the EM. A multivariable Weibull survival model was specified to assess beneficial (hazard ratio (HR) = 0.8), null (HR = 1.0), and harmful (HR = 1.2) treatment effects for low, medium, and high levels of an EM, respectively. Bias and coverage were evaluated using Monte Carlo simulations. The absolute bias for common and heterogeneous effect IPD meta-analyses was less than 0.016 and 0.007, respectively, with coverage close to its nominal value across all EM levels. An uncongenial imputation model resulted in larger bias, even when the proportion of studies with systematically missing data on the EM was small. Overall, the proposed two-stage imputation approach provided unbiased estimates with improved precision. The assumptions and limitations of this approach are discussed.
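The data-generating step of such a simulation is straightforward to sketch. Under a Weibull proportional-hazards model, survival times can be drawn by inverse transform from the cumulative hazard H(t | x) = scale · t^shape · exp(βx). The snippet below is a minimal illustration of that step only (function name and parameter values are assumptions, not taken from the paper), with a binary treatment indicator and HR = exp(β) = 0.8.

```python
import numpy as np

def simulate_weibull_pht(n, beta, shape=1.5, scale=0.1, seed=0):
    """Simulate survival times under a Weibull proportional-hazards model.

    Cumulative hazard H(t|x) = scale * t**shape * exp(beta * x), so the
    inverse-transform draw is t = (-log(U) / (scale*exp(beta*x)))**(1/shape).
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n)           # 1 = treated
    u = rng.uniform(size=n)
    t = (-np.log(u) / (scale * np.exp(beta * x))) ** (1.0 / shape)
    return x, t

# HR = 0.8 (beneficial): treated subjects survive longer on average
x, t = simulate_weibull_pht(50_000, beta=np.log(0.8))
print(t[x == 1].mean() > t[x == 0].mean())  # True
```

The paper's simulations additionally involve EM levels, censoring, and the two-stage imputation itself, which this sketch omits.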
Bayesian inference for nonlinear mixed-effects location scale and interval-censoring cure-survival models: An application to pregnancy miscarriage.
Danilo Alvares, Cristian Meza, Rolando De la Cruz
Pub Date: 2025-08-01 | Epub Date: 2025-05-29 | DOI: 10.1177/09622802251345485 | Statistical Methods in Medical Research, pp. 1525-1533 | Open access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365357/pdf/

Motivated by a pregnancy miscarriage study, we propose a Bayesian joint model for longitudinal and time-to-event outcomes that takes into account different complexities of the problem. In particular, the longitudinal process is modeled by means of a nonlinear specification with subject-specific error variance. In addition, the exact time of fetal death is unknown, and a subgroup of women is not susceptible to miscarriage. Hence, we model the survival process via a mixture cure model for interval-censored data. Finally, both processes are linked through the subject-specific longitudinal mean and variance. A simulation study is conducted in order to validate our joint model. In the real application, we use individual weighted and Cox-Snell residuals to assess the goodness-of-fit of our proposal versus a joint model that shares only the subject-specific longitudinal mean (standard approach). In addition, the leave-one-out cross-validation criterion is applied to compare the predictive ability of both models.
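The mixture cure component has a simple population-level form: a cured fraction π never experiences the event, so the population survival function is S(t) = π + (1 − π)·S_u(t), where S_u is the survival function of the uncured. A minimal sketch of that identity (with an assumed Weibull S_u; names and parameters are illustrative, not the paper's):

```python
import numpy as np

def mixture_cure_survival(t, cure_prob, shape, scale):
    """Population survival under a mixture cure model: a cured fraction
    never experiences the event; the uncured follow a Weibull(shape, scale).
    S(t) = pi + (1 - pi) * exp(-(t/scale)**shape)."""
    s_uncured = np.exp(-(np.asarray(t, float) / scale) ** shape)
    return cure_prob + (1 - cure_prob) * s_uncured

# The survival curve starts at 1 and plateaus at the cure fraction
print(float(mixture_cure_survival(0.0, 0.3, 1.2, 10.0)))   # 1.0
```

The characteristic plateau at π (rather than decay to zero) is what distinguishes a cure model from a standard survival model.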
Strategies to boost statistical efficiency in randomized oncology trials with primary time-to-event endpoints.
Alan D Hutson, Han Yu
Pub Date: 2025-08-01 | Epub Date: 2025-06-23 | DOI: 10.1177/09622802251343599 | Statistical Methods in Medical Research, pp. 1534-1552

Oncology clinical trials are increasingly expensive, necessitating efforts to streamline phase II and III trials to reduce costs and expedite treatment delivery. Randomization is often impractical in oncology trials due to small sample sizes and limited statistical power, leading to biased inferences. The FDA has recently published guidance documents encouraging the use of prognostic baseline measures to improve the precision of inferences around treatment effects. To address this, we propose an extension of Rosenbaum's exact testing method incorporating a variant of martingale residuals for right-censored data. This method can dramatically improve the statistical power of the test comparing treatment arms given time-to-event endpoints, compared to the standard log-rank test. Additionally, the modification of the martingale residual provides a straightforward metric for summarizing treatment effect by quantifying the expected events per treatment arm at each time point. This approach is illustrated using a phase II clinical trial in small cell lung cancer.
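For context, the ordinary (null-model) martingale residual is r_i = δ_i − Ĥ(t_i): the observed event indicator minus the Nelson-Aalen cumulative-hazard estimate at the subject's follow-up time. The sketch below computes these residuals from scratch; it is a generic textbook version (no tie handling), not the paper's modified variant.

```python
import numpy as np

def martingale_residuals(time, event):
    """Null-model martingale residuals r_i = delta_i - H(t_i), with H the
    Nelson-Aalen cumulative-hazard estimate (no tie handling; illustrative)."""
    time = np.asarray(time, float)
    event = np.asarray(event, int)
    order = np.argsort(time)
    n = len(time)
    at_risk = n - np.arange(n)               # risk-set size at each sorted time
    increments = event[order] / at_risk      # Nelson-Aalen jump dH
    H = np.empty(n)
    H[order] = np.cumsum(increments)         # H(t_i), mapped back to input order
    return event - H

r = martingale_residuals([5, 2, 8, 3, 9, 1], [1, 0, 1, 1, 0, 1])
print(abs(r.sum()) < 1e-9)  # True: martingale residuals sum to zero
```

Subjects who fail earlier than expected get residuals near +1; long-censored subjects get negative residuals, which is what makes the residual usable as a per-arm event summary.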
Reproducible feature selection in heterogeneous multicenter datasets via sign-consistency criteria.
Xun Zhao, Yalu Ping
Pub Date: 2025-07-01 | Epub Date: 2025-05-14 | DOI: 10.1177/09622802251338375 | Statistical Methods in Medical Research, pp. 1328-1341

The identification of risk features associated with disease plays a crucial role in biomedical fields. These features are often used to provide evidence for clinical decision-making. However, in the presence of between-center heterogeneity, covariate effects across data centers may exhibit inconsistent directions, making feature selection challenging. In this work, we propose a novel framework to select reproducible risk features whose underlying effects are consistent across different centers. We quantify feature reproducibility based on a sign-consistency criterion, which tolerates an acceptable level of heterogeneity in effect sizes while ensuring reasonable similarity of reproducible signals. Compared with existing feature selection methods, our proposed method effectively protects data privacy and does not rely on the assumption of data homogeneity. Extensive simulations demonstrate that the proposed method has greater power than existing methods. We apply the proposed approach to data from the China Health and Retirement Longitudinal Study (CHARLS) and identify nine important risk factors that show reproducible associations with depression.
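The sign-consistency idea can be caricatured in a few lines: fit a model per center, share only the coefficient signs (which is also what keeps individual-level data private), and retain features whose signs agree across enough centers. This is a deliberately simplified stand-in for the paper's criterion, using per-center OLS; all names and thresholds are assumptions.

```python
import numpy as np

def sign_consistent_features(center_data, threshold=1.0):
    """Keep features whose per-center OLS coefficient signs agree in at
    least `threshold` fraction of centers (1.0 = all centers agree).
    Only signs are pooled, so no individual-level data is shared."""
    signs = []
    for X, y in center_data:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        signs.append(np.sign(beta))
    agree = np.abs(np.array(signs).sum(axis=0)) / len(center_data)
    return np.where(agree >= threshold)[0]

rng = np.random.default_rng(1)
data = []
for s in (+1, -1, +1):                      # center-specific sign flip on feature 1
    X = rng.normal(size=(200, 2))
    y = 1.0 * X[:, 0] + s * 0.8 * X[:, 1] + 0.5 * rng.normal(size=200)
    data.append((X, y))
print(sign_consistent_features(data))       # feature 0 only
```

Feature 0 has the same direction in every center and survives; feature 1 flips sign between centers and is discarded, despite sizable effects everywhere.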
Fast leave-one-cluster-out cross-validation using clustered network information criterion.
Jiaxing Qiu, Douglas E Lake, Pavel Chernyavskiy, Teague R Henry
Pub Date: 2025-07-01 | Epub Date: 2025-06-19 | DOI: 10.1177/09622802251345486 | Statistical Methods in Medical Research, pp. 1413-1430

For prediction models developed on clustered data that do not account for cluster heterogeneity in model parameterization, it is crucial to use cluster-based validation to assess model generalizability on unseen clusters. This article introduces a clustered estimator of the network information criterion (NIC) to approximate leave-one-cluster-out deviance for standard prediction models with twice-differentiable log-likelihood functions. The clustered NIC (CNIC) serves as a fast alternative to cluster-based cross-validation. Stone proved that the Akaike information criterion (AIC) is asymptotically equivalent to leave-one-observation-out cross-validation for true parametric models with independent and identically distributed observations. Ripley noted that the NIC, derived from Stone's proof, is a better approximation when the model is misspecified. For clustered data, we derive the CNIC by substituting the Fisher information matrix in the NIC with a clustering-adjusted estimator. The CNIC imposes a greater penalty when the data exhibit stronger clustering, thereby better preventing over-parameterization. In a simulation study and an empirical example, we used standard regression to develop prediction models for clustered data with Gaussian or binomial responses. Compared to the commonly used AIC and Bayesian information criterion for standard regression, the CNIC provides a much more accurate approximation to leave-one-cluster-out deviance and results in more accurate model size and variable selection, as determined by cluster-based cross-validation, especially when the data exhibit strong clustering.
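The quantity the CNIC is built to approximate, leave-one-cluster-out deviance, is itself easy to compute directly when the model is cheap to refit. A minimal sketch for a Gaussian linear model (function name is illustrative; squared error stands in for deviance up to a constant):

```python
import numpy as np

def loco_deviance(X, y, groups):
    """Leave-one-cluster-out deviance for a Gaussian linear model: refit
    OLS with each cluster held out and accumulate squared prediction
    error on that cluster (deviance up to a constant)."""
    dev = 0.0
    for g in np.unique(groups):
        train, test = groups != g, groups == g
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[test] - X[test] @ beta
        dev += float(resid @ resid)
    return dev

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(5), 40)        # 5 clusters of 40 observations
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = 2.0 + 1.5 * X[:, 1] + rng.normal(size=200)
print(loco_deviance(X, y, groups) > 0)      # True
```

The CNIC's appeal is replacing this refit-per-cluster loop with a single fit plus a clustering-adjusted penalty, which matters when refitting is expensive or clusters are numerous.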
A model-free phase I/II dose optimization design for immunotherapy trials.
Yingjie Qiu, Mengyi Lu, Yan Han, Wenxian Zhou, Yi Zhao, Leng Han, Yong Zang
Pub Date: 2025-07-01 | Epub Date: 2025-05-15 | DOI: 10.1177/09622802251340246 | Statistical Methods in Medical Research, pp. 1442-1458

We present a model-free phase I/II clinical trial design, referred to as the UFO design, to optimize the dose of immunotherapy by jointly modeling toxicity, efficacy, and immune response outcomes. Instead of relying on complex parametric modeling approaches, we propose a model-free approach that uses the inherent correlations among different types of outcomes in immunotherapy and the constrained dose-outcome order to facilitate information sharing across different doses. This approach ensures the efficiency and transparency of the UFO design for implementation in clinical practice. The UFO design is also extended to accommodate delayed outcomes. It demonstrates favorable operating characteristics in simulation studies. An R Shiny app for simulation and trial implementation using the UFO design is available at iusccc.shinyapps.io/smartdesign.
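A standard way to impose a constrained dose-outcome order without a parametric model is isotonic regression via the pool-adjacent-violators algorithm (PAVA): per-dose rates are averaged across adjacent doses whenever they violate monotonicity. The sketch below is the generic PAVA, offered only as background for the "constrained dose-outcome order" idea; it is not the UFO design's algorithm, and the example rates are invented.

```python
def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y,
    e.g. enforcing that estimated toxicity rates rise with dose."""
    vals, cnts = [], []                      # running block means and sizes
    for v in map(float, y):
        vals.append(v)
        cnts.append(1)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            total = cnts[-2] + cnts[-1]      # merge the violating pair
            merged = (vals[-2] * cnts[-2] + vals[-1] * cnts[-1]) / total
            vals[-2:], cnts[-2:] = [merged], [total]
    out = []
    for v, c in zip(vals, cnts):
        out.extend([v] * c)
    return out

# Observed toxicity rates at five doses, with one inversion at dose 4:
print(pava([0.05, 0.10, 0.25, 0.20, 0.40]))
```

The inverted pair (0.25, 0.20) is pooled to 0.225 at both doses, borrowing information across adjacent doses exactly as an order constraint requires.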
Evaluating the sample size requirements of tree-based ensemble machine learning techniques for clinical risk prediction.
Oya Kalaycıoğlu, Menelaos Pavlou, Serhat E Akhanlı, Mark A de Belder, Gareth Ambler, Rumana Z Omar
Pub Date: 2025-07-01 | Epub Date: 2025-05-14 | DOI: 10.1177/09622802251338983 | Statistical Methods in Medical Research, pp. 1356-1372 | Open access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12308042/pdf/

Machine learning techniques (MLTs) are increasingly being used to develop clinical risk prediction models for binary health outcomes, but the sample size requirements for developing and validating such models remain unclear. This study investigates whether sample size guidelines that target mean absolute prediction error (MAPE) for logistic regression models can be applied to tree-based ensemble MLTs (bagging, random forests, and boosting). Simulations based on two large cardiovascular datasets were used to evaluate the performance of MLTs in terms of MAPE, calibration, the C-statistic, and Brier score, across six data-generating mechanisms (DGMs) and varying sample sizes. When the DGM and analysis model matched, boosting required a sample size 2-3 times larger than recommended; random forests and bagging did not achieve the target MAPE even with a 12-fold increase. For a neutral DGM that did not match any of the analysis models, logistic regression with only main effects and boosting achieved target MAPE values with a 12-fold increase in the recommended sample size. For external validation, our simulations showed that sample size guidelines to achieve a target precision of the estimated C-statistic were suitable, and thus may be used to inform sample size calculations for MLTs.
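For clarity on the target criterion: MAPE here is the mean absolute difference between a model's estimated event probabilities and the true probabilities under the DGM, a quantity only computable in simulation where the truth is known. A one-line sketch (values invented):

```python
import numpy as np

def mape(p_true, p_hat):
    """Mean absolute prediction error between true event probabilities
    (known in simulation) and model-estimated probabilities."""
    return float(np.mean(np.abs(np.asarray(p_true) - np.asarray(p_hat))))

# Three subjects, errors of 0.05, 0.05, and 0.10 -> MAPE = 0.2/3
print(round(mape([0.1, 0.4, 0.8], [0.15, 0.35, 0.9]), 4))  # 0.0667
```

Because the target is accuracy of the risk estimates themselves rather than discrimination, a model can have an acceptable C-statistic yet still miss a MAPE target, which is why the flexible ensembles needed so much more data.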
Two-stage subsampling variable selection for sparse high-dimensional generalized linear models.
Marinela Capanu, Mihai Giurcanu, Colin B Begg, Mithat Gönen
Pub Date: 2025-07-01 | Epub Date: 2025-07-02 | DOI: 10.1177/09622802251343597 | Statistical Methods in Medical Research, pp. 1504-1521

Although high-dimensional data analysis has received a lot of attention since the advent of omics data, model selection in this setting continues to be challenging and there is still substantial room for improvement. Through a novel combination of existing methods, we propose a two-stage subsampling approach for variable selection in high-dimensional generalized linear regression models. In the first stage, we screen the variables using smoothly clipped absolute deviation (SCAD) penalty regularization followed by partial least squares (PLS) regression on repeated subsamples of the data; we include in the second stage only those predictors that were most frequently selected over the subsamples, either by SCAD or for having the top loadings in either of the first two PLS components. In the second stage, we again repeatedly subsample the data and, for each subsample, find the best Akaike information criterion model based on an exhaustive search of all possible models on the reduced set of predictors. We then include in the final model those predictors with high selection probability across the subsamples. We prove that the proposed first-stage estimator is √n-consistent and that the true predictors are included in the first stage with probability converging to 1. In an extensive simulation study, we show that this two-stage approach outperforms its competitors, yielding among the highest probabilities of selecting the true model while having one of the lowest numbers of false positives in the settings of logistic, Poisson, and linear regression. We illustrate the proposed method on two gene expression cancer datasets.
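The backbone of the first stage — fit a penalized model on many random subsamples and keep predictors by selection frequency — can be sketched compactly. The code below substitutes a plain coordinate-descent lasso for the SCAD penalty and omits the PLS screen entirely, so it illustrates only the subsampling/frequency mechanism; every name and tuning value is an assumption.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso (stand-in for the SCAD penalty)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta

def selection_frequencies(X, y, lam, n_sub=30, frac=0.5, seed=0):
    """Stage-one screening: fit the penalized model on repeated random
    subsamples and record how often each predictor is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        counts += lasso_cd(X[idx], y[idx], lam) != 0
    return counts / n_sub

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=200)
freq = selection_frequencies(X, y, lam=40.0)
print(freq[:2], freq[2:].max())             # true predictors dominate
```

Predictors that are signal survive the subsampling consistently; noise predictors are picked only sporadically, which is what makes the frequency a useful screen.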
Penalized estimation for varying coefficient additive hazards models.
Hoi Min Ng, Kin Yau Wong
Pub Date: 2025-07-01 | Epub Date: 2025-05-14 | DOI: 10.1177/09622802251338978 | Statistical Methods in Medical Research, pp. 1373-1384

Varying coefficient models are commonly used to capture intricate interaction effects among covariates in regression models, allowing for the modification of one covariate's effect by another. Although these models offer increased flexibility, they also introduce greater estimation and computational complexity as a trade-off. This complexity is particularly evident in genomic studies, where the covariates are often high-dimensional, rendering conventional estimation methods inapplicable. In this paper, we study a penalized estimation method for the varying coefficient additive hazards model. We adopt the group lasso penalty along with the kernel smoothing technique to estimate the varying coefficients. In contrast to existing kernel methods, which only use a "local" neighborhood of subjects to estimate the varying coefficient function at any given point, the proposed method takes a "global" approach that incorporates all subjects and is more efficient. Through extensive simulation studies, we demonstrate that the proposed method produces interpretable results with satisfactory predictive performance. We provide an application to a major cancer genomic study.
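The group lasso's defining operation is its proximal step: each group of coefficients is shrunk toward zero by its Euclidean norm, so entire groups (here, all basis coefficients of one covariate's varying-coefficient function) vanish together. A minimal sketch of that operator alone (generic, not the paper's full estimation procedure):

```python
import numpy as np

def group_soft_threshold(z, groups, lam):
    """Group-lasso proximal operator: shrink each coefficient group toward
    zero by a factor (1 - lam/||z_g||)+, zeroing whole groups at once."""
    z = np.asarray(z, float)
    out = np.zeros_like(z)
    for g in np.unique(groups):
        m = groups == g
        norm = np.linalg.norm(z[m])
        if norm > lam:                       # otherwise the group stays 0
            out[m] = (1 - lam / norm) * z[m]
    return out

z = np.array([3.0, 4.0, 0.1, -0.2])
groups = np.array([0, 0, 1, 1])
print(group_soft_threshold(z, groups, lam=1.0))  # group 1 zeroed entirely
```

Group-wise zeroing is what keeps the fitted model interpretable: a covariate either has an estimated varying coefficient function or is excluded outright.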