
Latest publications in Statistical Methods in Medical Research

Multiple imputation for systematically missing effect modifiers in individual participant data meta-analysis.
IF 1.9 · Medicine (CAS Tier 3) · Q3 HEALTH CARE SCIENCES & SERVICES · Pub Date: 2025-08-01 · Epub Date: 2025-06-20 · DOI: 10.1177/09622802251348800
Robert Thiesmeier, Scott M Hofer, Nicola Orsini

Individual participant data (IPD) meta-analysis of randomised trials is a crucial method for detecting and investigating effect modifications in medical research. However, few studies have explored scenarios involving systematically missing data on discrete effect modifiers (EMs) in IPD meta-analyses with a limited number of trials. This simulation study examines the impact of systematic missing values in IPD meta-analysis using a two-stage imputation method. We simulated IPD meta-analyses of randomised trials with multiple studies that had systematically missing data on the EM. A multivariable Weibull survival model was specified to assess beneficial (Hazard Ratio (HR)=0.8), null (HR=1.0), and harmful (HR=1.2) treatment effects for low, medium, and high levels of an EM, respectively. Bias and coverage were evaluated using Monte-Carlo simulations. The absolute bias for common and heterogeneous effect IPD meta-analyses was less than 0.016 and 0.007, respectively, with coverage close to its nominal value across all EM levels. An uncongenial imputation model resulted in larger bias, even when the proportion of studies with systematically missing data on the EM was small. Overall, the proposed two-stage imputation approach provided unbiased estimates with improved precision. The assumptions and limitations of this approach are discussed.
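The data-generating step described above can be sketched with an inverse-CDF draw from a Weibull proportional-hazards model. The shape and scale values below are illustrative assumptions, not the paper's actual simulation settings:

```python
import math
import random

def simulate_weibull_ph(n, hr, shape=1.5, scale=0.1, seed=1):
    """Draw n event times from a Weibull model whose baseline hazard is
    multiplied by the hazard ratio `hr` (inverse-CDF method)."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        # S(t) = exp(-scale * hr * t^shape)  =>  t = (-log U / (scale * hr))^(1/shape)
        times.append((-math.log(u) / (scale * hr)) ** (1.0 / shape))
    return times

def median(xs):
    s = sorted(xs)
    return s[len(s) // 2]

# beneficial, null, and harmful arms as in the abstract (HR = 0.8, 1.0, 1.2)
t_benefit = simulate_weibull_ph(2000, hr=0.8)
t_null = simulate_weibull_ph(2000, hr=1.0)
t_harm = simulate_weibull_ph(2000, hr=1.2)
```

With a shared seed the uniform draws are coupled across arms, so a larger hazard ratio shortens every simulated time and the arm medians order accordingly.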

Citations: 0
Bayesian inference for nonlinear mixed-effects location scale and interval-censoring cure-survival models: An application to pregnancy miscarriage.
IF 1.9 · Medicine (CAS Tier 3) · Q3 HEALTH CARE SCIENCES & SERVICES · Pub Date: 2025-08-01 · Epub Date: 2025-05-29 · DOI: 10.1177/09622802251345485
Danilo Alvares, Cristian Meza, Rolando De la Cruz

Motivated by a pregnancy miscarriage study, we propose a Bayesian joint model for longitudinal and time-to-event outcomes that takes into account different complexities of the problem. In particular, the longitudinal process is modeled by means of a nonlinear specification with subject-specific error variance. In addition, the exact time of fetal death is unknown, and a subgroup of women is not susceptible to miscarriage. Hence, we model the survival process via a mixture cure model for interval-censored data. Finally, both processes are linked through the subject-specific longitudinal mean and variance. A simulation study is conducted in order to validate our joint model. In the real application, we use individual weighted and Cox-Snell residuals to assess the goodness-of-fit of our proposal versus a joint model that shares only the subject-specific longitudinal mean (standard approach). In addition, the leave-one-out cross-validation criterion is applied to compare the predictive ability of both models.
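The mixture cure component has a simple closed form: population survival is the cured fraction plus the uncured fraction times a latency survival curve. A minimal sketch with a Weibull latency (the latency distribution and its parameters are illustrative, not the authors' specification):

```python
import math

def cure_survival(t, cure_prob, scale=1.0, shape=1.0):
    """Population survival under a mixture cure model:
    S(t) = pi + (1 - pi) * S_u(t), with S_u a Weibull latency curve."""
    s_uncured = math.exp(-((t / scale) ** shape))
    return cure_prob + (1.0 - cure_prob) * s_uncured

# the population survival curve plateaus at the cure probability
plateau = cure_survival(1e9, cure_prob=0.3)
```

The plateau is what distinguishes a cure model from a standard survival model, where S(t) would decay to zero.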

Citations: 0
Strategies to boost statistical efficiency in randomized oncology trials with primary time-to-event endpoints.
IF 1.9 · Medicine (CAS Tier 3) · Q3 HEALTH CARE SCIENCES & SERVICES · Pub Date: 2025-08-01 · Epub Date: 2025-06-23 · DOI: 10.1177/09622802251343599
Alan D Hutson, Han Yu

Oncology clinical trials are increasingly expensive, necessitating efforts to streamline phase II and III trials to reduce costs and expedite treatment delivery. Randomization is often impractical in oncology trials due to small sample sizes and limited statistical power, leading to biased inferences. The FDA has recently published guidance documents encouraging the use of prognostic baseline measures to improve the precision of inferences around treatment effects. To address this, we propose an extension of Rosenbaum's exact testing method incorporating a variant of martingale residuals for right censored data. This method can dramatically improve the statistical power of the test comparing treatment arms given time-to-event endpoints as compared to the standard log-rank test. Additionally, the modification of the martingale residual provides a straightforward metric for summarizing treatment effect by quantifying the expected events per treatment arm at each time-point. This approach is illustrated using a phase II clinical trial in small cell lung cancer.
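The martingale residual the authors build on is M_i = δ_i − Ĥ(t_i): the observed event indicator minus the (Nelson-Aalen) cumulative hazard accrued by subject i up to their follow-up time. A covariate-free sketch of that quantity (the paper's modified variant is more involved):

```python
def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard evaluated at each subject's own
    follow-up time (no covariates, no tie correction)."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    cum, at_risk = 0.0, len(times)
    H = [0.0] * len(times)
    for i in order:
        if events[i]:
            cum += 1.0 / at_risk
        H[i] = cum
        at_risk -= 1
    return H

def martingale_residuals(times, events):
    """M_i = event indicator minus cumulative hazard at follow-up time."""
    H = nelson_aalen(times, events)
    return [e - h for e, h in zip(events, H)]

# toy data: 1 = event observed, 0 = censored
res = martingale_residuals([2.0, 5.0, 3.0, 7.0, 4.0], [1, 0, 1, 1, 0])
```

These residuals sum to zero and are nonpositive for censored subjects, which is what makes them interpretable as "observed minus expected events" per subject.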

Citations: 0
Reproducible feature selection in heterogeneous multicenter datasets via sign-consistency criteria.
IF 1.9 · Medicine (CAS Tier 3) · Q3 HEALTH CARE SCIENCES & SERVICES · Pub Date: 2025-07-01 · Epub Date: 2025-05-14 · DOI: 10.1177/09622802251338375
Xun Zhao, Yalu Ping

The identification of risk features associated with disease plays a crucial role in biomedical fields. These features are often used to provide evidence for clinical decision-making. However, in the presence of between-center heterogeneity, covariate effects across data centers may exhibit inconsistent directions, making feature selection challenging. In this work, we propose a novel framework to select reproducible risk features whose underlying effects are consistent across different centers. We quantify the feature reproducibility based on the sign-consistency criterion, which provides an acceptable level of heterogeneity in effect sizes and ensures the reasonable similarity of reproducible signals. Compared with the existing feature selection methods, our proposed method effectively protects data privacy and does not rely on the assumption of data homogeneity. Extensive simulations demonstrated that the proposed method has greater power than existing methods do. We apply the proposed approach to analyze data from the China Health and Retirement Study Longitudinal Study (CHARLS) and identify nine important risk factors that show reproducible associations with depression.
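A minimal reading of the sign-consistency criterion: a feature counts as reproducible when its center-specific effect estimates all point the same way. The strict all-centers rule below is a simplification — the paper allows an acceptable level of effect-size heterogeneity rather than demanding exact agreement:

```python
def sign_consistent(center_effects, tol=0.0):
    """True when every center-specific effect estimate lies strictly on
    the same side of +/- tol (a simplified sign-consistency check)."""
    return (all(e > tol for e in center_effects)
            or all(e < -tol for e in center_effects))

def reproducible_features(effects_by_feature):
    """Indices of features whose effects are sign-consistent across centers."""
    return [j for j, eff in enumerate(effects_by_feature) if sign_consistent(eff)]
```

Note that only the signs of the per-center estimates are needed, which is one reason such criteria are friendly to privacy-preserving, decentralized analysis.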

Citations: 0
Fast leave-one-cluster-out cross-validation using clustered network information criterion.
IF 1.9 · Medicine (CAS Tier 3) · Q3 HEALTH CARE SCIENCES & SERVICES · Pub Date: 2025-07-01 · Epub Date: 2025-06-19 · DOI: 10.1177/09622802251345486
Jiaxing Qiu, Douglas E Lake, Pavel Chernyavskiy, Teague R Henry

For prediction models developed on clustered data that do not account for cluster heterogeneity in model parameterization, it is crucial to use cluster-based validation to assess model generalizability on unseen clusters. This article introduces a clustered estimator of the network information criterion to approximate leave-one-cluster-out deviance for standard prediction models with twice-differentiable log-likelihood functions. The clustered network information criterion serves as a fast alternative to cluster-based cross-validation. Stone proved that the Akaike information criterion is asymptotically equivalent to leave-one-observation-out cross-validation for true parametric models with independent and identically distributed observations. Ripley noted that the network information criterion, derived from Stone's proof, is a better approximation when the model is misspecified. For clustered data, we derived clustered network information criterion by substituting the Fisher information matrix in the network information criterion with a clustering-adjusted estimator. The clustered network information criterion imposes a greater penalty when the data exhibits stronger clustering, thereby allowing the clustered network information criterion to better prevent over-parameterization. In a simulation study and an empirical example, we used standard regression to develop prediction models for clustered data with Gaussian or binomial responses. Compared to the commonly used Akaike information criterion and Bayesian information criterion for standard regression, clustered network information criterion provides a much more accurate approximation to leave-one-cluster-out deviance and results in more accurate model size and variable selection, as determined by cluster-based cross-validation, especially when the data exhibit strong clustering.
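The substitution at the heart of the clustered criterion can be seen in one dimension. For a Gaussian mean model with unit variance, the observed information is J = n and the NIC penalty is tr(J⁻¹K); the clustered version builds K from cluster-summed scores instead of per-observation scores. A deterministic toy example with perfect within-cluster correlation (purely illustrative):

```python
def nic_penalties(scores, clusters):
    """Return (independent, cluster-adjusted) penalties tr(J^{-1} K)
    for a one-parameter model with observed information J = n."""
    J = float(len(scores))
    k_indep = sum(s * s for s in scores)
    cluster_sums = {}
    for s, c in zip(scores, clusters):
        cluster_sums[c] = cluster_sums.get(c, 0.0) + s
    k_clustered = sum(v * v for v in cluster_sums.values())
    return k_indep / J, k_clustered / J

# 20 clusters of 10 observations whose scores agree perfectly within cluster
scores = [1.0 if c % 2 == 0 else -1.0 for c in range(20) for _ in range(10)]
clusters = [c for c in range(20) for _ in range(10)]
pen_indep, pen_clustered = nic_penalties(scores, clusters)
```

Strong clustering inflates the cluster-adjusted penalty (here by exactly the cluster size), which is how the criterion imposes a greater penalty on clustered data and guards against over-parameterization.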

Citations: 0
A model-free phase I/II dose optimization design for immunotherapy trials.
IF 1.9 · Medicine (CAS Tier 3) · Q3 HEALTH CARE SCIENCES & SERVICES · Pub Date: 2025-07-01 · Epub Date: 2025-05-15 · DOI: 10.1177/09622802251340246
Yingjie Qiu, Mengyi Lu, Yan Han, Wenxian Zhou, Yi Zhao, Leng Han, Yong Zang

We present a model-free phase I/II clinical trial design, referred to as the UFO design, to optimize the dose of immunotherapy by jointly modeling toxicity, efficacy, and immune response outcomes. Instead of relying on complex parametric modeling approaches, we propose a model-free approach that uses the inherent correlations among different types of outcomes in immunotherapy and the constrained dose-outcome order to facilitate information sharing across different doses. This approach makes the UFO design efficient and transparent to implement in clinical practice. The UFO design is also extended to accommodate delayed outcomes. It demonstrates favorable operating characteristics through simulation studies. The R Shiny app for simulation and trial implementation using the UFO design is also provided at iusccc.shinyapps.io/smartdesign.
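The constrained dose-outcome order the design exploits is typically monotone toxicity in dose. One generic way to impose such an order on raw per-dose estimates is the pool-adjacent-violators algorithm, sketched below; this is a standard isotonic-regression step, not the UFO design's actual estimator:

```python
def pava(values, weights=None):
    """Pool-adjacent-violators: the closest nondecreasing fit to `values`
    under weighted least squares (e.g. raw per-dose toxicity rates)."""
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block: [pooled mean, total weight, number of points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # merge backwards while adjacent blocks violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fitted = []
    for v, _, c in blocks:
        fitted.extend([v] * c)
    return fitted

# raw toxicity rates with one order violation between doses 2 and 3
iso = pava([0.10, 0.30, 0.20, 0.40])
```

The violating pair is pooled to its average while the already-ordered doses are left untouched.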

Citations: 0
Evaluating the sample size requirements of tree-based ensemble machine learning techniques for clinical risk prediction.
IF 1.9 · Medicine (CAS Tier 3) · Q3 HEALTH CARE SCIENCES & SERVICES · Pub Date: 2025-07-01 · Epub Date: 2025-05-14 · DOI: 10.1177/09622802251338983
Oya Kalaycıoğlu, Menelaos Pavlou, Serhat E Akhanlı, Mark A de Belder, Gareth Ambler, Rumana Z Omar

Machine learning techniques (MLTs) are increasingly being used to develop clinical risk prediction models for binary health outcomes but the sample size requirements for developing and validating such models remain unclear. This study investigates whether sample size guidelines that target mean absolute prediction error (MAPE) for logistic regression models can be applied to tree-based ensemble MLTs (bagging, random forests, and boosting). Simulations based on two large cardiovascular datasets were used to evaluate the performance of MLTs in terms of MAPE, calibration, the C-statistic and Brier score, across six data-generating mechanisms (DGMs) and varying sample sizes. When the DGM and analysis model matched, boosting required a sample size 2-3 times larger than recommended; random forests and bagging did not achieve the target MAPE even with a 12-fold increase. For a neutral DGM that did not match any of the analysis models, logistic regression with only main effects and boosting resulted in target MAPE values with a 12-fold increase in the recommended sample size. For external validation, our simulations showed that sample size guidelines to achieve a target precision of the estimated C-statistic were suitable, and thus may be used to inform sample size calculations for MLTs.
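MAPE here is the mean absolute gap between true and estimated event probabilities for the same individuals — the quantity the sample-size targets are set against. A one-line definition makes it concrete:

```python
def mape(true_risks, est_risks):
    """Mean absolute prediction error between true and estimated
    event probabilities for the same individuals."""
    if len(true_risks) != len(est_risks) or not true_risks:
        raise ValueError("need two equal-length, non-empty risk vectors")
    return sum(abs(t - e) for t, e in zip(true_risks, est_risks)) / len(true_risks)
```

In a simulation, the true risks come from the data-generating mechanism and the estimated risks from the fitted model, so MAPE measures calibration of individual predictions rather than discrimination.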

Citations: 0
Two-stage subsampling variable selection for sparse high-dimensional generalized linear models.
IF 1.9 · Medicine (CAS Tier 3) · Q3 HEALTH CARE SCIENCES & SERVICES · Pub Date: 2025-07-01 · Epub Date: 2025-07-02 · DOI: 10.1177/09622802251343597
Marinela Capanu, Mihai Giurcanu, Colin B Begg, Mithat Gönen

Although high-dimensional data analysis has received a lot of attention after the advent of omics data, model selection in this setting continues to be challenging and there is still substantial room for improvement. Through a novel combination of existing methods, we propose here a two-stage subsampling approach for variable selection in high-dimensional generalized linear regression models. In the first stage, we screen the variables using smoothly clipped absolute deviance penalty regularization followed by partial least squares regression on repeated subsamples of the data; we include in the second stage only those predictors that were most frequently selected over the subsamples either by smoothly clipped absolute deviance or for having the top loadings in either of the first two partial least squares regression components. In the second stage, we again repeatedly subsample the data and, for each subsample, we find the best Akaike information criterion model based on an exhaustive search of all possible models on the reduced set of predictors. We then include in the final model those predictors with high selection probability across the subsamples. We prove that the proposed first-stage estimator is n1/2-consistent and that the true predictors are included in the first stage with probability converging to 1. In an extensive simulation study, we show that this two-stage approach outperforms the competitors yielding among the highest probability of selecting the true model while having one of the lowest number of false positives in the settings of logistic, Poisson, and linear regression. We illustrate the proposed method on two gene expression cancer datasets.
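The subsample-and-count logic of the first stage can be sketched generically. The screener below is a toy covariance filter standing in for the paper's SCAD-plus-partial-least-squares screening; only the selection-frequency mechanics are the point:

```python
import random

def top_covariance_screen(X, y, k=2):
    """Toy screener: indices of the k predictors with largest |covariance|
    with y (a stand-in for SCAD + partial least squares screening)."""
    n, p = len(X), len(X[0])
    ybar = sum(y) / n
    cov = []
    for j in range(p):
        xbar = sum(row[j] for row in X) / n
        cov.append(abs(sum((row[j] - xbar) * (yi - ybar)
                           for row, yi in zip(X, y)) / n))
    return sorted(range(p), key=lambda j: cov[j], reverse=True)[:k]

def selection_frequencies(X, y, screen, n_subsamples=50, frac=0.7, seed=0):
    """Fraction of row-subsamples in which each predictor survives screening."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), int(frac * n))
        for j in screen([X[i] for i in idx], [y[i] for i in idx]):
            counts[j] += 1
    return [c / n_subsamples for c in counts]

# demo: predictor 0 drives the outcome; predictors 1-3 are pure noise
random.seed(1)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(100)]
y = [row[0] for row in X]
freqs = selection_frequencies(X, y, top_covariance_screen)
```

Predictors with high selection frequency across subsamples are carried to the second stage, which mirrors how the proposed method stabilizes screening against sampling noise.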

Citations: 0
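The two-stage subsample-and-screen idea described in the abstract above can be sketched in a much simplified form. In this sketch, absolute-correlation screening stands in for the paper's SCAD-plus-partial-least-squares first stage, and a single exhaustive AIC search over the screened set replaces the second-stage subsampling; all function names and tuning values (`n_sub`, `keep`) are illustrative, not the authors' implementation.

```python
import itertools
import numpy as np

def screen_by_subsampling(X, y, n_sub=50, keep=3, rng=None):
    """Stage 1 (simplified): on repeated half-subsamples, rank predictors by
    absolute correlation with y (a stand-in for SCAD + PLS screening) and
    record how often each predictor lands in the top `keep`."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=n // 2, replace=False)
        Xs, ys = X[idx], y[idx]
        corr = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(p)])
        counts[np.argsort(corr)[-keep:]] += 1
    return counts / n_sub  # selection frequency per predictor

def aic_linear(X, y):
    """Gaussian AIC for ordinary least squares: n*log(RSS/n) + 2k."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

def best_aic_subset(X, y, candidates):
    """Stage 2 (simplified): exhaustive AIC search over the screened set."""
    best, best_aic = (), np.inf
    for r in range(1, len(candidates) + 1):
        for subset in itertools.combinations(candidates, r):
            a = aic_linear(X[:, list(subset)], y)
            if a < best_aic:
                best, best_aic = subset, a
    return best
```

With a strong two-signal design (`y = 2*X[:,0] - 1.5*X[:,1] + noise`), the true predictors dominate the selection frequencies, and the exhaustive AIC search over the screened candidates recovers them.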
Penalized estimation for varying coefficient additive hazards models.
IF 1.9, Medicine (Tier 3), Q3 HEALTH CARE SCIENCES & SERVICES, Pub Date: 2025-07-01, Epub Date: 2025-05-14, DOI: 10.1177/09622802251338978
Hoi Min Ng, Kin Yau Wong

Varying coefficient models are commonly used to capture intricate interaction effects among covariates in regression models, allowing for the modification of one covariate's effect by another. Although these models offer increased flexibility, they also introduce greater estimation and computational complexity as a trade-off. This complexity is particularly evident in genomic studies, where the covariates are often high-dimensional, rendering conventional estimation methods inapplicable. In this paper, we study a penalized estimation method for the varying coefficient additive hazards model. We adopt the group lasso penalty along with the kernel smoothing technique to estimate the varying coefficients. In contrast to existing kernel methods, which only use a "local" neighborhood of subjects to estimate the varying coefficient function at any given point, the proposed method takes a "global" approach that incorporates all subjects and is more efficient. Through extensive simulation studies, we demonstrate that the proposed method produces interpretable results with satisfactory predictive performance. We provide an application to a major cancer genomic study.

Statistical Methods in Medical Research, pages 1373-1384.
Citations: 0
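A minimal sketch of kernel-smoothed estimation of a varying coefficient, assuming a toy linear model y = β(U)·x + ε rather than the additive hazards model studied in the paper. The Gaussian kernel, bandwidth `h`, and function name are illustrative; this is the generic "local" kernel-weighted least squares approach that the paper contrasts with its more efficient global group-lasso estimator.

```python
import numpy as np

def varying_coef_ls(x, u, y, grid, h=0.2):
    """Kernel-weighted least squares estimate of beta(u0) in y = beta(U)*x + eps.
    For each grid point u0, subjects are weighted by a Gaussian kernel in the
    effect modifier U, and a one-parameter weighted regression of y on x gives
    the local coefficient estimate."""
    est = []
    for u0 in grid:
        w = np.exp(-0.5 * ((u - u0) / h) ** 2)  # kernel weights in U
        est.append(np.sum(w * x * y) / np.sum(w * x * x))
    return np.array(est)
```

Simulating β(u) = 1 + u recovers the linear trend at interior grid points, with the usual bandwidth-driven bias-variance trade-off.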
Two-stage targeted minimum-loss based estimation for non-negative two-part outcomes.
IF 1.9, Medicine (Tier 3), Q3 HEALTH CARE SCIENCES & SERVICES, Pub Date: 2025-07-01, Epub Date: 2025-06-26, DOI: 10.1177/09622802251340245
Nicholas T Williams, Richard Liu, Katherine L Hoffman, Sarah Forrest, Kara E Rudolph, Iván Díaz

Non-negative two-part outcomes are defined as outcomes whose distribution has a point mass at zero but is otherwise positive. Examples, such as healthcare expenditure and hospital length of stay, are common in healthcare utilization research. Despite the practical relevance of non-negative two-part outcomes, few methods exist to leverage knowledge of their semicontinuity to achieve improved performance in estimating causal effects. In this paper, we develop a nonparametric two-stage targeted minimum-loss based estimator (denoted as hTMLE) for non-negative two-part outcomes. We present methods for a general class of interventions, which can accommodate continuous, categorical, and binary exposures. The two-stage TMLE uses a targeted estimate of the intensity component of the outcome to produce a targeted estimate of the binary component of the outcome that may improve finite sample efficiency. We demonstrate the efficiency gains achieved by the two-stage TMLE with simulated examples and then apply it to a cohort of Medicaid beneficiaries to estimate the effect of chronic pain and physical disability on days' supply of opioids.

Statistical Methods in Medical Research, pages 1431-1441. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12717843/pdf/
Citations: 0
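The two-part structure that the abstract above exploits rests on the decomposition E[Y] = P(Y > 0) · E[Y | Y > 0], with a binary component and an intensity component. A plain empirical plug-in for this decomposition can be sketched as follows; it illustrates the two components the two-stage TMLE models separately, and is emphatically not the TMLE itself (no targeting step, no covariate adjustment).

```python
import numpy as np

def two_part_mean(y):
    """Empirical plug-in for E[Y] = P(Y > 0) * E[Y | Y > 0] of a
    semicontinuous outcome: estimate the binary component (probability of a
    positive outcome) and the intensity component (mean among positives),
    then multiply. Returns (overall mean, P(Y>0), E[Y | Y>0])."""
    y = np.asarray(y, dtype=float)
    pos = y > 0
    p_pos = pos.mean()                           # binary component
    mu_pos = y[pos].mean() if pos.any() else 0.0  # intensity component
    return p_pos * mu_pos, p_pos, mu_pos
```

For example, `two_part_mean([0, 0, 2, 4])` decomposes the overall mean 1.5 as 0.5 × 3.0.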
Journal: Statistical Methods in Medical Research