The development and validation of prognostic models for overall survival in the presence of missing data in the training dataset: a strategy with a detailed example.

Diagnostic and Prognostic Research (IF 2.6) · Pub Date: 2021-08-04 · DOI: 10.1186/s41512-021-00103-9

Kara-Louise Royle, David A Cairns



Abstract

Background: The United Kingdom Myeloma Research Alliance (UK-MRA) Myeloma Risk Profile is a prognostic model for overall survival. It was trained and tested on clinical trial data, with the aim of improving the stratification of transplant-ineligible (TNE) patients with newly diagnosed multiple myeloma. Missing data are a common problem in the development and validation of prognostic models, and decisions on how to address missingness have implications for the choice of methodology.

Methods:

Model building: The training and test datasets were the TNE pathways from two large randomised multicentre, phase III clinical trials. Potential prognostic factors were identified by expert opinion. Missing data in the training dataset was imputed using multiple imputation by chained equations. Univariate analysis fitted Cox proportional hazards models in each imputed dataset with the estimates combined by Rubin's rules. Multivariable analysis applied penalised Cox regression models, with a fixed penalty term across the imputed datasets. The estimates from each imputed dataset and bootstrap standard errors were combined by Rubin's rules to define the prognostic model.

Model assessment: Calibration was assessed by visualising the observed and predicted probabilities across the imputed datasets. Discrimination was assessed by combining the prognostic separation D-statistic from each imputed dataset by Rubin's rules.

Model validation: The D-statistic was applied in a bootstrap internal validation process in the training dataset and an external validation process in the test dataset, where acceptable performance was pre-specified.

Development of risk groups: Risk groups were defined using the tertiles of the combined prognostic index, obtained by combining the prognostic index from each imputed dataset by Rubin's rules.
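Two of the steps described in the Methods, pooling per-imputation estimates by Rubin's rules and splitting a combined prognostic index at its tertiles, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `pool_rubin` and `assign_risk_groups`, the group labels, and the toy inputs are hypothetical, and the tertile cut points rely on the default interpolation of the standard library's `statistics.quantiles`, which may differ from the rule used in the paper.

```python
from math import sqrt
from statistics import mean, quantiles, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets by Rubin's rules.

    estimates: the m point estimates (e.g. a log hazard ratio per dataset)
    variances: the m squared standard errors for those estimates
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)            # average of the point estimates
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


def assign_risk_groups(pi_by_imputation):
    """Assign each patient to a tertile-based risk group.

    pi_by_imputation: m lists, each holding one prognostic index per patient
    (same patient order in every list). The combined index is the per-patient
    mean across imputations, i.e. the Rubin's rules point estimate.
    """
    combined = [mean(values) for values in zip(*pi_by_imputation)]
    lower_cut, upper_cut = quantiles(combined, n=3)  # two tertile cut points
    groups = []
    for value in combined:
        if value <= lower_cut:
            groups.append("low")
        elif value <= upper_cut:
            groups.append("medium")
        else:
            groups.append("high")
    return combined, groups


if __name__ == "__main__":
    # Toy numbers only; they are not taken from the trials.
    estimate, se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
    print(estimate, se)
    _, groups = assign_risk_groups([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                                    [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]])
    print(groups)
```

The same pooling function applies to any scalar quantity with a variance estimate, so under these assumptions it could be reused for regression coefficients and for the D-statistic alike.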

Results: The training dataset included 1852 patients, of whom 1268 (68.47%) had complete case data. Ten imputed datasets were generated. Five hundred twenty patients were included in the test dataset. The D-statistic for the prognostic model was 0.840 (95% CI 0.716-0.964) in the training dataset and 0.654 (95% CI 0.497-0.811) in the test dataset, and the corrected D-statistic was 0.801.
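The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes the confidence intervals are symmetric normal-approximation intervals (z = 1.96), which the abstract does not state, and it treats the gap between the training and corrected D-statistics as an optimism estimate, which is an interpretation rather than something the abstract says. The helper name `se_from_ci` is hypothetical.

```python
def se_from_ci(lower, upper, z=1.96):
    """Back out a standard error from a symmetric normal-approximation CI."""
    return (upper - lower) / (2 * z)


# Proportion of training patients with complete case data.
complete_case_pct = 100 * 1268 / 1852

# Implied standard errors of the D-statistic (assumes z = 1.96 intervals).
se_train = se_from_ci(0.716, 0.964)
se_test = se_from_ci(0.497, 0.811)

# Gap between the apparent and corrected D-statistic in the training data
# (read here as optimism; that reading is an assumption).
optimism = 0.840 - 0.801

print(round(complete_case_pct, 2), round(se_train, 4),
      round(se_test, 4), round(optimism, 3))
```

Both intervals are centred on their point estimates (0.840 ± 0.124 and 0.654 ± 0.157), which is consistent with, though not proof of, the normal-approximation assumption.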

Conclusion: The decision to impute missing covariate data in the training dataset influenced the methods implemented to train and test the model. To extend current literature and aid future researchers, we have presented a detailed example of one approach. Whilst our example is not without limitations, a benefit is that all of the patient information available in the training dataset was utilised to develop the model.

Trial registration: Both trials were registered: Myeloma IX, ISRCTN68454111, registered 21 September 2000; Myeloma XI, ISRCTN49407852, registered 24 June 2009.
