Pub Date: 2025-12-04 | DOI: 10.1186/s41512-025-00206-7
Hamish Innes, Philip J Johnson
Background: Some prognostic factors (PFs) are "versatile" insofar as they predict diverse health outcomes (age is an exemplar par excellence). In this study, we sought to quantify the versatility of commonly measured PFs.
Methods: Participants from the UK Biobank (UKB) study were followed from enrolment until the date of outcome or censoring. Over 800 incident adverse outcomes were considered, each defined by a unique 3-digit ICD code (A00, A01, A02, etc.). Twenty-four routine PFs, including renal, liver function and blood count biomarkers, featured in this analysis. Cox regression was used to determine the association between each PF and time to each outcome event. The number of statistically significant associations, the direction of each association (positive/negative) and the median log hazard ratio (LHR) were determined for each PF. Data were visualised using volcano plots.
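The per-PF summary step described in the methods can be sketched in a few lines, assuming each of the ~836 outcome-specific Cox fits has already been reduced to a (log hazard ratio, p-value) pair; the input layout here is illustrative, not taken from the paper.

```python
# Hedged sketch: tally significant associations, their direction, and the
# median absolute log hazard ratio for one prognostic factor (PF).
from statistics import median

def summarise_pf(fits, alpha=0.05):
    """fits: list of (log_hr, p_value) pairs, one per 3-digit ICD outcome."""
    sig = [(lhr, p) for lhr, p in fits if p < alpha]
    return {
        "n_sig": len(sig),                                  # significant associations
        "n_total": len(fits),                               # outcomes tested
        "n_positive": sum(1 for lhr, _ in sig if lhr > 0),  # direction of effect
        "median_abs_lhr": median(abs(lhr) for lhr, _ in sig) if sig else None,
    }
```

For example, `summarise_pf([(0.5, 0.01), (-0.3, 0.04), (0.1, 0.2)])` counts two significant associations, one positive, with median |LHR| 0.4.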
Results: The analysis included up to 502,408 UKB participants followed for 12.4 years. PFs with the greatest number of statistically significant associations were age (563/836; median LHR: 0.47), waist-hip ratio (530/836; LHR: 0.37) and hand-grip strength (416/836; LHR: 0.27). Conversely, the PFs with the fewest significant associations were diastolic blood pressure (138/835; LHR: 0.11) and total protein (138/835; LHR: 0.11). A positive correlation was observed between the number of outcomes a PF was associated with and the average effect size of those associations.
Conclusion: A wide spectrum exists between the most and least versatile PFs. In addition to age, waist-hip ratio and hand-grip strength exhibit high versatility. Understanding PF versatility has implications for optimising the development and performance of prognostic models.
Title: Quantifying the versatility of routinely measured prognostic factors. Diagnostic and prognostic research, 9(1): 25. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12676796/pdf/
Pub Date: 2025-12-02 | DOI: 10.1186/s41512-025-00209-4
Giovanni Cinà, Tabea E Röber, Rob Goedhart, Ş İlker Birbil
The recent uptake of certified Artificial Intelligence (AI) tools for healthcare applications has renewed the debate around their adoption. Explainable AI, the sub-discipline that promises to render AI devices more transparent and trustworthy, has also come under scrutiny as part of this discussion. Some experts in the medical AI space question the reliability of Explainable AI techniques, expressing concerns about their use and their inclusion in guidelines and standards. Revisiting such criticisms, this article offers a balanced perspective on the utility of Explainable AI, focusing on the specificity of clinical applications of AI and placing them in the context of healthcare interventions. Against its detractors, and despite valid concerns, we argue that the Explainable AI research program is still central to human-machine interaction and ultimately a useful tool against loss of control, a danger that cannot be prevented by rigorous clinical validation alone.
Title: Why we do need explainable AI for healthcare. Diagnostic and prognostic research, 9(1): 24. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12670843/pdf/
Pub Date: 2025-11-11 | DOI: 10.1186/s41512-025-00208-5
Jason E Black, David Jt Campbell, Paul E Ronksley, Kerry A McBrien, Tyler S Williamson
Background: Several clinical prediction models that predict the risk of chronic kidney disease (CKD) in people with diabetes have been developed; however, these models lack external validation demonstrating accurate predictions in Canadian primary care. We externally validated existing clinical prediction models for CKD in Canadian primary care data, overall and across subgroups defined by sex/gender, age, comorbidities, and neighbourhood-level deprivation.
Methods: We conducted a retrospective cohort study using data from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN) electronic medical record database (2014-2019). We identified models that use demographic, health behaviour, clinical and diabetes-related characteristics to predict incident CKD, based on two recent systematic reviews, and included models with sufficient predictors available in CPCSSN (≤1 unavailable) and eGFR-based CKD definitions. We included adult patients (18+) with diabetes and without an existing diagnosis of CKD. We identified incident cases of CKD within 5 years based on ≥2 laboratory values corresponding to eGFR <60 mL/min/1.73 m², separated by ≥90 days and ≤1 year. For each model, we estimated discrimination, precision, recall, and calibration within CPCSSN.
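The incident-CKD rule stated in the methods (at least two eGFR values below 60 mL/min/1.73 m², separated by at least 90 days and at most one year) can be sketched as a small check; the (date, eGFR) input layout is an assumption of this sketch, not the CPCSSN schema.

```python
# Illustrative check for the eGFR-based incident-CKD criterion described above.
from datetime import date

def meets_ckd_criteria(readings, threshold=60.0,
                       min_gap_days=90, max_gap_days=365):
    """readings: iterable of (datetime.date, eGFR) pairs for one patient."""
    low_dates = sorted(d for d, egfr in readings if egfr < threshold)
    # Look for any pair of sub-threshold readings whose gap falls in the window.
    for i, first in enumerate(low_dates):
        for second in low_dates[i + 1:]:
            if min_gap_days <= (second - first).days <= max_gap_days:
                return True
    return False
```

For instance, sub-threshold readings on `date(2020, 1, 1)` and `date(2020, 5, 1)` qualify, whereas readings 19 days apart, or more than a year apart, do not.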
Results: Among 37,604 patients with diabetes, 14.6% met diagnostic criteria for CKD within 5 years. Overall performance of the 13 included CKD prediction models in CPCSSN was mixed: three models displayed moderate to strong discrimination (areas under the receiver-operating characteristic curve [AUROCs] > 0.70), whereas other AUROCs were as low as 0.508. After model updating, calibration was heterogeneous, with most models displaying some miscalibration. Some subgroups displayed considerable differences in performance: discriminative performance (AUROC) declined with increasing age and number of comorbidities, whereas precision and recall improved with increasing age and number of comorbidities. We observed no difference in performance according to sex/gender or deprivation quintile.
Conclusions: Three models displayed moderate to strong performance predicting CKD among CPCSSN patients. Next, these models should be evaluated for their impact on practitioner and patient outcomes when implemented in clinical practice. If successful, these models hold promise in achieving widespread adoption to help identify those at highest risk of CKD and guide therapies that may prevent or delay CKD and related sequelae (e.g., end-stage renal disease) among people with diabetes.
Title: Performance of clinical prediction models for chronic kidney disease among people with diabetes: external validation using the Canadian Primary Care Sentinel Surveillance Network (CPCSSN). Diagnostic and prognostic research, 9(1): 26. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12604207/pdf/
Pub Date: 2025-11-06 | DOI: 10.1186/s41512-025-00211-w
Yusuf Yildiz, Goran Nenadic, Meghna Jani, David A Jenkins
Objective: Large language models (LLMs) are attracting increasing interest in healthcare. This commentary evaluates the potential of LLMs to improve clinical prediction models (CPMs) for diagnostic and prognostic tasks, with a focus on their ability to process longitudinal electronic health record (EHR) data.
Findings: LLMs show promise in handling multimodal and longitudinal EHR data and can support multi-outcome prediction for diverse health conditions. However, methodological, validation, infrastructural, and regulatory challenges remain. These include inadequate methods for time-to-event modelling, poor calibration of predictions, limited external validation, and bias affecting underrepresented groups. High infrastructure costs and the absence of clear regulatory frameworks further hinder adoption.
Implications: Further work and interdisciplinary collaboration are needed to support equitable and effective integration of LLMs into clinical prediction. Developing temporally aware, fair, and explainable models should be a priority for transforming clinical prediction workflows.
Title: Will large language models transform clinical prediction? Diagnostic and prognostic research, 9(1): 28. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12590740/pdf/
Pub Date: 2025-11-04 | DOI: 10.1186/s41512-025-00207-6
Werner Vach
Five years ago, Korevaar and colleagues proposed a framework for designing diagnostic accuracy studies, focusing on the definition of clear study hypotheses. This proposal filled a gap and was well received by the scientific community. In this commentary, I suggest five potential refinements. They aim to increase the flexibility of the framework while preserving its logical consistency. The refinements address the following five topics: (1) the relationship between minimal criteria and the choice of the null hypothesis region; (2) the potential to allow compensation between sensitivity and specificity; (3) the possibility of using pairs other than sensitivity and specificity; (4) the potential phrasing as an estimation problem; (5) the advantages of moving directly to a comparative accuracy study.
Title: Targeted test evaluation: five suggestions for refining the framework for designing diagnostic accuracy studies with clear study hypotheses. Diagnostic and prognostic research, 9(1): 27. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12584363/pdf/
Pub Date: 2025-10-14 | DOI: 10.1186/s41512-025-00197-5
Sarwar I Mozumder, Sarah Booth, Richard D Riley, Mark J Rutherford, Paul C Lambert
Background: When developing or validating prognostic models, it is typical to assess calibration between predicted and observed risks, either in the development dataset or in an external sample. For competing risks data, correct specification of more than one model may be required to ensure well-calibrated predicted risks for the event of interest. Furthermore, interest may lie in the predicted risks of the event of interest, of competing events and of all causes combined. Therefore, calibration must be assessed simultaneously using various measures.
Methods: We focus on the calibration of prediction models for external validation using a cause-specific hazards approach. We propose that miscalibration of cause-specific hazard models be assessed using components specific to each model, through the complement of the cause-specific survival function, alongside assessment of the calibration of the cause-specific absolute risks. We simulated a range of scenarios to illustrate how to identify which model(s) are mis-specified in an external validation setting. Calibration plots and calibration statistics (calibration slope, calibration-in-the-large) are presented alongside performance measures such as the Brier score and the Index of Prediction Accuracy. We use pseudo-observations to calculate observed risks and generate a smooth calibration curve with restricted cubic splines. We fitted flexible parametric survival models to the simulated data to flexibly estimate baseline cause-specific hazards for the prediction of individual cause-specific absolute risks.
Results: Our simulations illustrate that miscalibration due to changes in the baseline cause-specific hazards in external validation data is better identified using components from each cause-specific model. A miscalibrated model for one cause can lead to poor calibration of the predicted absolute risks for every cause of interest, including the all-cause absolute risk. This is because the prediction of a single cause-specific absolute risk is affected by the effects of variables on both the cause of interest and the competing events.
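The dependence noted in the results can be seen in a minimal discrete-time sketch (not the paper's flexible parametric models): the cause-1 cumulative incidence accrues cause-1 failures only among those still event-free overall, so it is built from both cause-specific hazards.

```python
# Minimal discrete-time illustration of a cause-1 cumulative incidence function
# (CIF) computed from two cause-specific hazards.
def cumulative_incidence(h1, h2):
    """h1, h2: per-interval cause-specific hazards; returns the cause-1 CIF."""
    surv, total, cif1 = 1.0, 0.0, []
    for a, b in zip(h1, h2):
        total += surv * a        # probability of failing from cause 1 now
        surv *= (1.0 - a - b)    # probability of remaining free of both causes
        cif1.append(total)
    return cif1
```

With `h1 = [0.1, 0.1]`, raising `h2` from `[0.2, 0.2]` to `[0.4, 0.4]` lowers the two-interval cause-1 risk from 0.17 to 0.15 even though the cause-1 hazard is unchanged; hence a miscalibrated competing-cause model distorts every absolute risk.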
Conclusions: If accurate predictions for both all-cause and each cause-specific absolute risks are of interest, this is best achieved by developing and validating models via the cause-specific hazards approach. For each cause-specific model, researchers should evaluate calibration plots separately using the complement of the cause-specific survival function to reveal the cause of any miscalibration. However, this also requires careful consideration of dependent censoring which must be sufficiently accounted for.
Title: Calibration of cause-specific absolute risk for external validation using each cause-specific hazards model in the presence of competing events. Diagnostic and prognostic research, 9(1): 23. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12519608/pdf/
Pub Date: 2025-10-02 | DOI: 10.1186/s41512-025-00201-y
Sarah C Voter, Issa J Dahabreh, Christopher B Boyer, Habib Rahbar, Despina Kontos, Jon A Steingrimsson
Background: When a machine learning model is developed and evaluated in a setting where the treatment assignment process differs from the setting of intended model deployment, failure to account for this difference can lead to suboptimal model development and biased estimates of model performance.
Methods: We consider the setting where data from a randomized trial and an observational study emulating the trial are available for machine learning model development and evaluation. We provide two approaches for estimating the model and assessing model performance under a hypothetical treatment strategy in the target population underlying the observational study. The first approach uses counterfactual predictions from the observational study only and relies on the assumption of conditional exchangeability between treated and untreated individuals (no unmeasured confounding). The second approach leverages the exchangeability between treatment groups in the trial (supported by study design) to "transport" estimates from the trial to the population underlying the observational study, relying on an additional assumption of conditional exchangeability between the populations underlying the observational study and the randomized trial.
Results: We examine the assumptions underlying both approaches for fitting the model and estimating performance in the target population and provide estimators for both objectives. We then develop a joint estimation strategy that combines data from the trial and the observational study, and discuss benchmarking of the trial and observational results.
Conclusions: Both the observational and transportability analyses can be used to fit a model and estimate performance under a counterfactual treatment strategy in the population underlying the observational data, but they rely on different assumptions. In either case, the assumptions are untestable, and deciding which method is more appropriate requires careful contextual consideration. If all assumptions hold, then combining the data from the observational study and the randomized trial can be used for more efficient estimation.
Title: Counterfactual prediction from machine learning models: transportability and joint analysis for model development and evaluation using multi-source data. Diagnostic and prognostic research, 9(1): 22. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490139/pdf/
Pub Date: 2025-09-08 | DOI: 10.1186/s41512-025-00205-8
Henry J Domenico, Benjamin F Tillman, Shari L Just, Yeji Ko, Amanda S Mixon, Asli Weitkamp, Jonathan S Schildcrout, Colin Walsh, Thomas Ortel, Benjamin French
Background: Hospital-acquired venous thromboembolism (HA-VTE) is a leading cause of morbidity and mortality among hospitalized adults. Numerous prognostic models have been developed to identify those patients with elevated risk of HA-VTE. None, however, has met the necessary criteria to guide clinical decision-making. This study outlines a protocol for refining and validating a general-purpose prognostic model for HA-VTE, designed for real-time automation within the electronic health record (EHR) system.
Methods: A retrospective cohort of 132,561 inpatient encounters (89,586 individual patients) at a large academic medical center will be collected, along with clinical and demographic data available as part of routine care. Data for temporal, geographic, and domain external validation cohorts will also be collected. Logistic regression will be used to predict the occurrence of HA-VTE during an inpatient encounter. Variables considered for model inclusion will be based on prior demonstrated association with HA-VTE and their availability in both retrospective EHR data and routine clinical care. Least absolute shrinkage and selection operator (LASSO) with tenfold cross-validation will be used for initial variable selection. Variables selected by the LASSO procedure, along with those deemed necessary by clinicians, will be used in an unpenalized multivariable logistic regression model. Discrimination and calibration will be reported for the derivation and validation cohorts. Discrimination will be measured using Harrell's C statistic. Calibration will be measured using the calibration intercept, calibration slope, Brier score, integrated calibration index, and visual examination of the non-linear calibration curve. Model reporting will adhere to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis guidelines for clinical prediction models using machine learning methods (TRIPOD + AI).
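The two-stage strategy in the Methods — LASSO with tenfold cross-validation for initial variable selection, then an unpenalized logistic refit on the selected (plus clinician-mandated) variables — can be sketched as follows. This is an illustrative sketch on simulated data, not the study's code; variable indices and the near-unpenalized refit (very large `C`) are assumptions for illustration.

```python
# Hypothetical sketch of LASSO selection followed by an unpenalized refit.
# Data are simulated; only the first three covariates carry signal.
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

rng = np.random.default_rng(0)
n, p = 2000, 12
X = rng.normal(size=(n, p))
logit = 0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.5 * X[:, 2] - 2.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Stage 1: L1-penalized logistic regression, penalty strength chosen by
# tenfold cross-validation on the log-loss.
lasso = LogisticRegressionCV(
    penalty="l1", solver="liblinear", Cs=10, cv=10, scoring="neg_log_loss"
).fit(X, y)
selected = np.flatnonzero(np.abs(lasso.coef_.ravel()) > 1e-8)

# Stage 2: refit essentially without shrinkage on the selected columns
# (clinician-mandated variables would be appended to `selected` here).
# A very large C approximates an unpenalized fit.
refit = LogisticRegression(C=1e6, max_iter=1000).fit(X[:, selected], y)
```

The refit coefficients, unlike the shrunken LASSO coefficients, are the ones that would be reported and implemented in the EHR.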
Discussion: We describe methods for developing, evaluating, and validating a prognostic model for HA-VTE using routinely collected EHR data. By combining best practices in statistical development and validation, knowledge engineering, and clinical domain knowledge, the resulting model should be well suited for real-time clinical implementation. Although this protocol describes our development of a model for HA-VTE, the general approach can be applied to other clinical outcomes.
{"title":"Predicting venous thromboembolism among hospitalized adults: a protocol for development and validation of an implementable real-time prognostic model.","authors":"Henry J Domenico, Benjamin F Tillman, Shari L Just, Yeji Ko, Amanda S Mixon, Asli Weitkamp, Jonathan S Schildcrout, Colin Walsh, Thomas Ortel, Benjamin French","doi":"10.1186/s41512-025-00205-8","DOIUrl":"10.1186/s41512-025-00205-8","url":null,"abstract":"<p><strong>Background: </strong>Hospital-acquired venous thromboembolism (HA-VTE) is a leading cause of morbidity and mortality among hospitalized adults. Numerous prognostic models have been developed to identify those patients with elevated risk of HA-VTE. None, however, has met the necessary criteria to guide clinical decision-making. This study outlines a protocol for refining and validating a general-purpose prognostic model for HA-VTE, designed for real-time automation within the electronic health record (EHR) system.</p><p><strong>Methods: </strong>A retrospective cohort of 132,561 inpatient encounters (89,586 individual patients) at a large academic medical center will be collected, along with clinical and demographic data available as part of routine care. Data for temporal, geographic, and domain external validation cohorts will also be collected. Logistic regression will be used to predict occurrence of HA-VTE during an inpatient encounter. Variables considered for model inclusion will be based on prior demonstrated association with HA-VTE and their availability in both retrospective EHR data and routine clinical care. Least absolute shrinkage and selection operator (LASSO) with tenfold cross-validation will be used for initial variable selection. Variables selected by the LASSO procedure, along with those deemed necessary by clinicians, will be used in an unpenalized multivariable logistic regression model. Discrimination and calibration will be reported for the derivation and validation cohorts. 
Discrimination will be measured using Harrell's C statistic. Calibration will be measured using calibration intercept, calibration slope, Brier score, integrated calibration index, and visual examination of non-linear calibration curve. Model reporting will adhere to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis guidelines for clinical prediction models using machine learning methods (TRIPOD + AI).</p><p><strong>Discussion: </strong>We describe methods for developing, evaluating, and validating a prognostic model for HA-VTE using routinely collected EHR data. By combining best practices in statistical development and validation, knowledge engineering, and clinical domain knowledge, the resulting model should be well suited for real-time clinical implementation. Although this protocol describes our development of a model for HA-VTE, the general approach can be applied to other clinical outcomes.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"9 1","pages":"19"},"PeriodicalIF":2.6,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416065/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145016698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-09-01DOI: 10.1186/s41512-025-00203-w
Dorthe Odyl Klein, Nick Wilmes, Sophie F Waardenburg, Gouke J Bonsel, Erwin Birnie, Marieke Sjn Wintjens, Stella Cm Heemskerk, Emma Bnj Janssen, Chahinda Ghossein-Doha, Michiel C Warlé, Lotte Mc Jacobs, Bea Hemmen, Jeanine A Verbunt, Bas Ct van Bussel, Susanne van Santen, Bas Ljh Kietselaer, Gwyneth Jansen, Folkert W Asselbergs, Marijke Linschoten, Juanita A Haagsma, S M J van Kuijk
Background: A subset of COVID-19 patients develops post-COVID-19 condition (PCC). This condition results in disability in numerous areas of patients' lives and a reduced health-related quality of life, with societal impact including work absences and increased healthcare utilization. There is a scarcity of models predicting PCC, especially those considering the severity of the initial severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and incorporating long-term follow-up data. Therefore, we developed and internally validated a prediction model for PCC 2 years after SARS-CoV-2 infection in a cohort of COVID-19 patients.
Methods: Data from the CORona Follow-Up (CORFU) study were used. This research initiative integrated data from multiple Dutch COVID-19 cohort studies. We utilized 2-year follow-up data collected via questionnaires between October 1, 2021 and December 31, 2022. Participants were former COVID-19 patients, approximately 2 years post-SARS-CoV-2 infection. Candidate predictors were selected based on literature and availability across cohorts. The outcome of interest was the prevalence of PCC at 2 years after the initial infection. Logistic regression with backward stepwise elimination identified significant predictors such as sex, BMI and initial disease severity. The model was internally validated using bootstrapping. Model performance was quantified as model fit, discrimination and calibration.
Results: In total, 904 former COVID-19 patients were included in the analysis. The cohort included 146 (16.2%) non-hospitalized patients, 511 (56.5%) ward-admitted patients, and 247 (27.3%) patients admitted to the intensive care unit (ICU). Of all participants, 551 (61.0%) suffered from PCC. We included 20 candidate predictors in the multivariable analysis. The final model, after backward elimination, identified sex, body mass index (BMI), ward admission, ICU admission, and comorbidities such as arrhythmia, asthma, angina pectoris, previous stroke, hernia, osteoarthritis, and rheumatoid arthritis as predictors of post-COVID-19 condition. Nagelkerke's R-squared value for the model was 0.19. The optimism-adjusted AUC was 71.2%, and calibration was good across predicted probabilities.
Conclusions: This internally validated prediction model demonstrated moderate discriminative ability to predict PCC 2 years after COVID-19 based on sex, BMI, initial disease severity, and a collection of comorbidities.
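The "optimism-adjusted AUC" obtained from bootstrap internal validation, as described in the Methods, can be sketched as follows: refit the model on bootstrap resamples, measure how much each resample's apparent AUC exceeds its AUC on the original data, and subtract the average excess from the full-sample apparent AUC. This is an illustrative sketch on simulated data, not the CORFU analysis; the model, predictors, and number of resamples are assumptions.

```python
# Hedged sketch of Harrell-style bootstrap optimism correction for the AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 1500
X = rng.normal(size=(n, 5))
p_true = 1 / (1 + np.exp(-(0.9 * X[:, 0] - 0.7 * X[:, 1])))
y = rng.binomial(1, p_true)

# Apparent performance: model fit and evaluated on the same data.
model = LogisticRegression(max_iter=1000).fit(X, y)
apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])

optimism = []
for _ in range(100):  # several hundred resamples would be typical in practice
    idx = rng.integers(0, n, n)           # bootstrap resample with replacement
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimism.append(auc_boot - auc_orig)  # how much the resample flatters itself

adjusted = apparent - float(np.mean(optimism))
```

With many predictors relative to events, the average optimism grows and the adjusted AUC falls noticeably below the apparent AUC; with the generous sample-to-predictor ratio simulated here the correction is small.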
{"title":"Development and internal validation of a prediction model for post-COVID-19 condition 2 years after infection-results of the CORFU study.","authors":"Dorthe Odyl Klein, Nick Wilmes, Sophie F Waardenburg, Gouke J Bonsel, Erwin Birnie, Marieke Sjn Wintjens, Stella Cm Heemskerk, Emma Bnj Janssen, Chahinda Ghossein-Doha, Michiel C Warlé, Lotte Mc Jacobs, Bea Hemmen, Jeanine A Verbunt, Bas Ct van Bussel, Susanne van Santen, Bas Ljh Kietselaer, Gwyneth Jansen, Folkert W Asselbergs, Marijke Linschoten, Juanita A Haagsma, S M J van Kuijk","doi":"10.1186/s41512-025-00203-w","DOIUrl":"10.1186/s41512-025-00203-w","url":null,"abstract":"<p><strong>Background: </strong>A subset of COVID-19 patients develops post-COVID-19 condition (PCC). This condition results in disability in numerous areas of patients' lives and a reduced health-related quality of life, with societal impact including work absences and increased healthcare utilization. There is a scarcity of models predicting PCC, especially those considering the severity of the initial severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and incorporating long-term follow-up data. Therefore, we developed and internally validated a prediction model for PCC 2 years after SARS-CoV-2 infection in a cohort of COVID-19 patients.</p><p><strong>Methods: </strong>Data from the CORona Follow-Up (CORFU) study were used. This research initiative integrated data from multiple Dutch COVID-19 cohort studies. We utilized 2-year follow-up data collected via the questionnaires between October 1st of 2021 and December 31st of 2022. Participants were former COVID-19 patients, approximately 2-year post-SARS-CoV-2 infection. Candidate predictors were selected based on literature and availability across cohorts. The outcome of interest was the prevalence of PCC at 2 years after the initial infection. 
Logistic regression with backward stepwise elimination identified significant predictors such as sex, BMI and initial disease severity. The model was internally validated using bootstrapping. Model performance was quantified as model fit, discrimination and calibration.</p><p><strong>Results: </strong>In total 904 former COVID-19 patients were included in the analysis. The cohort included 146 (16.2%) non-hospitalized patients, 511 (56.5%) ward admitted patients, and 247 (27.3%) intensive care unit (ICU) admitted patients. Of all participants, 551 (61.0%) participants suffered from PCC. We included 20 candidate predictors in the multivariable analysis. The final model, after backward elimination, identified sex, body mass index (BMI), ward admission, ICU admission, and comorbidities such as arrhythmia, asthma, angina pectoris, previous stroke, hernia, osteoarthritis, and rheumatoid arthritis as predictors of post-COVID-19 condition. Nagelkerke's R-squared value for the model was 0.19. The optimism-adjusted AUC was 71.2%, and calibration was good across predicted probabilities.</p><p><strong>Conclusions: </strong>This internally validated prediction model demonstrated moderate discriminative ability to predict PCC 2 years after COVID-19 based on sex, BMI, initial disease severity, and a collection of comorbidities.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"9 1","pages":"18"},"PeriodicalIF":2.6,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12400538/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144980798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-11DOI: 10.1186/s41512-025-00202-x
Emmanuelle A Dankwa, Martyn Plummer, Daniel Chapman, Rima Jeske, Julia Butt, Michael Hill, Tim Waterboer, Iona Y Millwood, Ling Yang, Christiana Kartsonaki
Background: Helicobacter pylori (H. pylori) is a bacterium that colonizes the stomach and is a major risk factor for gastric cancer, with an estimated 89% of non-cardia gastric cancer cases worldwide attributable to H. pylori. Prospective studies provide reliable evidence for quantifying the association between gastric cancer and H. pylori, as they circumvent the risk of a false negative due to possible reduction in antibody levels before cancer development.
Methods: In a large-scale prospective study within the China Kadoorie Biobank, H. pylori infection is being analysed as a risk factor for gastric cancer. The presence of infection is typically determined by serological tests. The immunoblot test, although well established, is more labour-intensive and uses a larger amount of plasma than the alternative high-throughput multiplex serology test. Immunoblot outputs a binary positive/negative serostatus classification, while multiplex outputs a vector of continuous antigen measurements. When mapping such multidimensional continuous measurements onto a binary classification, statistical challenges arise in defining classification cut-offs and accounting for the differences in infection evidence provided by different antigens. We discuss these challenges and propose a novel solution to optimize the translation of the continuous measurements from multiplex serology into probabilities of H. pylori infection, using classification algorithms (Bayesian additive regression trees (BART), multidimensional monotone BART, logistic regression, random forest and elastic net). We (i) calibrate and apply classification models to predict probabilities of H. pylori infection given multiplex measurements, (ii) compare the predictive performance of the models using immunoblot as reference, (iii) discuss reasons for the differences in predictive performance and (iv) apply the calibrated models to gain insights on the relative strengths of infection evidence provided by the various antigens.
Results: All models showed high discriminative ability, with area under the curve (AUC) estimates of at least 95% on both the training and test data. There was no substantial difference between model performance on the training and test data.
Conclusions: Classification algorithms can be used to calibrate the H. pylori multiplex serology test to the immunoblot test in the China Kadoorie Biobank. This study furthers our understanding of the applicability of classification algorithms to the context of serologic tests.
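The core task described above — mapping a vector of continuous antigen measurements onto a probability of infection, using the binary immunoblot serostatus as the reference label, and comparing classifier families on held-out data — can be sketched as below. The data, antigen count, and decision rule are simulated assumptions, not China Kadoorie Biobank measurements, and only two of the five classifier families named in the Methods are shown.

```python
# Illustrative sketch: calibrate classifiers on continuous antigen measurements
# against a binary reference serostatus, then compare held-out AUCs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n, n_antigens = 3000, 6
antigens = rng.gamma(shape=2.0, scale=1.0, size=(n, n_antigens))
# Assume infection elevates the first three antigens; the immunoblot
# reference label reflects their combined level plus measurement noise.
score = antigens[:, :3].sum(axis=1)
serostatus = (score + rng.normal(scale=1.0, size=n) > 7.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    antigens, serostatus, test_size=0.3, random_state=0
)

aucs = {}
for name, clf in (
    ("logistic", LogisticRegression(max_iter=1000)),
    ("random_forest", RandomForestClassifier(random_state=0)),
):
    clf.fit(X_tr, y_tr)
    # predict_proba yields the continuous probability of infection that the
    # study derives from the multiplex measurements.
    aucs[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

The fitted classifiers' predicted probabilities, rather than a hard cut-off on any single antigen, are what get carried forward as the estimated probability of H. pylori infection.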
{"title":"Calibrating multiplex serology for Helicobacter pylori.","authors":"Emmanuelle A Dankwa, Martyn Plummer, Daniel Chapman, Rima Jeske, Julia Butt, Michael Hill, Tim Waterboer, Iona Y Millwood, Ling Yang, Christiana Kartsonaki","doi":"10.1186/s41512-025-00202-x","DOIUrl":"10.1186/s41512-025-00202-x","url":null,"abstract":"<p><strong>Background: </strong>Helicobacter pylori (H. pylori) is a bacterium that colonizes the stomach and is a major risk factor for gastric cancer, with an estimated 89% of non-cardia gastric cancer cases worldwide attributable to H. pylori. Prospective studies provide reliable evidence for quantifying the association between gastric cancer and H. pylori, as they circumvent the risk of a false negative due to possible reduction in antibody levels before cancer development.</p><p><strong>Methods: </strong>In a large-scale prospective study within the China Kadoorie Biobank, H. pylori infection is being analysed as a risk factor for gastric cancer. The presence of infection is typically determined by serological tests. The immunoblot test, although well established, is more labour intensive and uses a larger amount of plasma than the alternative high-throughput multiplex serology test. Immunoblot outputs a binary positive/negative serostatus classification, while multiplex outputs a vector of continuous antigen measurements. When mapping such multidimensional continuous measurements onto a binary classification, statistical challenges arise in defining classification cut-offs and accounting for the differences in infection evidence provided by different antigens. We discuss these challenges and propose a novel solution to optimize the translation of the continuous measurements from multiplex serology into probabilities of H. pylori infection, using classification algorithms (Bayesian additive regression trees (BART), multidimensional monotone BART, logistic regression, random forest and elastic net). 
We (i) calibrate and apply classification models to predict probabilities of H. pylori infection given multiplex measurements, (ii) compare the predictive performance of the models using immunoblot as reference, (iii) discuss reasons for the differences in predictive performance and (iv) apply the calibrated models to gain insights on the relative strengths of infection evidence provided by the various antigens.</p><p><strong>Results: </strong>All models showed high discriminative ability with at least 95% area under the curve (AUC) estimates on the training and test data. There was no substantial difference between the performance of models on the training and test data.</p><p><strong>Conclusions: </strong>Classification algorithms can be used to calibrate the H. pylori multiplex serology test to the immunoblot test in the China Kadoorie Biobank. This study furthers our understanding of the applicability of classification algorithms to the context of serologic tests.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"9 1","pages":"17"},"PeriodicalIF":2.6,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12337413/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144818449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}