The continuous net benefit: assessing the clinical utility of prediction models when informing a continuum of decisions.
Jose Benitez-Aurioles, Laure Wynants, Niels Peek, Patrick Goodley, Philip Crosbie, Matthew Sperrin
Pub Date: 2026-02-17 | DOI: 10.1186/s41512-026-00224-z | Diagnostic and Prognostic Research, 10(1): 8
The impact of violation of the proportional hazards assumption on the discrimination of the Cox proportional hazards model.
Peter C Austin, Daniele Giardiello
Pub Date: 2026-02-12 | DOI: 10.1186/s41512-026-00223-0 | Diagnostic and Prognostic Research, 10(1): 7
Background: The Cox proportional hazards regression model is frequently used to estimate an individual's probability of experiencing an outcome within a specified prediction horizon. A key assumption of this model is that of proportional hazards. An important component of validating a prediction model is assessing its discrimination. Discrimination refers to the ability of predicted risk to separate those who do and do not experience the event. The impact of violation of the proportional hazards assumption on the discrimination of risk estimates obtained from a Cox model has not been examined.
Methods: We used Monte Carlo simulations to assess the impact of the magnitude of the violation of the proportional hazards assumption on the discrimination of a Cox model as assessed using the time-varying area under the curve and on predictive accuracy as assessed using the time-varying index of predictive accuracy.
Results: Compared to settings in which the proportional hazards assumption was satisfied, discrimination and predictive accuracy decreased in settings in which the log-hazard ratio was positively associated with time. Conversely, compared to settings in which the proportional hazards assumption was satisfied, discrimination and predictive accuracy increased in settings in which the log-hazard ratio was negatively associated with time. Compared with the use of a Cox regression model, the use of accelerated failure time parametric survival models, Royston and Parmar's spline-based parametric survival models, and generalized linear models using pseudo-observations did not result in estimates with improved discrimination or predictive accuracy in settings in which the proportional hazards assumption was violated.
Conclusions: Violation of the proportional hazards assumption affected the discrimination of predictions obtained using a Cox regression model, with the direction of the effect depending on whether the log-hazard ratio increased or decreased over time.
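To make the simulation design concrete, here is a minimal Python sketch of non-proportional hazards and a time-dependent AUC. It assumes a single standardised predictor, a piecewise-constant log-hazard ratio (one simple form of non-proportionality), no censoring, and the true linear predictor as the risk score; all of these are simplifications relative to the paper's design, and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4000
x = rng.normal(size=n)  # standardised predictor; also used as the risk score

def simulate_times(x, beta_early, beta_late, t_change=1.0, h0=0.2):
    """Piecewise-exponential event times: log-HR beta_early before t_change,
    beta_late after it (beta_early == beta_late gives proportional hazards)."""
    r1 = h0 * np.exp(beta_early * x)
    t1 = rng.exponential(1.0 / r1)             # candidate time in [0, t_change)
    r2 = h0 * np.exp(beta_late * x)
    t2 = t_change + rng.exponential(1.0 / r2)  # time if subject survives t_change
    return np.where(t1 < t_change, t1, t2)

def cumulative_dynamic_auc(score, times, horizon):
    """Mann-Whitney estimate of P(score_case > score_control), with cases having
    events by `horizon` and controls surviving beyond it (no censoring here)."""
    cases = score[times <= horizon]
    controls = score[times > horizon]
    comp = cases[:, None] - controls[None, :]
    return (comp > 0).mean() + 0.5 * (comp == 0).mean()

scenarios = {
    "PH holds":                   (0.8, 0.8),
    "log-HR increases with time": (0.4, 1.2),
    "log-HR decreases with time": (1.2, 0.4),
}
for label, (b_early, b_late) in scenarios.items():
    t = simulate_times(x, b_early, b_late)
    aucs = [round(cumulative_dynamic_auc(x, t, h), 3) for h in (0.5, 2.0, 4.0)]
    print(f"{label:28s} AUC at t = 0.5, 2, 4: {aucs}")
```

Comparing the printed AUCs across horizons for the three scenarios mirrors the paper's contrast between settings where the log-hazard ratio is constant, rises, or falls with time.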
{"title":"The impact of violation of the proportional hazards assumption on the discrimination of the Cox proportional hazards model.","authors":"Peter C Austin, Daniele Giardiello","doi":"10.1186/s41512-026-00223-0","DOIUrl":"10.1186/s41512-026-00223-0","url":null,"abstract":"<p><strong>Background: </strong>The Cox proportional hazards regression model is frequently used to estimate an individual's probability of experiencing an outcome within a specified prediction horizon. A key assumption of this model is that of proportional hazards. An important component of validating a prediction model is assessing its discrimination. Discrimination refers to the ability of predicted risk to separate those who do and do not experience the event. The impact of violation of the proportional hazards assumption on the discrimination of risk estimates obtained from a Cox model has not been examined.</p><p><strong>Methods: </strong>We used Monte Carlo simulations to assess the impact of the magnitude of the violation of the proportional hazards assumption on the discrimination of a Cox model as assessed using the time-varying area under the curve and on predictive accuracy as assessed using the time-varying index of predictive accuracy.</p><p><strong>Results: </strong>Compared to settings in which the proportional hazards assumption was satisfied, discrimination and predictive accuracy decreased in settings in which the log-hazard ratio was positively associated with time. Conversely, compared to settings in which the proportional hazards assumption was satisfied, discrimination and predictive accuracy increased in settings in which the log-hazard ratio was negatively associated with time. Compared with the use of a Cox regression model, the use of accelerated failure time parametric survival models, Royston and Parmar's spline-based parametric survival models, and generalized linear models using pseudo-observations did not result in estimates with improved discrimination or predictive accuracy in settings in which the proportional hazards assumption was violated.</p><p><strong>Conclusions: </strong>Violation of the proportional hazards assumption had an effect on the discrimination of predictions obtained using a Cox regression model.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"10 1","pages":"7"},"PeriodicalIF":2.6,"publicationDate":"2026-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12895773/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146183189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Protocol for development of a reporting guideline (TRIPOD-Code) for code repositories associated with diagnostic and prognostic prediction model studies.
Tom Pollard, Thomas Sounack, Catherine A Gao, Leo Anthony Celi, Charlotta Lindvall, Hyeonhoon Lee, Hyung-Chul Lee, Karel G M Moons, Gary S Collins
Pub Date: 2026-02-10 | DOI: 10.1186/s41512-025-00217-4 | Diagnostic and Prognostic Research, 10(1): 4
A comparison of methodological approaches to developing clinical prediction models for individuals living with multiple long-term conditions: a protocol for a systematic review.
Lauren A Anderson, Joie Ensor, Clare L Gillies, Selina T Lock, Kamlesh Khunti, Laura J Gray
Pub Date: 2026-02-06 | DOI: 10.1186/s41512-026-00221-2 | Diagnostic and Prognostic Research, 10(1): 6
Machine learning-based COVID-19 prognostic models lag behind in reporting quality: findings from a TRIPOD/TRIPOD + AI systematic review.
Ioannis Partheniadis, Persefoni Talimtzi, Adriani Nikolakopoulou, Anna-Bettina Haidich
Pub Date: 2026-02-03 | DOI: 10.1186/s41512-026-00218-x | Diagnostic and Prognostic Research, 10(1): 3
Background: Reporting of COVID-19 prognostic models frequently falls short of established standards. The TRIPOD checklist and its 2024 AI extension (TRIPOD + AI) provide a comprehensive framework for assessing reporting quality. We therefore evaluated and compared reporting completeness in conventional versus machine-learning models.
Methods: Studies reporting the development and internal or external validation of prognostic prediction models for COVID-19, using either conventional or machine learning-based algorithms, were included. Literature searches were conducted in MEDLINE, Epistemonikos.org, and Scopus (up to July 31, 2024). Studies using conventional statistical methods were evaluated under TRIPOD, while machine learning-based studies were assessed using TRIPOD + AI. Data extraction followed the TRIPOD and TRIPOD + AI checklists, measuring adherence per article and per checklist item. The protocol was prospectively registered at the Open Science Framework (https://osf.io/kg9yw).
Results: A total of 53 studies describing 71 prognostic models were identified. Overall, adherence to both guidelines was low, with significantly poorer compliance among machine learning-based studies (TRIPOD + AI) than among conventional model studies (TRIPOD) (28.4% vs. 38.1%; difference 9.7 percentage points, 95% CI 4.1-15.4). No study fully adhered to the abstract reporting requirements, and appropriate titles were included in only a minority of cases (29.0%, 95% CI 16.1-46.6 for TRIPOD; 13.6%, 95% CI 4.8-33.3 for TRIPOD + AI). Sample size calculations were not fully reported in any study. Reporting of the methods and results sections was poor under both frameworks.
Conclusion: Lower adherence among machine learning studies reflects the relatively recent publication of the TRIPOD + AI guidelines (April 2024), which postdate many of the included studies. Both conventional and machine learning-based prediction models showed insufficient reporting, with major gaps in model description and performance reporting. Greater compliance with reporting guidelines is critical to improving the clarity, reproducibility, and clinical value of prediction model research.
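For readers wanting to check the arithmetic of the headline comparison, a Wald-type interval for a difference in adherence proportions can be computed as below; the denominators are hypothetical, and the paper's exact interval method is not stated in the abstract.

```python
from math import sqrt

def adherence_diff_ci(p1, n1, p2, n2, z=1.96):
    """Wald CI for the difference between two adherence proportions.

    p1, p2 are mean per-item adherence proportions; n1, n2 the number of
    checklist-item assessments behind each. Normal approximation only.
    """
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)

# Illustrative numbers only: 38.1% (TRIPOD) vs. 28.4% (TRIPOD + AI) adherence,
# with hypothetical denominators; the paper's exact CI method may differ.
diff, (lo, hi) = adherence_diff_ci(0.381, 900, 0.284, 800)
print(f"difference = {diff:.1%}, 95% CI ({lo:.1%}, {hi:.1%})")
```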
{"title":"Machine learning-based COVID-19 prognostic models lag behind in reporting quality: findings from a TRIPOD/TRIPOD + AI systematic review.","authors":"Ioannis Partheniadis, Persefoni Talimtzi, Adriani Nikolakopoulou, Anna-Bettina Haidich","doi":"10.1186/s41512-026-00218-x","DOIUrl":"10.1186/s41512-026-00218-x","url":null,"abstract":"<p><strong>Background: </strong>Reporting of COVID-19 prognostic models frequently falls short of established standards. The TRIPOD checklist and its 2024 AI extension (TRIPOD + AI) provide a comprehensive framework for assessing reporting quality. We therefore evaluated and compared reporting completeness in conventional versus machine-learning models.</p><p><strong>Methods: </strong>Studies reporting the development, and internal and external validation of prognostic prediction models for COVID-19 using either conventional or machine learning-based algorithms were included. Literature searches were conducted in MEDLINE, Epistemonikos.org, and Scopus (up to July 31, 2024). Studies using conventional statistical methods were evaluated under TRIPOD, while machine learning-based studies were assessed using TRIPOD + AI. Data extraction followed TRIPOD and TRIPOD + AI checklists, measuring adherence per article and per checklist item. The protocol was prospectively registered at the Open Science Framework ( https://osf.io/kg9yw ).</p><p><strong>Results: </strong>A total of 53 studies describing 71 prognostic models were identified. Overall, adherence to both guidelines was low, with significantly poorer compliance among machine learning-based studies (TRIPOD + AI) compared to conventional model studies (TRIPOD) (28.4% vs. 38.1%, 95% CI of difference: 4.1-15.4). No study fully adhered to abstract reporting requirements, and appropriate titles were included in only a minority of cases (29.0%, 95% CI: 16.1-46.6 for TRIPOD; 13.6%, 95% CI: 4.8-33.3 for TRIPOD + AI). Sample size calculations were not fully reported in any study. Reporting of methods and results sections was poor across both frameworks.</p><p><strong>Conclusion: </strong>Lower adherence among machine learning studies reflects the relatively recent publication of the TRIPOD + AI guidelines (April 2024), which postdate many of the included studies. Both conventional and machine learning-based prediction models showed insufficient reporting, with major gaps in model description and performance reporting. Greater compliance with reporting guidelines is critical to improving the clarity, reproducibility, and clinical value of prediction model research.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"10 1","pages":"3"},"PeriodicalIF":2.6,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12866346/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance of the Leicester risk assessment and Leicester practice risk scores for assessing the risk of undiagnosed type 2 diabetes or prediabetes in diverse populations: protocol for a systematic review of published validations and updates.
Louise Haddon, Joie Ensor, Kamlesh Khunti, Laura J Gray
Pub Date: 2026-01-15 | DOI: 10.1186/s41512-026-00219-w | Diagnostic and Prognostic Research, 10(1): 2
Background: Approximately one million adults in the UK are estimated to have undiagnosed type 2 diabetes mellitus (T2DM), with a further 5.1 million adults with nondiabetic hyperglycaemia (NDH) that does not meet the threshold for a diabetes diagnosis. The Leicester Risk Assessment score (LRA) and Leicester Practice Risk score (LPR) are diagnostic risk prediction models that estimate an individual's risk of undiagnosed T2DM and NDH, developed for use in community and primary care settings respectively. The LRA is also used as a prognostic model; neither model has been updated since development. This study will systematically review all applications of these models as diagnostic and prognostic tools and any published updates to evaluate their performance in different populations. This review has been registered with PROSPERO (CRD420251005841).
Methods: We will implement a citation search strategy to search Scopus, Web of Science and Google Scholar, restricted to full-text, English-language papers. Eligible papers will validate, update or modify either model. Data will be extracted using a form based on the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) checklist; missing information will be sought from authors or estimated from other available information where possible. Meta-analysis of predictive performance measures will be completed if sufficient data exist. Subgroup and sensitivity analyses will be used to explore between-study heterogeneity and the impact of risk of bias.
Discussion: This review will identify studies that have implemented, modified or validated the LRA and LPR for the risk of undiagnosed T2DM and NDH in different populations. This will allow summary measures, including level of uncertainty, of model performance to be calculated, making this highly relevant to individuals and stakeholders who recommend and implement these models. Review conclusions will also inform the potential update and recalibration of the models. This will ultimately lead to improved outcomes through earlier diagnosis and management.
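If the planned meta-analysis proves feasible, one common approach is random-effects pooling of c-statistics on the logit scale. The protocol does not commit to this exact method, so the sketch below (DerSimonian-Laird estimator, delta-method variances, hypothetical study inputs) is illustrative only.

```python
import numpy as np

def dl_meta_logit_c(c, se_c):
    """DerSimonian-Laird random-effects pooling of c-statistics on the logit
    scale, a common choice for meta-analysing discrimination; inputs are
    study c-statistics and their standard errors."""
    c, se_c = np.asarray(c, float), np.asarray(se_c, float)
    y = np.log(c / (1 - c))                  # logit-transformed c-statistics
    v = (se_c / (c * (1 - c))) ** 2          # delta-method variances, logit scale
    w = 1 / v                                # fixed-effect weights
    ybar = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - ybar) ** 2)          # Cochran's Q
    tau2 = max(0.0, (q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (v + tau2)                    # random-effects weights
    mu = np.sum(w_re * y) / np.sum(w_re)
    se_mu = np.sqrt(1 / np.sum(w_re))
    inv = lambda z: 1 / (1 + np.exp(-z))     # back-transform to c-statistic scale
    return inv(mu), (inv(mu - 1.96 * se_mu), inv(mu + 1.96 * se_mu)), tau2

# Hypothetical validation studies of a risk score (c-statistic, SE):
pooled, ci, tau2 = dl_meta_logit_c([0.72, 0.69, 0.75, 0.66],
                                   [0.02, 0.03, 0.025, 0.04])
print(f"pooled c = {pooled:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f}), tau^2 = {tau2:.4f}")
```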
Prediction models developed using artificial intelligence: similar predictive performances with highly varying predictions for individuals - an illustration in deep vein thrombosis.
Maerziya Yusufujiang, Constanza L Andaur Navarro, Johanna Aa Damen, Toshihiko Takada, Geert-Jan Geersing, Lotty Hooft, Ewoud Schuit, Karel Gm Moons, Valentijn Mt de Jong, Maarten van Smeden
Pub Date: 2026-01-08 | DOI: 10.1186/s41512-025-00216-5 | Diagnostic and Prognostic Research, 10(1): 1
Objectives: The rise in popularity and off-the-shelf availability of machine learning (ML) and AI-based methodology for developing new prediction models gives developers ample choice when comparing many candidate models and selecting the best performing one. Many studies have shown that in such comparisons on any particular dataset, the difference in performance between models developed using different techniques (e.g. logistic regression vs. random forests or neural networks) is often small, especially on crude performance measures such as the area under the ROC curve. This may lead to the conclusion that such models are essentially exchangeable and that model selection is arbitrary. However, as we illustrate using a dataset on deep venous thrombosis, prediction models with similar discriminative performance may nonetheless generate different outcome probability estimates for individual patients and potentially lead to meaningfully different decision making.
Methods: We developed diagnostic prediction models to predict the presence of deep venous thrombosis (DVT) in a large dataset of patients with leg symptoms suspected of having DVT, using five modelling techniques: unpenalized logistic regression (ULR), ridge logistic regression (RLR), random forests (RF), support vector machines (SVM) and neural networks (NN). Age, sex, D-dimer, history of DVT, an alternative diagnosis to DVT, and having cancer were used as a fixed set of predictors. Model performance was evaluated in terms of discrimination, calibration, and the stability of individual risk predictions across the models.
Results: Of the 6,087 suspected patients, 1,146 (19%) were diagnosed with DVT based on leg ultrasound (reference test). Three prediction models (ULR, RLR, NN) had similar discrimination, with AUC point estimates of 0.84. However, the 6,087 individuals' estimated probabilities of DVT varied substantially across the five modelling techniques, highlighting differences in prediction stability. Notably, the RF model tended to overestimate individual risks, while the SVM model tended to underestimate them compared to the other models. While the estimated probabilities were more similar for ULR, RLR and NN, classification measures (sensitivity, specificity, positive and negative predictive value) still differed because of differences in the estimated probabilities of individuals near the risk threshold, illustrating that even relatively small differences could lead to different clinical decisions.
Conclusions: Prediction models developed with different modelling techniques yielded very different outcome probabilities for individuals, even though the models had similar discriminative performance in this low-dimensional setting. Part of this variation can be explained by differences in calibration but also from modelling choices as estimated risks
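A toy reconstruction of this comparison on synthetic data, showing how the five techniques can reach similar AUCs while disagreeing on individual risks. The dataset, predictor coding, and tuning choices below are stand-ins rather than the authors' setup, and the numbers will differ from the paper's (penalty=None requires scikit-learn 1.2 or later).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the DVT data: 6 predictors, roughly 19% prevalence.
X, y = make_classification(n_samples=6000, n_features=6, n_informative=4,
                           weights=[0.81], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

models = {
    "ULR": LogisticRegression(penalty=None, max_iter=1000),
    "RLR": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    "RF":  RandomForestClassifier(n_estimators=500, random_state=1),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=1)),
    "NN":  make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                       random_state=1)),
}

preds = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    preds[name] = m.predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, preds[name]):.3f}")

# Prediction instability: spread of the five estimated probabilities per person.
p = np.column_stack(list(preds.values()))
print("median between-model range of individual risks:",
      np.median(p.max(axis=1) - p.min(axis=1)).round(3))
```

The last line summarises the paper's central point in one number: even when the AUCs agree, the highest and lowest of the five estimated probabilities for the same individual can sit far apart.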
{"title":"Prediction models developed using artificial intelligence: similar predictive performances with highly varying predictions for individuals - an illustration in deep vein thrombosis.","authors":"Maerziya Yusufujiang, Constanza L Andaur Navarro, Johanna Aa Damen, Toshihiko Takada, Geert-Jan Geersing, Lotty Hooft, Ewoud Schuit, Karel Gm Moons, Valentijn Mt de Jong, Maarten van Smeden","doi":"10.1186/s41512-025-00216-5","DOIUrl":"10.1186/s41512-025-00216-5","url":null,"abstract":"<p><strong>Objectives: </strong>The rise in popularity and off-the-shelf availability of machine learning (ML) and AI-based methodology to develop new prediction models provides developers with ample choices to compare and select the best performing model out of many possible models. Many studies have shown that such comparisons on any particular dataset, the difference in performance between models developed using different techniques (e.g. logistic regression, vs. random forest or neural networks) can often be small, especially when looking at crude performance measures such as the area under the ROC curve. This may lead to the conclusion that such models are essentially exchangeable, and model selection is arbitrary. However, as we will illustrate using a dataset on deep venous thrombosis, prediction models with similar discriminative performance may nonetheless generate different outcome probability estimates for individual patients and potentially lead to meaningfully different decision making.</p><p><strong>Methods: </strong>We developed diagnostic prediction models to predict the presence of deep venous thrombosis (DVT) in a large dataset of patients with leg symptoms suspected of having DVT, using five modelling techniques: unpenalized logistic regression (ULR), ridge logistic regression (RLR), random forests (RF), support vector machine (SVM) and neural network (NN). Age, sex, d-dimer, history of DVT, diagnosis alternative to DVT, and having cancer were used as a fixed set of predictors. Model performance was evaluated in terms of discrimination, calibration, and stability of individual risk prediction for a set of patients across the models.</p><p><strong>Results: </strong>Of the 6,087 suspected patients, 1,146 (19%) were diagnosed with DVT based on leg ultrasound (reference test). Three prediction models (ULR, RLR, NN) had similar discrimination with AUCs point estimates of 0.84. However, the 6087 individuals' estimated probabilities of DVT varied substantially across the five different modelling techniques, highlighting differences in prediction stability. Notably, the RF model tended to overestimate individual risks, while the SVM model tended to underestimate them compared to the other models. While the estimated probabilities were more similar for ULR, RLR and NN, classification measures (sensitivity, specificity, positive and negative predictive value) did differ because of differences in estimated probabilities of individuals near the risk threshold, illustrating that differences, even when relatively small, could potentially lead to different clinical decisions.</p><p><strong>Conclusions: </strong>Prediction models developed with different modeling techniques yielded very different individuals' outcome probabilities, even though the models had similar discriminative performance in this low-dimensional setting. 
Part of this variation can be explained by differences in calibration but also from modelling choices as estimated risks","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"10 1","pages":"1"},"PeriodicalIF":2.6,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12784591/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145936459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A decomposition of Fisher's information to inform sample size for developing or updating fair and precise clinical prediction models - part 2: time-to-event outcomes.
Richard D Riley, Gary S Collins, Lucinda Archer, Rebecca Whittle, Amardeep Legha, Laura Kirton, Paula Dhiman, Mohsen Sadatsafavi, Nicola J Adderley, Joseph Alderman, Glen P Martin, Joie Ensor
Pub Date: 2025-12-16 | DOI: 10.1186/s41512-025-00204-9 | Diagnostic and Prognostic Research, 9(1): 33
Background: When developing a clinical prediction model using time-to-event data (i.e. with censoring and differing lengths of follow-up), previous research has focused on the sample size needed to minimise overfitting and to estimate the overall risk precisely. However, instability of individual-level risk estimates may still be large.
Methods: We propose using a decomposition of Fisher's information matrix to examine and calculate the sample size required to develop a model that aims for precise and fair risk estimates. We propose a six-step process which can be used either before data collection or when an existing dataset is available. Steps 1 to 5 require researchers to specify: the overall risk in the target population at a key time-point of interest; an assumed pragmatic 'core model' in the form of an exponential regression model; the (anticipated) joint distribution of the core predictors included in that model; and the distribution of censoring times. The 'core model' can be specified directly or based on a specified C-index and relative effects of (standardised) predictors. The joint distribution of predictors may be available directly from an existing dataset, from a pilot study, or from a synthetic dataset provided by other researchers.
Results: We derive closed-form solutions that decompose the variance of an individual's estimated event rate into Fisher's unit information matrix, predictor values and total sample size; this allows researchers to calculate and examine uncertainty distributions around individual risk estimates and misclassification probabilities for specified sample sizes. We provide an illustrative example in breast cancer and emphasise the importance of clinical context, including any risk thresholds for decision-making, and we examine fairness concerns for pre- and postmenopausal women. Lastly, in two empirical evaluations, we provide reassurance that uncertainty interval widths based on our exponential approach are close to those obtained from more flexible parametric models.
Conclusions: Our approach allows users to identify the (target) sample size required to develop a prediction model for time-to-event outcomes, via the pmstabilityss module. It aims to facilitate models with improved trust, reliability and fairness in individual-level predictions.
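A numeric sketch of the calculation the paper describes, under illustrative assumptions (an exponential 'core model' with two standard-normal predictors and uniform administrative censoring; the paper's closed-form solutions and its pmstabilityss module are the authoritative implementation). For the exponential model, the expected unit information reduces to E[Pr(event observed | x) * x x^T], so Var(x^T beta_hat) is approximately x^T I^{-1} x / n.

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed 'core model': exponential hazard lambda(x) = exp(b0 + b1*x1 + b2*x2).
beta = np.array([-3.0, 0.5, 0.3])

def unit_information(n_mc=200_000, cmax=10.0):
    """Monte Carlo estimate of Fisher's unit information for the exponential
    model, E[Pr(event observed | x) * x x^T], assuming standard-normal
    predictors and uniform(0, cmax) administrative censoring."""
    X = np.column_stack([np.ones(n_mc), rng.normal(size=n_mc), rng.normal(size=n_mc)])
    lam = np.exp(X @ beta)
    c = rng.uniform(0, cmax, size=n_mc)
    p_event = 1 - np.exp(-lam * c)           # Pr(event before censoring | x)
    return (X * p_event[:, None]).T @ X / n_mc

I1 = unit_information()

def risk_interval(x, n, t=5.0):
    """95% uncertainty interval for an individual's t-year risk with n subjects."""
    x = np.asarray(x, float)
    var_lp = x @ np.linalg.solve(I1, x) / n  # Var(x^T beta_hat) = x^T I^{-1} x / n
    lp = x @ beta
    lo, hi = lp - 1.96 * np.sqrt(var_lp), lp + 1.96 * np.sqrt(var_lp)
    to_risk = lambda eta: 1 - np.exp(-t * np.exp(eta))  # monotone transform
    return to_risk(lp), (to_risk(lo), to_risk(hi))

for n in (500, 2000, 10_000):
    est, (lo, hi) = risk_interval([1.0, 1.5, -0.5], n)
    print(f"n = {n:>6}: 5-year risk {est:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```

Increasing n until each target individual's risk interval (and any misclassification probability at a clinical threshold) is acceptably narrow yields the required sample size.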
{"title":"A decomposition of Fisher's information to inform sample size for developing or updating fair and precise clinical prediction models - part 2: time-to-event outcomes.","authors":"Richard D Riley, Gary S Collins, Lucinda Archer, Rebecca Whittle, Amardeep Legha, Laura Kirton, Paula Dhiman, Mohsen Sadatsafavi, Nicola J Adderley, Joseph Alderman, Glen P Martin, Joie Ensor","doi":"10.1186/s41512-025-00204-9","DOIUrl":"10.1186/s41512-025-00204-9","url":null,"abstract":"<p><strong>Background: </strong>When developing a clinical prediction model using time-to-event data (i.e. with censoring and different lengths of follow-up), previous research focuses on the sample size needed to minimise overfitting and precisely estimating the overall risk. However, instability of individual-level risk estimates may still be large.</p><p><strong>Methods: </strong>We propose using a decomposition of Fisher's information matrix to help examine and calculate the sample size required for developing a model that aims for precise and fair risk estimates. We propose a six-step process which can be used either before data collection or when an existing dataset is available. Steps 1 to 5 require researchers to specify the overall risk in the target population at a key time-point of interest: an assumed pragmatic 'core model' in the form of an exponential regression model, the (anticipated) joint distribution of core predictors included in that model and the distribution of censoring times. The 'core model' can be specified directly or based on a specified C-index and relative effects of (standardised) predictors. The joint distribution of predictors may be available directly in an existing dataset, in a pilot study or in a synthetic dataset provided by other researchers.</p><p><strong>Results: </strong>We derive closed-form solutions that decompose the variance of an individual's estimated event rate into Fisher's unit information matrix, predictor values and total sample size; this allows researchers to calculate and examine uncertainty distributions around individual risk estimates and misclassification probabilities for specified sample sizes. We provide an illustrative example in breast cancer and emphasise the importance of clinical context, including any risk thresholds for decision-making, and examine fairness concerns for pre- and postmenopausal women. Lastly, in two empirical evaluations, we provide reassurance that uncertainty interval widths based on our exponential approach are close to using more flexible parametric models.</p><p><strong>Conclusions: </strong>Our approach allows users to identify the (target) sample size required to develop a prediction model for time-to-event outcomes, via the pmstabilityss module. It aims to facilitate models with improved trust, reliability and fairness in individual-level predictions.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"9 1","pages":"33"},"PeriodicalIF":2.6,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12709744/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Risk of bias in machine learning and statistical models to predict height or weight: a systematic review in fetal and paediatric medicine.
Neil R Lawrence, Irina Bacila, Joseph Tonge, Anthea Tucker, Jeremy Dawson, Zi-Qiang Lang, Nils P Krone, Paula Dhiman, Gary S Collins
Pub Date: 2025-12-15 | DOI: 10.1186/s41512-025-00215-6 | Diagnostic and Prognostic Research, 9(1): 32
Background: Prediction of suboptimal growth allows early intervention that can improve outcomes for developing fetuses as well as infants and children. We investigate the risk of bias in statistical and machine learning models that predict the height or weight of a fetus, infant or child under 20 years of age, to assess the current standard of research and provide insight into why equations developed over 30 years ago are still recommended for use by national professional bodies.
Methods: We systematically searched MEDLINE and EMBASE for peer reviewed original research studies published in 2022. We included studies if they developed or validated a multivariable model to predict height or weight of an individual using two or more variables, excluding studies assessing imaging or using genetics or metabolomics information. Risk of bias was assessed for all prediction models and analyses using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).
Results: Sixty-four studies were included, in which we assessed the development of 180 models and the validation of 61 models. Sample size was considered in only 10% of developed models and 13% of validated models. Despite height and weight being continuous variables, 77% of the developed models predicted a dichotomised outcome variable.
Registration: The review was registered on PROSPERO (ID: CRD42023421146), the International Prospective Register of Systematic Reviews, on 26 April 2023.
{"title":"Risk of bias in machine learning and statistical models to predict height or weight: a systematic review in fetal and paediatric medicine.","authors":"Neil R Lawrence, Irina Bacila, Joseph Tonge, Anthea Tucker, Jeremy Dawson, Zi-Qiang Lang, Nils P Krone, Paula Dhiman, Gary S Collins","doi":"10.1186/s41512-025-00215-6","DOIUrl":"10.1186/s41512-025-00215-6","url":null,"abstract":"<p><strong>Background: </strong>Prediction of suboptimal growth allows early intervention that can improve outcomes for developing fetus' as well as infants and children. We investigate the risk of bias in statistical or machine learning models to predict the height or weight of a fetus, infant or child under 20 years of age to inform the current standard of research and provide insight into why equations developed over 30 years ago are still recommended for use by national professional bodies.</p><p><strong>Methods: </strong>We systematically searched MEDLINE and EMBASE for peer reviewed original research studies published in 2022. We included studies if they developed or validated a multivariable model to predict height or weight of an individual using two or more variables, excluding studies assessing imaging or using genetics or metabolomics information. Risk of bias was assessed for all prediction models and analyses using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).</p><p><strong>Results: </strong>Sixty-four studies were included, in which we assessed the development of 180 models and validation of 61 models. Sample size was only considered in 10% of developed models and 13% of validated models. Despite height and weight being continuous variables, 77% of models developed predicted a dichotomised outcome variable.</p><p><strong>Registration: </strong>The review was registered on PROSPERO (ID: CRD42023421146), the International prospective register of systematic reviews on 26/4/2023.</p>","PeriodicalId":72800,"journal":{"name":"Diagnostic and prognostic research","volume":"9 1","pages":"32"},"PeriodicalIF":2.6,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12703889/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145758352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Natriuretic peptides testing and survival prediction models for chronic heart failure: a systematic review of added prognostic value.
Charlotte A Smith, Kathryn S Taylor, Nicholas R Jones, Dominik Roth, Amy Magona, Nia Roberts, Clare J Taylor, F D Richard Hobbs, Maria D L A Vazquez-Montes
Pub Date: 2025-12-09 | DOI: 10.1186/s41512-025-00210-x | Diagnostic and Prognostic Research, 9(1): 31
Background: High natriuretic peptide levels are associated with poor outcomes in adults with chronic heart failure (CHF). However, the incremental gain in predictive accuracy when B-type natriuretic peptide (BNP) and/or N-terminal proBNP (NT-proBNP) are added to multivariable prognostic models remains unclear.
Methods: We carried out a systematic review with narrative analysis of added-value studies of BNP and NT-proBNP in CHF prognostication. Primary clinical studies investigating prognostic model development or validation in adult participants with CHF were included. Studies of individual factors' association with patient outcomes or of treatment efficacy were excluded, as were studies in patients with transplants/ventricular assist devices, with ≥ 10% of patients having advanced HF or significant comorbidities, with HF secondary to congenital/reversible conditions, or with ≥ 33% of patients having valvular HF. The databases MEDLINE, Embase, Science Citation Index, and the Cochrane Prognosis Methods Group Database were searched from January 1990 to February 2024. Predictive performance was measured in terms of discrimination and calibration; added value in terms of the difference in c-statistic before and after adding BNP and/or NT-proBNP to a base model; and risk reclassification in terms of the net reclassification index (NRI) and integrated discrimination improvement (IDI). Risk of bias was assessed using the Prediction model Risk Of Bias ASsessment Tool (PROBAST).
Results: Fourteen added-value studies comprising a total of 50,949 individuals were included. Both BNP and NT-proBNP consistently improved mortality prediction performance, but studies only presented the before and after c-statistics separately, without formally testing whether the differences were statistically significant. Meta-analysis was impossible due to missing data on the change in predictive performance and heterogeneity across studies. All studies reported discrimination; few reported calibration, the NRI, or the IDI. All studies except one were deemed to be at high risk of bias, whereas 50% showed high applicability to the review question, with only 14% scoring high for applicability concern, and the rest unclear.
Conclusions: Improving consistency in researching and reporting the added value of natriuretic peptide testing to predict mortality in chronic heart failure patients could facilitate summarizing and interpreting the results more meaningfully.
Registration: This review is a refinement of the methods and a search update of the review of added-value biomarkers in HF prognosis (PROSPERO registration number: CRD42019086993).
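To illustrate the added-value quantities the review looked for, namely the change in c-statistic and the IDI after adding a natriuretic peptide to a base model, here is a self-contained sketch on simulated data. The variable names, effect sizes, and bootstrap scheme are illustrative, and it reports apparent (same-data) performance rather than validated performance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 3000
age = rng.normal(70, 10, n)
ef = rng.normal(35, 8, n)            # ejection fraction (illustrative)
log_bnp = rng.normal(6, 1, n)        # hypothetical log NT-proBNP
lp = -8 + 0.05 * age - 0.04 * ef + 0.6 * log_bnp
death = rng.binomial(1, 1 / (1 + np.exp(-lp)))

base = np.column_stack([age, ef])
full = np.column_stack([age, ef, log_bnp])

def cstat_and_idi(Xb, Xf, y):
    """Apparent delta c-statistic and IDI for base vs. base + biomarker."""
    pb = LogisticRegression(max_iter=1000).fit(Xb, y).predict_proba(Xb)[:, 1]
    pf = LogisticRegression(max_iter=1000).fit(Xf, y).predict_proba(Xf)[:, 1]
    dc = roc_auc_score(y, pf) - roc_auc_score(y, pb)
    # IDI: gain in separation of mean predicted risk between cases and non-cases.
    idi = (pf[y == 1].mean() - pf[y == 0].mean()) - \
          (pb[y == 1].mean() - pb[y == 0].mean())
    return dc, idi

dc, idi = cstat_and_idi(base, full, death)
boot = [cstat_and_idi(base[idx], full[idx], death[idx])
        for idx in (rng.integers(0, n, n) for _ in range(200))]
lo, hi = np.percentile([b[0] for b in boot], [2.5, 97.5])
print(f"delta c = {dc:.3f} (95% bootstrap CI {lo:.3f}, {hi:.3f}); IDI = {idi:.4f}")
```

Reporting the change in c-statistic with an interval, alongside calibration, the NRI and the IDI, is exactly the kind of consistent added-value reporting the review's conclusions call for.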