Pub Date: 2025-12-04 | DOI: 10.1016/j.jclinepi.2025.112097
Comparison between risk of bias-1 and risk of bias-2 tool and impact on network meta-analysis results—A case study from a living Cochrane review on psoriasis
R. Guelimi, C. Choudhary, C. Ollivier, Q. Beytout, Q. Samaran, A. Mubuangankusu, A. Chaimani, E. Sbidian, S. Afach, L. Le Cleach
Objectives
This study was conducted within a large Cochrane living systematic review (SR) on psoriasis treatments, with the aims of evaluating the inter-rater agreement of the Cochrane risk of bias 2 (RoB-2) tool, comparing its RoB judgments with those of the original RoB-1 tool, and exploring the impact of changes in RoB judgment between the two tools on the results of the Cochrane network meta-analysis (NMA).
Study Design and Setting
This study was conducted within the 2025 update of a living Cochrane review on systemic treatments for psoriasis. Four pairs of assessors used RoB-2 to evaluate the RoB of 193 randomized controlled trials for two primary outcomes: Psoriasis Area Severity Index (PASI) 90 (reflecting clear or almost clear skin) and serious adverse events (SAEs). Inter-rater reliability (IRR) was calculated using Cohen's kappa. RoB-2 judgments for 147 trials (PASI 90) and 154 trials (SAEs) were compared to the previous RoB-1 assessments from the Cochrane 2023 update. The impact of using RoB-2 vs. RoB-1 judgments on the NMA's results was explored through sensitivity analyses, with calculation of ratio of risk ratios (RRRs) between the analyses for each treatment effect.
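The two quantities in this design are standard; a minimal sketch of both, using hypothetical data and scikit-learn's cohen_kappa_score (our illustration, not the review's analysis code), might look like this:

```python
# Illustration only: hypothetical data, not the review's analysis code.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# RoB-2 overall judgments from two raters for the same six trials (hypothetical)
rater_a = ["low", "some concerns", "high", "high", "low", "some concerns"]
rater_b = ["low", "high", "high", "some concerns", "low", "some concerns"]
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")

# Ratio of risk ratios (RRR): for each treatment comparison, the risk ratio
# from the sensitivity analysis divided by the risk ratio from the main analysis.
rr_sensitivity = np.array([1.8, 2.4, 0.9])  # hypothetical RRs
rr_main = np.array([2.0, 2.2, 1.0])         # hypothetical RRs
rrr = rr_sensitivity / rr_main
print("RRRs:", np.round(rrr, 2), "| median RRR:", round(float(np.median(rrr)), 2))
```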
Results
For the RoB-2 overall judgment, the IRR was fair for PASI 90 (kappa = 0.37) and moderate for SAEs (kappa = 0.46). IRR varied between domains (from kappa = 0.33 to kappa = 0.65), with lower IRR found for domains 2, 3, and 5. Significant discrepancies were found between RoB-1 and RoB-2 judgments. Compared to RoB-1, RoB-2 rated a smaller proportion of results as low risk for both PASI 90 (36% vs 58%) and SAEs (13% vs 58%) and a higher proportion as high risk for SAEs (55% vs 29%). For PASI 90, 66/147 (45%) studies showed switches between different judgments, including 18 extreme switches either from low to high or from high to low RoB. For SAEs, 93/154 (60%) studies underwent switches between different judgments, with 32 extreme switches occurring exclusively from low to high RoB. Sensitivity analyses excluding high-risk trials showed moderate impact on the NMA efficacy results (median RRR = 0.92; interquartile range [IQR] 0.91–0.92), but wider changes for SAEs (median RRR = 1.07; IQR 0.97–1.15).
Conclusion
The transition to RoB-2 in a large Cochrane SR revealed fair-to-moderate inter-rater agreement, underscoring the need for consensus among reviewers. The shift from RoB-1 to RoB-2 led to changes in risk-of-bias judgments in our review. Although the impact on the NMA results was pronounced for SAEs, the changes in results were limited for our efficacy outcome PASI 90.
{"title":"Comparison between risk of bias-1 and risk of bias-2 tool and impact on network meta-analysis results—A case study from a living Cochrane review on psoriasis","authors":"R. Guelimi , C. Choudhary , C. Ollivier , Q. Beytout , Q. Samaran , A. Mubuangankusu , A. Chaimani , E. Sbidian , S. Afach , L. Le Cleach","doi":"10.1016/j.jclinepi.2025.112097","DOIUrl":"10.1016/j.jclinepi.2025.112097","url":null,"abstract":"<div><h3>Objectives</h3><div>This study was conducted within a large Cochrane living systematic review (SR) on psoriasis treatments with the aim to evaluate the inter-rater agreement of the Cochrane risk of bias tool 2 (RoB-2) tool, to compare its RoB judgments with the original RoB-1, and to explore the impact of changes in RoB judgment between the two tools on the Cochrane network meta-analysis’ (NMA) results.</div></div><div><h3>Study Design and Setting</h3><div>This study was conducted within the 2025 update of a living Cochrane review on systemic treatments for psoriasis. Four pairs of assessors used RoB-2 to evaluate the RoB of 193 randomized controlled trials for two primary outcomes: Psoriasis Area Severity Index (PASI) 90 (reflecting clear or almost clear skin) and serious adverse events (SAEs). Inter-rater reliability (IRR) was calculated using Cohen's kappa. RoB-2 judgments for 147 trials (PASI 90) and 154 trials (SAEs) were compared to the previous RoB-1 assessments from the Cochrane 2023 update. The impact of using RoB-2 vs. RoB-1 judgments on the NMA's results was explored through sensitivity analyses, with calculation of ratio of risk ratios (RRRs) between the analyses for each treatment effect.</div></div><div><h3>Results</h3><div>For the RoB-2 overall judgment, the IRR was fair for PASI 90 (kappa = 0.37) and moderate for SAEs (kappa = 0.46). IRR varied between domains (from kappa = 0.33, to kappa = 0.65), with lower IRR found for domains 2, 3, and 5. Significant discrepancies were found between RoB-1 and RoB-2 judgments. Compared to RoB-1, RoB-2 rated a smaller proportion of results as low risk for both PASI 90 (36% vs 58%) and SAEs (13% vs 58%) and a higher proportion as high risk for SAEs (55% vs 29%). For PASI 90, 66/147 (45%) studies showed switches between different judgments, including 18 extreme switches either from low to high or from high to low RoB. For SAEs, 93/154 (60%) studies underwent switches between different judgments, with 32 extreme switches occurring exclusively from low to high RoB. Sensitivity analyses excluding high-risk trials showed moderate impact on the NMA efficacy results (median RRR = 0.92, interquartile range (IQR), 0.91–0.92), but wider changes for SAEs (median RRR = 1.07, IQR, 0.97–1.15).</div></div><div><h3>Conclusion</h3><div>The transition to RoB-2 in a large Cochrane SR revealed fair-to-moderate inter-rater agreement, underscoring the need for consensus among reviewers. The shift from RoB-1 to RoB-2 led to changes in risk-of-bias judgments in our review. 
Although the impact on the NMA results was pronounced for SAEs, the changes in results were limited for our efficacy outcome PASI 90.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112097"},"PeriodicalIF":5.2,"publicationDate":"2025-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145696458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-03 | DOI: 10.1016/j.jclinepi.2025.112092
Omission of main effects from regression models with a ratio variable as the focal exposure can result in bias and inflated type I error rates
Matthew J. Valente, Biwei Cao, Daniëlle D.B. Holthuijsen, Martijn J.L. Bours, Simone J.P.M. Eussen, Matty P. Weijenberg, Judith J.M. Rijnhart
Objectives
Ratio variables (eg, body mass index (BMI), cholesterol ratios, and metabolite ratios) are widely used as exposure variables in epidemiologic studies of cause and effect. While statisticians have emphasized the importance of including the main effects of the variables that make up a ratio variable in regression models, main effects are still often omitted in practice. The objective of this study is to demonstrate how omitting main effects from regression models with a ratio variable as the focal exposure affects bias in the effect estimates and type I error rates.
Study Design and Setting
We demonstrated the impact of omitting main effects in three steps. First, we showed the connection between regression models with ratio variables and regression models with product terms, which are well-understood by epidemiologists. Second, we estimated models with and without main effects of a ratio variable using a real-life data example. Third, we performed a simulation study to demonstrate the impact of omitting main effects on bias and type I error rates.
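As a rough illustration of the third step, the following simulation sketch (our own, under assumed data-generating values; not the authors' code) shows how a model with only a ratio term can spuriously "detect" an effect when the outcome truly depends on the components:

```python
# Simulation sketch (ours): outcome depends on components x1 and x2, not on
# their ratio, so the ratio coefficient should test as null at ~5%.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n_sims, n, alpha = 1000, 200, 0.05
rej_ratio_only, rej_with_mains = 0, 0

for _ in range(n_sims):
    x1 = rng.normal(10, 2, n)      # hypothetical component, e.g., weight-like
    x2 = rng.normal(3, 0.5, n)     # hypothetical component, e.g., height-like
    ratio = x1 / x2
    y = 0.5 * x1 + 1.0 * x2 + rng.normal(0, 1, n)  # no true ratio effect

    m1 = sm.OLS(y, sm.add_constant(ratio)).fit()   # ratio only, mains omitted
    m2 = sm.OLS(y, sm.add_constant(np.column_stack([ratio, x1, x2]))).fit()
    rej_ratio_only += m1.pvalues[1] < alpha
    rej_with_mains += m2.pvalues[1] < alpha

print(f"Type I error, ratio-only model:  {rej_ratio_only / n_sims:.2f}")  # inflated
print(f"Type I error, with main effects: {rej_with_mains / n_sims:.2f}")  # ~0.05
```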
Results
We showed the impact of omitting main effects in regression models with ratio terms. In the real-life example, the ratio term was statistically significantly associated with the outcome only when main effects were omitted. The simulation study results indicated that the omission of main effects often leads to biased effect estimates and inflated type I error rates.
Conclusion
Regression models with a ratio term as an exposure variable need to include main effects to avoid bias in the effect estimates and inflated type I error rates.
{"title":"Omission of main effects from regression models with a ratio variable as the focal exposure can result in bias and inflated type I error rates","authors":"Matthew J. Valente , Biwei Cao , Daniëlle D.B. Holthuijsen , Martijn J.L. Bours , Simone J.P.M. Eussen , Matty P. Weijenberg , Judith J.M. Rijnhart","doi":"10.1016/j.jclinepi.2025.112092","DOIUrl":"10.1016/j.jclinepi.2025.112092","url":null,"abstract":"<div><h3>Objectives</h3><div>Ratio variables (eg, body mass index (BMI), cholesterol ratios, and metabolite ratios) are widely used as exposure variables in epidemiologic studies on cause-and-effect. While statisticians have emphasized the importance of including main effects of the variables that make up a ratio variable in regression models, main effects are still often omitted in practice. The objective of this study is to demonstrate the impact of omitting main effects from regression models with a ratio variable as the focal exposure on bias in the effect estimates and type I error rates.</div></div><div><h3>Study Design and Setting</h3><div>We demonstrated the impact of omitting main effects in three steps. First, we showed the connection between regression models with ratio variables and regression models with product terms, which are well-understood by epidemiologists. Second, we estimated models with and without main effects of a ratio variable using a real-life data example. Third, we performed a simulation study to demonstrate the impact of omitting main effects on bias and type I error rates.</div></div><div><h3>Results</h3><div>We showed the impact of omitting main effects in regression models with ratio terms. In the real-life example, the ratio term was only statistically significantly associated with the outcome when omitting main effects. The simulation study results indicated that the omission of main effects often leads to biased effect estimates and inflated type I error rates.</div></div><div><h3>Conclusion</h3><div>Regression models with a ratio term as an exposure variable need to include main effects to avoid bias in the effect estimates and inflated type I error rates.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112092"},"PeriodicalIF":5.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145688599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-03 | DOI: 10.1016/j.jclinepi.2025.112093
Integrating and standardizing functioning outcomes in rheumatoid arthritis pharmacological trials: a scoping review informed by the International Classification of Functioning, Disability and Health (ICF)
Adrian Martinez-De la Torre, Polina Leshetkina, Ogie Ahanor, Roxanne Maritz
Background
To examine how functioning-related outcomes in Phase III pharmacological clinical trials for rheumatoid arthritis (RA) align with the International Classification of Functioning, Disability and Health (ICF) Brief Core Set, and to identify which domains of functioning are most frequently represented.
Study Design and Setting
RA is a chronic autoimmune disease and a major cause of disability worldwide. While Phase III randomized controlled trials (RCTs) remain the gold standard for evaluating pharmacological treatments, they often rely on clinical and laboratory endpoints and overlook how therapies affect patients’ functioning. The ICF provides a standardized, patient-centered framework to assess functioning across key domains. A scoping review was conducted in accordance with the JBI methodology for scoping reviews and reported following PRISMA-ScR guidelines. Literature was searched in MEDLINE, EMBASE, and ClinicalTrials.gov from 2010 to 2025. Phase III RCTs evaluating pharmacological interventions in adult patients with RA were included. Functioning-related outcomes were extracted and mapped to ICF categories using standardized linking rules.
Results
Of 852 records screened, 91 met the inclusion criteria. Functioning was frequently assessed through patient-reported outcomes and composite clinical measures. The most commonly linked ICF categories were related to pain and joint mobility within the body functions domain, walking and carrying out daily activities within the activities and participation domain, and joint structures of the shoulder, upper, and lower limbs within body structures. Despite the broad representation, none of the studies explicitly used the ICF framework.
Conclusion
Functioning is often assessed in RA phase III RCTs, but only implicitly and without reference to the ICF framework. Explicitly integrating the ICF could bring greater standardization, comparability, and patient-centeredness in outcome measurement in pharmacological trials, not only in RA but across chronic conditions.
{"title":"Integrating and standardizing functioning outcomes in rheumatoid arthritis pharmacological trials: a scoping review informed by the International Classification of Functioning, Disability and Health (ICF)","authors":"Adrian Martinez-De la Torre , Polina Leshetkina , Ogie Ahanor , Roxanne Maritz","doi":"10.1016/j.jclinepi.2025.112093","DOIUrl":"10.1016/j.jclinepi.2025.112093","url":null,"abstract":"<div><h3>Background</h3><div>To examine how functioning-related outcomes in Phase III pharmacological clinical trials for rheumatoid arthritis (RA) align with the International Classification of Functioning, Disability and Health (ICF) Brief Core Set, and to identify which domains of functioning are most frequently represented.</div></div><div><h3>Study Design and Setting</h3><div>RA is a chronic autoimmune disease and a major cause of disability worldwide. While Phase III randomized controlled trials (RCTs) remain the gold standard for evaluating pharmacological treatments, they often rely on clinical and laboratory endpoints and overlook how therapies affect patients’ functioning. The International Classification of Functioning, Disability and Health (ICF) provides a standardized, patient-centered framework to assess functioning across key domains. A scoping review was conducted in accordance with the JBI methodology for scoping reviews and reported following PRISMA-ScR guidelines. Literature was searched in MEDLINE, EMBASE, and ClinicalTrials.gov from 2010 to 2025. Phase III RCTs evaluating pharmacological interventions in adult patients with RA were included. Functioning-related outcomes were extracted and mapped to ICF categories using standardized linking rules.</div></div><div><h3>Results</h3><div>Of 852 records screened, 91 met the inclusion criteria. Functioning was frequently assessed through patient-reported outcomes and composite clinical measures. The most commonly linked ICF categories were related to pain and joint mobility within the body functions domain, walking and carrying out daily activities within the activities and participation domain, and joint structures of the shoulder, upper, and lower limbs within body structures. Despite the broad representation, none of the studies explicitly used the ICF framework.</div></div><div><h3>Conclusion</h3><div>Functioning is often assessed in RA phase III RCTs, but only implicitly and without reference to the ICF framework. Explicitly integrating the ICF could bring greater standardization, comparability, and patient-centeredness in outcome measurement in pharmacological trials, not only in RA but across chronic conditions.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112093"},"PeriodicalIF":5.2,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145688644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-01 | DOI: 10.1016/j.jclinepi.2025.112072
Editors’ Choice December 2025
David Tovey, Andrea C. Tricco
{"title":"Editors’ Choice December 2025","authors":"David Tovey, Andrea C. Tricco","doi":"10.1016/j.jclinepi.2025.112072","DOIUrl":"10.1016/j.jclinepi.2025.112072","url":null,"abstract":"","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"188 ","pages":"Article 112072"},"PeriodicalIF":5.2,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145693692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-26 | DOI: 10.1016/j.jclinepi.2025.112090
The many roles of decision thresholds for primary research, evidence synthesis, and health decision-making
Holger J. Schünemann, Bernardo Sousa-Pinto, Samuel G. Schumacher, Jessie McGowan, David Tovey, Gian Paolo Morgano, Wojtek Wiercioch, Stephanie Chang, Ignacio Neumann
<div><h3>Background</h3><div>A decision threshold (DT) reflects the point at which a decision or judgment changes, leading to the selection of an action or a commitment for one of several alternatives. Thresholds have always played a role in decision-making. Very small effects may achieve statistical significance yet remain not important to patients or the public. Judgments shift, for instance, from “no or trivial effect” to “small, moderate, and large benefit” with direct implications for decision-making. However, in guideline panels and other clinical or policy decisions, these thresholds are often applied subconsciously when interpreting effect estimates from studies and likely to vary across panel members.</div></div><div><h3>Study Design and Setting</h3><div>In this commentary, inspired by the concepts leading to recent publications by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group and its members, we argue that the use of DTs has many advantages.</div></div><div><h3>Results</h3><div>DTs are the basis for an interpretation of results that is not centered on “statistical significance.” In addition, DTs are useful for other aspects of evidence synthesis. The certainty of evidence ratings using the GRADE approach (<span><span>https://book.gradepro.org/</span><svg><path></path></svg></span>) are centered on DTs, including the determination of the target of the certainty rating, with advantages for transparency, objectivity, and simplicity. For example, judging imprecision is informed by DTs. Specifically, the number of DTs crossed by the plausible effect sizes, as indicated by its confidence interval, helps determine the degree of uncertainty assigned in a GRADE assessment of imprecision, including the number of levels of certainty a user rates down. DTs have also altered the way how users can transparently integrate bodies of evidence from both nonrandomized and randomized studies. Once determined, DTs can be used to validate automated judgments about the certainty of evidence. Beyond these developments, DTs can be useful for designing primary research. For example, sample size calculations could use standardized DTs for large effects when there are known harms that the intended benefits need to outweigh.</div></div><div><h3>Conclusions</h3><div>DTs have many roles in interpretation, certainty assessments and research planning and design.</div></div><div><h3>Plain Language Summary</h3><div>Decision thresholds are the points where a decision changes—for example, when evidence shifts our judgment from “moderate benefit” to “large benefit.” Unlike statistical significance, decision thresholds focus on what matters to citizens and decision-makers. In health guidelines, these thresholds often influence judgments unconsciously, but making them explicit improves transparency and consistency. The GRADE approach uses decision thresholds to judge how certain we are about evidence, helping to make these judgmen
{"title":"The many roles of decision thresholds for primary research, evidence synthesis, and health decision-making","authors":"Holger J. Schünemann , Bernardo Sousa-Pinto , Samuel G. Schumacher , Jessie McGowan , David Tovey , Gian Paolo Morgano , Wojtek Wiercioch , Stephanie Chang , Ignacio Neumann","doi":"10.1016/j.jclinepi.2025.112090","DOIUrl":"10.1016/j.jclinepi.2025.112090","url":null,"abstract":"<div><h3>Background</h3><div>A decision threshold (DT) reflects the point at which a decision or judgment changes, leading to the selection of an action or a commitment for one of several alternatives. Thresholds have always played a role in decision-making. Very small effects may achieve statistical significance yet remain not important to patients or the public. Judgments shift, for instance, from “no or trivial effect” to “small, moderate, and large benefit” with direct implications for decision-making. However, in guideline panels and other clinical or policy decisions, these thresholds are often applied subconsciously when interpreting effect estimates from studies and likely to vary across panel members.</div></div><div><h3>Study Design and Setting</h3><div>In this commentary, inspired by the concepts leading to recent publications by the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group and its members, we argue that the use of DTs has many advantages.</div></div><div><h3>Results</h3><div>DTs are the basis for an interpretation of results that is not centered on “statistical significance.” In addition, DTs are useful for other aspects of evidence synthesis. The certainty of evidence ratings using the GRADE approach (<span><span>https://book.gradepro.org/</span><svg><path></path></svg></span>) are centered on DTs, including the determination of the target of the certainty rating, with advantages for transparency, objectivity, and simplicity. For example, judging imprecision is informed by DTs. Specifically, the number of DTs crossed by the plausible effect sizes, as indicated by its confidence interval, helps determine the degree of uncertainty assigned in a GRADE assessment of imprecision, including the number of levels of certainty a user rates down. DTs have also altered the way how users can transparently integrate bodies of evidence from both nonrandomized and randomized studies. Once determined, DTs can be used to validate automated judgments about the certainty of evidence. Beyond these developments, DTs can be useful for designing primary research. For example, sample size calculations could use standardized DTs for large effects when there are known harms that the intended benefits need to outweigh.</div></div><div><h3>Conclusions</h3><div>DTs have many roles in interpretation, certainty assessments and research planning and design.</div></div><div><h3>Plain Language Summary</h3><div>Decision thresholds are the points where a decision changes—for example, when evidence shifts our judgment from “moderate benefit” to “large benefit.” Unlike statistical significance, decision thresholds focus on what matters to citizens and decision-makers. In health guidelines, these thresholds often influence judgments unconsciously, but making them explicit improves transparency and consistency. 
The GRADE approach uses decision thresholds to judge how certain we are about evidence, helping to make these judgmen","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112090"},"PeriodicalIF":5.2,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145642547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-25 | DOI: 10.1016/j.jclinepi.2025.112089
Methodological guidance for individual participant data meta-analyses: a systematic review
Edith Ginika Otalike, Mike Clarke, Farjana Akhter, Areti Angeliki Veroniki, Ngianga-Bakwin Kandala, Joel J. Gagnier
<div><h3>Objectives</h3><div>To systematically identify and synthesize methodological guidance for conducting individual participant data meta-analyses (IPD-MAs) of randomized trials and observational studies, to inform the development of a critical appraisal tool for reports of IPD-MAs.</div></div><div><h3>Study Design and Setting</h3><div>We searched nine major electronic databases and gray literature sources through June 2025 using a strategy developed with a health sciences librarian. To be eligible, articles had to report empirical, simulation-based, consensus-based, or narrative research and offer guidance on the methodology of IPD-MA. Study selection and data extraction were performed independently by two reviewers. Quality was assessed using tools tailored to study design (eg, Aims, Data generating mechanism, Estimands, Methods, and Performance measures, Risk of Bias in Systematic Reviews, Appraisal of Guidelines for Research & Evaluation using Delphi, Scale for the Assessment of Narrative Review Articles). Extracted guidance was categorized thematically and mapped to appraisal domains.</div></div><div><h3>Results</h3><div>After screening 14,736 records, we included 141 studies. These encompassed simulation (38%), empirical (21%), and methodological guidance (12%), among others. Key themes included IPD-MA planning, data access and harmonization, analytical strategies, and other statistical issues, as well as reporting. While there was robust guidance for IPD-MA of randomized trials, recommendations for observational studies are sparse. Across all study types, 63% were rated high quality.</div></div><div><h3>Conclusion</h3><div>This review synthesizes previously fragmented guidance into an integrative synthesis, highlighting best practices and critical domains for evaluating IPD-MAs. These findings formed the evidence base for a Delphi consensus process to develop a dedicated IPD-MA critical appraisal tool.</div></div><div><h3>Plain Language Summary</h3><div>Meta-analyses often pool published summaries from many studies. That approach can miss important details and introduce bias. An IPD-MA instead reanalyses the original, participant-level data across studies. IPD-MAs are powerful but complex, and practical guidance is scattered, especially for observational studies. We wanted to bring these recommendations together in one place and identify candidate items for a tool to assess the quality of a completed IPD-MA. We systematically searched eight databases from their inception to 2025 to identify papers offering practical guidance on conducting IPD-MAs for health interventions. We organized guidance across the full project life cycle, from planning, finding and accessing data, to preparing and checking data, analyzing results, and reporting. We highlighted where experts broadly agree and where gaps remain. We found 141 relevant papers published between 1995 and 2025. Among these, we identified 25 key topic areas and several smaller subt
{"title":"Methodological guidance for individual participant data meta-analyses: a systematic review","authors":"Edith Ginika Otalike , Mike Clarke , Farjana Akhter , Areti Angeliki Veroniki , Ngianga-Bakwin Kandala , Joel J. Gagnier","doi":"10.1016/j.jclinepi.2025.112089","DOIUrl":"10.1016/j.jclinepi.2025.112089","url":null,"abstract":"<div><h3>Objectives</h3><div>To systematically identify and synthesize methodological guidance for conducting individual participant data meta-analyses (IPD-MAs) of randomized trials and observational studies, to inform the development of a critical appraisal tool for reports of IPD-MAs.</div></div><div><h3>Study Design and Setting</h3><div>We searched nine major electronic databases and gray literature sources through June 2025 using a strategy developed with a health sciences librarian. To be eligible, articles had to report empirical, simulation-based, consensus-based, or narrative research and offer guidance on the methodology of IPD-MA. Study selection and data extraction were performed independently by two reviewers. Quality was assessed using tools tailored to study design (eg, Aims, Data generating mechanism, Estimands, Methods, and Performance measures, Risk of Bias in Systematic Reviews, Appraisal of Guidelines for Research & Evaluation using Delphi, Scale for the Assessment of Narrative Review Articles). Extracted guidance was categorized thematically and mapped to appraisal domains.</div></div><div><h3>Results</h3><div>After screening 14,736 records, we included 141 studies. These encompassed simulation (38%), empirical (21%), and methodological guidance (12%), among others. Key themes included IPD-MA planning, data access and harmonization, analytical strategies, and other statistical issues, as well as reporting. While there was robust guidance for IPD-MA of randomized trials, recommendations for observational studies are sparse. Across all study types, 63% were rated high quality.</div></div><div><h3>Conclusion</h3><div>This review synthesizes previously fragmented guidance into an integrative synthesis, highlighting best practices and critical domains for evaluating IPD-MAs. These findings formed the evidence base for a Delphi consensus process to develop a dedicated IPD-MA critical appraisal tool.</div></div><div><h3>Plain Language Summary</h3><div>Meta-analyses often pool published summaries from many studies. That approach can miss important details and introduce bias. An IPD-MA instead reanalyses the original, participant-level data across studies. IPD-MAs are powerful but complex, and practical guidance is scattered, especially for observational studies. We wanted to bring these recommendations together in one place and identify candidate items for a tool to assess the quality of a completed IPD-MA. We systematically searched eight databases from their inception to 2025 to identify papers offering practical guidance on conducting IPD-MAs for health interventions. We organized guidance across the full project life cycle, from planning, finding and accessing data, to preparing and checking data, analyzing results, and reporting. We highlighted where experts broadly agree and where gaps remain. We found 141 relevant papers published between 1995 and 2025. 
Among these, we identified 25 key topic areas and several smaller subt","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112089"},"PeriodicalIF":5.2,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145642598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-24 | DOI: 10.1016/j.jclinepi.2025.112088
The Risk of Bias in Vaccine Effectiveness (RoB-VE) project: introduction to a methodological initiative to improve risk-of-bias assessment and reporting in vaccine effectiveness research
Cassandra Laurie, Pablo Alonso Coello, Ivan D. Florez, Maxime Lê, David Moher, Manish Sadarangani, Maria E. Sundaram, George Wells, Krista Wilkinson, Kerry Dwan, Scott A. Halperin, Stuart G. Nicholls, Barnaby C. Reeves, Hugh Sharma Waddington, Beverley Shea, Melissa Brouwers, Giorgia Sulis
<div><h3>Background and Objective</h3><div>Vaccine effectiveness (VE) studies are essential for informing immunization policy and public health decision-making. However, the observational nature of most VE studies introduces unique methodological challenges, including biases that are not adequately addressed by existing risk-of-bias (RoB) tools. The Risk of Bias in Vaccine Effectiveness (RoB-VE) project is an international, multiphase methodological research initiative aimed at improving the quality, transparency, interpretability, and reporting of VE research.</div></div><div><h3>Discussion</h3><div>Funded by the Canadian Institutes of Health Research and supported by many global partners, the project seeks to generate a comprehensive toolkit for VE studies. This includes an RoB assessment resource tailored to VE study designs and a complementary reporting guideline to enhance consistency in VE study reporting. The project follows an evidence-informed approach, beginning with a review of the literature to inform tool development, and progressing through interest holder engagement, modified Delphi consensus, usability testing, and beta validation. This introductory paper outlines the rationale, scope, and methodology of the RoB-VE project. These efforts aim to strengthen the methodological foundation of VE research and support more reliable evidence synthesis and policy development.</div></div><div><h3>Plain Language Summary</h3><div>VE studies measure how well vaccines work in real-world scenarios. These studies are essential for shaping vaccination recommendations. To assess the validity of VE studies, it is necessary to carry out an RoB assessment, which involves looking at different aspects of the study (eg, data collection methods, how participants are recruited, etc.) that have the potential to yield misleading results. Existing RoB assessment tools do not fully capture issues particularly relevant to VE studies and inconsistent reporting limits their usefulness. To address this, we are conducting the RoB-VE project. This project aims to improve the quality, transparency, interpretability, and reporting of VE research through the development, validation, and dissemination of a robust and user-friendly RoB assessment tool, specifically tailored for assessing VE studies. Our methodology involves a comprehensive multistep process based on established approaches. A broad range of international participants with diverse expertise and profiles will be engaged along the way to refine and finalize the tool. After pilot testing the beta version of the tool and making further refinements, we aim to deliver version 1 of the tool, which will undergo a large-scale application phase to assess its reliability and usefulness. Additionally, we will develop a reporting guideline to enhance the completeness of reporting of VE studies. This introductory paper outlines the rationale, scope, and methodology of the RoB-VE project. This project will elevate the st
{"title":"The Risk of Bias in Vaccine Effectiveness (RoB-VE) project: introduction to a methodological initiative to improve risk-of-bias assessment and reporting in vaccine effectiveness research","authors":"Cassandra Laurie , Pablo Alonso Coello , Ivan D. Florez , Maxime Lê , David Moher , Manish Sadarangani , Maria E. Sundaram , George Wells , Krista Wilkinson , Kerry Dwan , Scott A. Halperin , Stuart G. Nicholls , Barnaby C. Reeves , Hugh Sharma Waddington , Beverley Shea , Melissa Brouwers , Giorgia Sulis","doi":"10.1016/j.jclinepi.2025.112088","DOIUrl":"10.1016/j.jclinepi.2025.112088","url":null,"abstract":"<div><h3>Background and Objective</h3><div>Vaccine effectiveness (VE) studies are essential for informing immunization policy and public health decision-making. However, the observational nature of most VE studies introduces unique methodological challenges, including biases that are not adequately addressed by existing risk-of-bias (RoB) tools. The Risk of Bias in Vaccine Effectiveness (RoB-VE) project is an international, multiphase methodological research initiative aimed at improving the quality, transparency, interpretability, and reporting of VE research.</div></div><div><h3>Discussion</h3><div>Funded by the Canadian Institutes of Health Research and supported by many global partners, the project seeks to generate a comprehensive toolkit for VE studies. This includes an RoB assessment resource tailored to VE study designs and a complementary reporting guideline to enhance consistency in VE study reporting. The project follows an evidence-informed approach, beginning with a review of the literature to inform tool development, and progressing through interest holder engagement, modified Delphi consensus, usability testing, and beta validation. This introductory paper outlines the rationale, scope, and methodology of the RoB-VE project. These efforts aim to strengthen the methodological foundation of VE research and support more reliable evidence synthesis and policy development.</div></div><div><h3>Plain Language Summary</h3><div>VE studies measure how well vaccines work in real-world scenarios. These studies are essential for shaping vaccination recommendations. To assess the validity of VE studies, it is necessary to carry out an RoB assessment, which involves looking at different aspects of the study (eg, data collection methods, how participants are recruited, etc.) that have the potential to yield misleading results. Existing RoB assessment tools do not fully capture issues particularly relevant to VE studies and inconsistent reporting limits their usefulness. To address this, we are conducting the RoB-VE project. This project aims to improve the quality, transparency, interpretability, and reporting of VE research through the development, validation, and dissemination of a robust and user-friendly RoB assessment tool, specifically tailored for assessing VE studies. Our methodology involves a comprehensive multistep process based on established approaches. A broad range of international participants with diverse expertise and profiles will be engaged along the way to refine and finalize the tool. After pilot testing the beta version of the tool and making further refinements, we aim to deliver version 1 of the tool, which will undergo a large-scale application phase to assess its reliability and usefulness. Additionally, we will develop a reporting guideline to enhance the completeness of reporting of VE studies. 
This introductory paper outlines the rationale, scope, and methodology of the RoB-VE project. This project will elevate the st","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112088"},"PeriodicalIF":5.2,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145642635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-22 | DOI: 10.1016/j.jclinepi.2025.112086
Grading of Recommendations, Assessment, Development, and Evaluation guidance 44: strategies to enhance the utilization of randomized and nonrandomized studies in evidence syntheses of health interventions
Carlos A. Cuello-Garcia, Rebecca L. Morgan, Nancy Santesso, Pablo Alonso-Coello, Romina Brignardello-Petersen, Lukas Schwingshackl, Jan L. Brozek, Srinivasa Vittal Katikireddi, Zachary Munn, Hugh Sharma Waddington, Kevin C. Wilson, Joerg Meerpohl, Daniel Morales, Ignacio Neumann, Peter Tugwell, Gordon Guyatt, Holger J. Schünemann
Background and Objectives
Ideally, guideline developers and health technology assessment authors base intervention decisions on randomized controlled trials (RCTs). However, relying solely on RCTs is uncommon, especially for public health interventions and harms assessment. In these situations, nonrandomized studies of interventions (NRSIs) can provide valuable information. This article presents Grading of Recommendations Assessment, Development, and Evaluation (GRADE) guidance for integrating bodies of RCT and NRSI evidence in evidence syntheses of health interventions.
Methods
Following standard GRADE methods, we developed this guidance through iterative discussions and examples with experts from the GRADE NRSI project group in multiple dedicated meetings. We presented findings of the group discussions for feedback at GRADE Working Group meetings in September 2023 and May 2024.
Results
The resulting GRADE guidance outlines a structured approach: (1) assessing the certainty of evidence (CoE) after defining the number of decision thresholds and the target of the certainty rating; (2) evaluating congruency of effect estimates between RCTs and NRSIs; (3) identifying which GRADE domains are affected by certainty ratings to inform complementariness between RCTs and NRSIs and the overall CoE; and (4) deciding whether and how to use one or both types of studies.
Conclusion
This GRADE guidance offers a structured and practical approach for integrating or not integrating RCTs and NRSIs in evidence syntheses. By addressing the interplay between affected GRADE domains and assessing the congruency of effects, it helps GRADE users determine when and how NRSIs can meaningfully complement or replace RCT evidence to inform certainty ratings and decision-making.
{"title":"Grading of Recommendations, Assessment, Development, and Evaluation guidance 44: strategies to enhance the utilization of randomized and nonrandomized studies in evidence syntheses of healthinterventions","authors":"Carlos A. Cuello-Garcia , Rebecca L. Morgan , Nancy Santesso , Pablo Alonso-Coello , Romina Brignardello-Petersen , Lukas Schwingshackl , Jan L. Brozek , Srinivasa Vittal Katikireddi , Zachary Munn , Hugh Sharma Waddington , Kevin C. Wilson , Joerg Meerpohl , Daniel Morales , Ignacio Neumann , Peter Tugwell , Gordon Guyatt , Holger J. Schünemann","doi":"10.1016/j.jclinepi.2025.112086","DOIUrl":"10.1016/j.jclinepi.2025.112086","url":null,"abstract":"<div><h3>Background and Objectives</h3><div>Ideally, guideline developers and health technology assessment authors base intervention decisions on randomized controlled trials (RCTs). However, relying solely on RCTs is uncommon, especially for public health interventions and harms assessment. In these situations, nonrandomized studies of interventions (NRSIs) can provide valuable information. This article presents Grading of Recommendations Assessment, Development, and Evaluation (GRADE) guidance for integrating bodies of evidence RCT and NRSI in evidence syntheses of health interventions.</div></div><div><h3>Methods</h3><div>Following standard GRADE methods, we developed this guidance through iterative discussions and examples with experts from the GRADE NRSI project group in multiple dedicated meetings. We presented findings of the group discussions for feedback at GRADE Working Group meetings in September 2023 and May 2024.</div></div><div><h3>Results</h3><div>The resulting GRADE guidance outlines a structured approach: (1) assessing the certainty of evidence (CoE) after defining the number of decision thresholds and the target of the certainty rating; (2) evaluating congruency of effect estimates between RCTs and NRSIs; (3) identifying which GRADE domains are affected by certainty ratings to inform complementariness between RCTs and NRSIs and the overall CoE; and (4) deciding whether and how to use one or both types of studies.</div></div><div><h3>Conclusion</h3><div>This GRADE guidance offers a structured and practical approach for integrating or not integrating RCTs and NRSIs in evidence syntheses. By addressing the interplay between affected GRADE domains and assessing the congruency of effects, it helps GRADE users determine when and how NRSIs can meaningfully complement or replace RCT evidence to inform certainty ratings and decision-making.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112086"},"PeriodicalIF":5.2,"publicationDate":"2025-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145598039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-22 | DOI: 10.1016/j.jclinepi.2025.112087
Hill's considerations are not causal criteria
David A. Savitz, Neil Pearce, Kenneth J. Rothman
Hill's list of considerations for assessing causality, proposed 60 years ago, became a landmark in the interpretation of epidemiologic evidence. However, it has been and continues to be misused as a list of causal criteria to be scored and summed, despite causal inference being unattainable through the application of this or any other algorithm. Recognizing the distinction between statistical associations and causal effects was a key contribution of Hill. While he identified several clues for distinguishing between causal and noncausal associations, causal inference in epidemiology has become much more explicit and effective. Rather than relying on Hill's indirect hints of potential bias by considering strength of association or dose-response gradients, newer methods such as quantitative bias analysis directly assess confounding and other candidate biases that compete with causal explanations, leading to more informed inferences. Similarly, the interpretation of consistency depends on variation in methods across studies; triangulation may be used to search for informative inconsistencies, strengthening causal inference. Most importantly, a causal connection is not a categorical property bestowed upon an association based on Hill's considerations or any other checklist. Causal inference is an inherently indirect process, with the inference gradually crystallizing by withstanding challenges from competing theories in which other explanations, including random error or biases, are found not to account for the measured association.
{"title":"Hill's considerations are not causal criteria","authors":"David A. Savitz , Neil Pearce , Kenneth J. Rothman","doi":"10.1016/j.jclinepi.2025.112087","DOIUrl":"10.1016/j.jclinepi.2025.112087","url":null,"abstract":"<div><div>Hill's list of considerations for assessing causality, proposed 60 years ago, became a landmark in the interpretation of epidemiologic evidence. However, it has been and continues to be misused as a list of causal criteria to be scored and summed, despite causal inference being unattainable through the application of this or any other algorithm. Recognizing the distinction between statistical associations and causal effects was a key contribution of Hill. While he identified several clues for distinguishing between causal and noncausal associations, causal inference in epidemiology has become much more explicit and effective. Rather than relying on Hill's indirect hints of potential bias by considering strength of association or dose-response gradients, newer methods such as quantitative bias analysis directly assess confounding and other candidate biases that compete with causal explanations, leading to more informed inferences. Similarly, the interpretation of consistency depends on variation in methods across studies; triangulation may be used to search for informative inconsistencies, strengthening causal inference. Most importantly, a causal connection is not a categorical property bestowed upon an association based on Hill's considerations or any other checklist. Causal inference is an inherently indirect process, with the inference gradually crystallizing by withstanding challenges from competing theories in which other explanations, including random error or biases, are found not to account for the measured association.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112087"},"PeriodicalIF":5.2,"publicationDate":"2025-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145597988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-21 | DOI: 10.1016/j.jclinepi.2025.112082
Discrimination, calibration, and variable importance in statistical and machine learning models for predicting overall survival in advanced non–small cell lung cancer patients treated with immune checkpoint inhibitors
Lee X. Li, Ashley M. Hopkins, Richard Woodman, Ahmad Y. Abuhelwa, Yuan Gao, Natalie Parent, Andrew Rowland, Michael J. Sorich
Background and Objectives
Prognostic models can enhance clinician-patient communication and guide treatment decisions. Numerous machine learning (ML) algorithms are available and offer a novel approach to predicting survival in patients treated with immune checkpoint inhibitors. However, their performance, particularly calibration, has not been benchmarked at scale across multiple independent cohorts. This study aimed to develop, evaluate, and compare statistical and ML models regarding discrimination, calibration, and variable importance for predicting overall survival across seven clinical trial cohorts of patients with advanced non–small cell lung cancer (NSCLC) undergoing immune checkpoint inhibitor treatment.
Methods
This study included atezolizumab-treated patients with advanced NSCLC from seven clinical trials. We compared two statistical models, Cox proportional-hazards (Coxph) and accelerated failure time models, and six ML models: CoxBoost, extreme gradient boosting (XGBoost), gradient-boosting machines (GBMs), random survival forests, regularized Coxph models (least absolute shrinkage and selection operator [LASSO]), and support vector machines (SVMs). Models were evaluated on discrimination and calibration using a leave-one-study-out nested cross-validation (nCV) framework. Discrimination was assessed using Harrell's concordance index (Cindex), while calibration was assessed using the integrated calibration index (ICI) and calibration plots. Variable importance was assessed using Shapley Additive exPlanations (SHAP) values.
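A bare-bones sketch of the leave-one-study-out loop for one such model (our illustration using lifelines' Cox model and Harrell's C-index; column names are hypothetical, and the nested hyperparameter tuning of the full nCV framework is omitted):

```python
# Sketch only: hypothetical columns, no nested tuning as in the full nCV setup.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def loso_cindex(df: pd.DataFrame, study_col: str = "study",
                time_col: str = "os_months", event_col: str = "death") -> dict:
    """Fit a Cox model on all studies but one; score Harrell's C-index on the held-out study."""
    scores = {}
    for study in df[study_col].unique():
        train = df[df[study_col] != study].drop(columns=[study_col])
        test = df[df[study_col] == study].drop(columns=[study_col])
        cph = CoxPHFitter().fit(train, duration_col=time_col, event_col=event_col)
        risk = cph.predict_partial_hazard(test)  # higher risk = shorter expected survival
        scores[study] = concordance_index(test[time_col], -risk, test[event_col])
    return scores

# Usage with a hypothetical frame holding numeric predictors such as
# pretreatment NLR and ECOG performance status:
# scores = loso_cindex(df[["study", "os_months", "death", "nlr", "ecogps"]])
```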
Results
In a cohort of 3203 patients, the two statistical models and five of the six ML models demonstrated comparable, moderate discrimination (aggregated Cindex: 0.69–0.70), while SVM showed poor discrimination (aggregated Cindex: 0.57). Regarding calibration, the models appeared largely comparable in aggregated plots, except for LASSO, although the XGBoost models demonstrated numerically superior calibration. Across the evaluation cohorts, individual performance measures varied, and no single model consistently outperformed the others. Pretreatment neutrophil-to-lymphocyte ratio (NLR) and Eastern Cooperative Oncology Group Performance Status (ECOGPS) were ranked among the top five most important predictors across all models.
Conclusion
There was no clear best-performing model for either discrimination or calibration, although the XGBoost models showed numerically superior calibration. The performance of a given model varied across evaluation cohorts, highlighting the importance of model assessment using multiple independent datasets. All models identified pretreatment NLR and ECOGPS as the key prognostic factors.
{"title":"Discrimination, calibration, and variable importance in statistical and machine learning models for predicting overall survival in advanced non–small cell lung cancer patients treated with immune checkpoint inhibitors","authors":"Lee X. Li , Ashley M. Hopkins , Richard Woodman , Ahmad Y. Abuhelwa , Yuan Gao , Natalie Parent , Andrew Rowland , Michael J. Sorich","doi":"10.1016/j.jclinepi.2025.112082","DOIUrl":"10.1016/j.jclinepi.2025.112082","url":null,"abstract":"<div><h3>Background and Objectives</h3><div>Prognostic models can enhance clinician-patient communication and guide treatment decisions. Numerous machine learning (ML) algorithms are available and offer a novel approach to predicting survival in patients treated with immune checkpoint inhibitors. However, large-scale benchmarking of their performances—particularly in terms of calibration—has not been evaluated across multiple independent cohorts. This study aimed to develop, evaluate, and compare statistical and ML models regarding discrimination, calibration, and variable importance for predicting overall survival across seven clinical trial cohorts of advanced non–small cell lung cancer (NSCLC) undergoing immune checkpoint inhibitor treatment.</div></div><div><h3>Methods</h3><div>This study included atezolizumab-treated patients with advanced NSCLC from seven clinical trials. We compared two statistical models: Cox proportional-hazard (Coxph) and accelerated failure time models, and 6 ML models: CoxBoost, extreme gradient-boosting (XGBoost), gradient-boosting machines (GBMs), random survival forest, regularized Coxph models (least absolute shrinkage and selection operator [LASSO]), and support vector machines (SVMs). Models were evaluated on discrimination and calibration using a leave-one-study-out nested cross-validation (nCV) framework. Discrimination was assessed using Harrell's concordance index (Cindex), while calibration was assessed using integrated calibration index (ICI) and plot. Variable importance was assessed using Shapley Additive exPlanations (SHAP) values.</div></div><div><h3>Results</h3><div>In a cohort of 3203 patients, the two statistical models and 5 of the 6 ML models demonstrated comparable and moderate discrimination performances (aggregated Cindex: 0.69–0.70), while SVM exhibited poor discrimination (aggregated Cindex: 0.57). Regarding calibration, the models appeared largely comparable in aggregated plots, except for LASSO, although the XGBoost models demonstrated superior calibration numerically. Across the evaluation cohorts, individual performance measures varied and no single model consistently outperforming the others. Pretreatment neutrophil-to-lymphocyte ratios (NLRs) and Eastern Cooperative Oncology Group Performance Status (ECOGPS) were ranked among the top five most important predictors across all models.</div></div><div><h3>Conclusion</h3><div>There was no clear best-performing model for either discrimination or calibration, although XGBoost models showed possible superior calibration numerically. Performance of a given model varied across evaluation cohorts, highlighting the importance of model assessment using multiple independent datasets. 
All models identified pretreatment NLR and ECOGPS as the key prognostic factors.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112082"},"PeriodicalIF":5.2,"publicationDate":"2025-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145589810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}