Predictors of citation rates and the problem of citation bias: a scoping review
Birgitte Nørgaard, Karen E. Lie, Hans Lund
Pub Date: 2026-02-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.jclinepi.2025.112057
Objectives
To systematically map the factors associated with citation rates, to categorize the types of studies evaluating these factors, and to obtain an overall status of citation bias in the scientific health literature.
Study Design and Setting
A scoping review was reported following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses scoping review extension (PRISMA-ScR) checklist. Four electronic databases were searched, and the reference lists of all included articles were screened. Empirical meta-research studies reporting any source of predictors of citation rates and/or citation bias within health care were included. Data are presented using descriptive statistics such as frequencies, proportions, and percentages.
Results
A total of 165 studies were included. Fifty-four distinct factors of citation rates were evaluated in 786 quantitative analyses. Although the studies used the same basic methodological approach to calculate citation rates, 78 studies (48%) aimed to examine citation bias, whereas 79 studies (48%) aimed to optimize article characteristics to enhance the authors’ own citation rates. The remaining seven studies (4%) analyzed infrastructural characteristics at the publication level to make all studies more accessible.
Conclusion
Seventy-nine of the 165 included studies (48%) explicitly recommended modifying paper characteristics—such as title length or author count—to boost citations rather than prioritizing scientific contribution. Such recommendations may conflict with principles of scientific integrity, which emphasize relevance and methodological rigor over strategic citation practices. Given the high proportion of analyses identifying a significant increase in citation rates, publication bias cannot be ruled out.
Plain Language Summary
Why was the study done? Within scientific research, it is important to cite previous research. This is done for specific reasons, including crediting earlier authors and providing a credible and trustworthy background for conducting the study. However, findings suggest that citations are not always chosen for their intended purpose. This is known as citation bias. What did the researchers do? The researchers searched for all existing studies evaluating predictors of citation rate, ie, how often a specific study is referred to by other researchers. They systematically mapped these studies to find out both the level of citation bias and the types of citation bias present in the scientific health literature. To find these studies, the researchers searched four electronic databases and screened the reference lists of all included studies to be sure to include as many studies as possible. What did the researchers find? The researchers found a total of 165 studies that evaluated predictors of citation rate in no less than 786 analyses. However, the researchers found that the studie…
{"title":"Predictors of citation rates and the problem of citation bias: a scoping review","authors":"Birgitte Nørgaard , Karen E. Lie , Hans Lund","doi":"10.1016/j.jclinepi.2025.112057","DOIUrl":"10.1016/j.jclinepi.2025.112057","url":null,"abstract":"<div><h3>Objectives</h3><div>To systematically map the factors associated with citation rates, to categorize the types of studies evaluating these factors, and to obtain an overall status of citation bias in scientific health literature.</div></div><div><h3>Study Design and Setting</h3><div>A scoping review was reported following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses scoping review extension checklist. Four electronic databases were searched, and the reference-lists of all included articles were screened. Empirical meta-research studies reporting any source of predictors of citation rates and/or citation bias within health care were included. Data are presented by descriptive statistics such as frequencies, portions, and percentages.</div></div><div><h3>Results</h3><div>A total of 165 studies were included. Fifty-four distinct factors of citation rates were evaluated in 786 quantitative analyses. Regardless of using the same basic methodological approach to calculate citation rate, 78 studies (48%) aimed to examined citation bias, whereas 79 studies (48%) aimed to optimizing article characteristics to enhance authors’ own citation rates. The remaining seven studies (4%) analyzed infrastructural characteristics at publication level to make all studies more accessible.</div></div><div><h3>Conclusion</h3><div>Seventy-nine of the 165 included studies (48%) explicitly recommended modifying paper characteristics—such as title length or author count—to boost citations rather than prioritizing scientific contribution. Such recommendations may conflict with principles of scientific integrity, which emphasize relevance and methodological rigor over strategic citation practices. Given the high proportion of analyses identifying a significant increase in citation rates, publication bias cannot be ruled out.</div></div><div><h3>Plain Language Summary</h3><div>Why was the study done? Within scientific research, it is important to cite previous research. This is done for specific reasons, including crediting earlier authors and providing a credible and trustworthy background for conducting the study. However, findings suggest that citations are not always chosen for their intended purpose. This is known as citation bias. What did the researchers do? The researchers searched for all existing studies evaluating predictors of citation rate, ie, how often is a specific study referred to by other researchers. They systematically mapped these studies to find out both the level of citation bias and the types of citation bias present in scientific health literature. To find these studies, the researchers searched four electronic databases and screened the reference lists of all included studies to be sure to include as many studies as possible. What did the researchers find? The researchers found a total of 165 studies that evaluated predictors of citation rate in no less than 786 analyses. 
However, the researchers found that the studie","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112057"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145566074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Use of structured tools by peer reviewers of systematic reviews: a cross-sectional study reveals high familiarity with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) but limited use of other tools
Livia Puljak, Sara Pintur, Tanja Rombey, Craig Lockwood, Dawid Pieper
Pub Date: 2026-02-01 | DOI: 10.1016/j.jclinepi.2025.112084

Objectives
Systematic reviews (SRs) are pivotal to evidence-based medicine. Structured tools exist to guide their reporting and appraisal, such as the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) and A Measurement Tool to Assess Systematic Reviews (AMSTAR). However, there are limited data on whether peer reviewers of SRs use such tools when assessing manuscripts. This study aimed to investigate the use of structured tools by peer reviewers when assessing SRs of interventions, identify which tools are used, and explore perceived needs for structured tools to support the peer-review process.
Study Design and Setting
In 2025, we conducted a cross-sectional study targeting individuals who had peer-reviewed at least one SR of interventions in the past year. The online survey collected data on demographics and on use of and familiarity with structured tools, as well as open-ended responses on potential needs.
Results
Two hundred seventeen peer reviewers took part in the study. PRISMA was the most familiar tool (99% familiar or very familiar) and the most frequently used during peer review (53% always used it). The use of other tools, such as AMSTAR, Peer Review of Electronic Search Strategies (PRESS), A Risk of Bias Assessment Tool for Systematic Reviews (ROBIS), and the JBI checklist, was infrequent. Seventeen percent reported using structured tools beyond those listed. Most participants indicated that journals rarely required the use of structured tools, except PRISMA. A notable proportion (55%) expressed concerns about time constraints, and 25% noted the lack of a comprehensive tool. Nearly half (45%) expressed a need for a dedicated structured tool for SR peer review, with checklists in PDF or embedded formats preferred. Participants noted both advantages of and concerns about such tools.
Conclusion
Most peer reviewers used PRISMA when assessing SRs, while other structured tools were seldom applied. Only a few journals provided or required such tools, revealing inconsistent editorial practices. Participants reported barriers, including time constraints and a lack of suitable instruments. These findings highlight the need for a practical, validated tool, built upon existing instruments and integrated into editorial workflows. Such a tool could make peer review of SRs more consistent and transparent.
Plain Language Summary
Systematic reviews (SRs) are a type of research that synthesizes results from primary studies. Several structured tools, such as PRISMA for reporting and AMSTAR 2 for methodological quality, exist to guide how SRs are written and appraised. When manuscripts that report SRs are submitted to scholarly journals, editors invite expert peer reviewers to assess these SRs. In this study, researchers aimed to analyze which tools peer reviewers actually use when evaluating SR manuscripts, their percep…
{"title":"Use of structured tools by peer reviewers of systematic reviews: a cross-sectional study reveals high familiarity with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) but limited use of other tools","authors":"Livia Puljak , Sara Pintur , Tanja Rombey , Craig Lockwood , Dawid Pieper","doi":"10.1016/j.jclinepi.2025.112084","DOIUrl":"10.1016/j.jclinepi.2025.112084","url":null,"abstract":"<div><h3>Objectives</h3><div>Systematic reviews (SRs) are pivotal to evidence-based medicine. Structured tools exist to guide their reporting and appraisal, such as Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) and A Measurement Tool to Assess Systematic Reviews (AMSTAR). However, there are limited data on whether peer reviewers of SRs use such tools when assessing manuscripts. This study aimed to investigate the use of structured tools by peer reviewers when assessing SRs of interventions, identify which tools are used, and explore perceived needs for structured tools to support the peer-review process.</div></div><div><h3>Study Design and Setting</h3><div>In 2025, we conducted a cross-sectional study targeting individuals who peer-reviewed at least 1 SR of interventions in the past year. The online survey collected data on demographics, use, and familiarity with structured tools, as well as open-ended responses on potential needs.</div></div><div><h3>Results</h3><div>Two hundred seventeen peer reviewers took part in the study. PRISMA was the most familiar tool (99% familiar or very familiar) and most frequently used during peer review (53% always used). The use of other tools such as AMSTAR, Peer Review of Electronic Search Strategies (PRESS), A Risk of Bias Assessment Tool for Systematic Reviews (ROBIS), and JBI checklist was infrequent. Seventeen percent reported using other structured tools beyond those listed. Most participants indicated that journals rarely required use of structured tools, except PRISMA. A notable proportion (55%) expressed concerns about time constraints, and 25% noted the lack of a comprehensive tool. Nearly half (45%) expressed a need for a dedicated structured tool for SR peer review, with checklists in PDF or embedded formats preferred. Participants expressed both advantages and concerns related to such tools.</div></div><div><h3>Conclusion</h3><div>Most peer reviewers used PRISMA when assessing SRs, while other structured tools were seldom applied. Only a few journals provided or required such tools, revealing inconsistent editorial practices. Participants reported barriers, including time constraints and a lack of suitable instruments. These findings highlight the need for a practical, validated tool, built upon existing instruments and integrated into editorial workflows. Such a tool could make peer review of SRs more consistent and transparent.</div></div><div><h3>Plain Language Summary</h3><div>Systematic reviews (SRs) are a type of research that synthesizes results from primary studies. Several structured tools, such as PRISMA for reporting and AMSTAR 2 for methodological quality, exist to guide how SRs are written and appraised. When manuscripts that report SRs are submitted to scholarly journals, editors invite expert peer reviewers to assess these SRs. 
In this study, researchers aimed to analyze which tools peer reviewers actually use when evaluating SR manuscripts, their percep","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112084"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145582736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison between risk of bias-1 and risk of bias-2 tool and impact on network meta-analysis results—A case study from a living Cochrane review on psoriasis
R. Guelimi, C. Choudhary, C. Ollivier, Q. Beytout, Q. Samaran, A. Mubuangankusu, A. Chaimani, E. Sbidian, S. Afach, L. Le Cleach
Pub Date: 2026-02-01 | Epub Date: 2025-12-04 | DOI: 10.1016/j.jclinepi.2025.112097
Objectives
This study was conducted within a large Cochrane living systematic review (SR) on psoriasis treatments, with the aims of evaluating the inter-rater agreement of the Cochrane risk of bias 2 (RoB-2) tool, comparing its RoB judgments with those of the original RoB-1 tool, and exploring the impact of changes in RoB judgment between the two tools on the results of the Cochrane network meta-analysis (NMA).
Study Design and Setting
This study was conducted within the 2025 update of a living Cochrane review on systemic treatments for psoriasis. Four pairs of assessors used RoB-2 to evaluate the RoB of 193 randomized controlled trials for two primary outcomes: Psoriasis Area Severity Index (PASI) 90 (reflecting clear or almost clear skin) and serious adverse events (SAEs). Inter-rater reliability (IRR) was calculated using Cohen's kappa. RoB-2 judgments for 147 trials (PASI 90) and 154 trials (SAEs) were compared to the previous RoB-1 assessments from the Cochrane 2023 update. The impact of using RoB-2 vs. RoB-1 judgments on the NMA's results was explored through sensitivity analyses, with calculation of ratio of risk ratios (RRRs) between the analyses for each treatment effect.
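As a hypothetical sketch of the two quantities at the core of this design (not the authors' code; the judgments and risk ratios below are invented), Cohen's kappa can be computed from paired RoB-2 judgments, and a ratio of risk ratios from two analyses of the same treatment effect:

```python
# Minimal illustration of the two statistics used in this study design:
# Cohen's kappa for inter-rater agreement on RoB-2 judgments, and the
# ratio of risk ratios (RRR) comparing two analyses of one treatment
# effect. All values below are invented for the sketch.
from sklearn.metrics import cohen_kappa_score

# Overall RoB-2 judgments by two assessors on the same seven trials
rater_a = ["low", "some", "high", "high", "low", "some", "low"]
rater_b = ["low", "high", "high", "some", "low", "some", "some"]
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")

# RRR: risk ratio from the sensitivity analysis (eg, excluding trials
# rated high risk under RoB-2) divided by the main-analysis risk ratio.
rr_sensitivity, rr_main = 1.35, 1.47
print(f"RRR: {rr_sensitivity / rr_main:.2f}")  # near 1 = result barely moved
```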
Results
For the RoB-2 overall judgment, the IRR was fair for PASI 90 (kappa = 0.37) and moderate for SAEs (kappa = 0.46). IRR varied between domains (from kappa = 0.33 to kappa = 0.65), with lower IRR found for domains 2, 3, and 5. Significant discrepancies were found between RoB-1 and RoB-2 judgments. Compared to RoB-1, RoB-2 rated a smaller proportion of results as low risk for both PASI 90 (36% vs 58%) and SAEs (13% vs 58%) and a higher proportion as high risk for SAEs (55% vs 29%). For PASI 90, 66/147 (45%) studies showed switches between different judgments, including 18 extreme switches either from low to high or from high to low RoB. For SAEs, 93/154 (60%) studies underwent switches between different judgments, with 32 extreme switches occurring exclusively from low to high RoB. Sensitivity analyses excluding high-risk trials showed moderate impact on the NMA efficacy results (median RRR = 0.92, interquartile range (IQR) 0.91–0.92), but wider changes for SAEs (median RRR = 1.07, IQR 0.97–1.15).
Conclusion
The transition to RoB-2 in a large Cochrane SR revealed fair-to-moderate inter-rater agreement, underscoring the need for consensus among reviewers. The shift from RoB-1 to RoB-2 led to changes in risk-of-bias judgments in our review. Although the impact on the NMA results was pronounced for SAEs, the changes in results were limited for our efficacy outcome PASI 90.
{"title":"Comparison between risk of bias-1 and risk of bias-2 tool and impact on network meta-analysis results—A case study from a living Cochrane review on psoriasis","authors":"R. Guelimi , C. Choudhary , C. Ollivier , Q. Beytout , Q. Samaran , A. Mubuangankusu , A. Chaimani , E. Sbidian , S. Afach , L. Le Cleach","doi":"10.1016/j.jclinepi.2025.112097","DOIUrl":"10.1016/j.jclinepi.2025.112097","url":null,"abstract":"<div><h3>Objectives</h3><div>This study was conducted within a large Cochrane living systematic review (SR) on psoriasis treatments with the aim to evaluate the inter-rater agreement of the Cochrane risk of bias tool 2 (RoB-2) tool, to compare its RoB judgments with the original RoB-1, and to explore the impact of changes in RoB judgment between the two tools on the Cochrane network meta-analysis’ (NMA) results.</div></div><div><h3>Study Design and Setting</h3><div>This study was conducted within the 2025 update of a living Cochrane review on systemic treatments for psoriasis. Four pairs of assessors used RoB-2 to evaluate the RoB of 193 randomized controlled trials for two primary outcomes: Psoriasis Area Severity Index (PASI) 90 (reflecting clear or almost clear skin) and serious adverse events (SAEs). Inter-rater reliability (IRR) was calculated using Cohen's kappa. RoB-2 judgments for 147 trials (PASI 90) and 154 trials (SAEs) were compared to the previous RoB-1 assessments from the Cochrane 2023 update. The impact of using RoB-2 vs. RoB-1 judgments on the NMA's results was explored through sensitivity analyses, with calculation of ratio of risk ratios (RRRs) between the analyses for each treatment effect.</div></div><div><h3>Results</h3><div>For the RoB-2 overall judgment, the IRR was fair for PASI 90 (kappa = 0.37) and moderate for SAEs (kappa = 0.46). IRR varied between domains (from kappa = 0.33, to kappa = 0.65), with lower IRR found for domains 2, 3, and 5. Significant discrepancies were found between RoB-1 and RoB-2 judgments. Compared to RoB-1, RoB-2 rated a smaller proportion of results as low risk for both PASI 90 (36% vs 58%) and SAEs (13% vs 58%) and a higher proportion as high risk for SAEs (55% vs 29%). For PASI 90, 66/147 (45%) studies showed switches between different judgments, including 18 extreme switches either from low to high or from high to low RoB. For SAEs, 93/154 (60%) studies underwent switches between different judgments, with 32 extreme switches occurring exclusively from low to high RoB. Sensitivity analyses excluding high-risk trials showed moderate impact on the NMA efficacy results (median RRR = 0.92, interquartile range (IQR), 0.91–0.92), but wider changes for SAEs (median RRR = 1.07, IQR, 0.97–1.15).</div></div><div><h3>Conclusion</h3><div>The transition to RoB-2 in a large Cochrane SR revealed fair-to-moderate inter-rater agreement, underscoring the need for consensus among reviewers. The shift from RoB-1 to RoB-2 led to changes in risk-of-bias judgments in our review. 
Although the impact on the NMA results was pronounced for SAEs, the changes in results were limited for our efficacy outcome PASI 90.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112097"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145696458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methodological guidance for individual participant data meta-analyses: a systematic review
Edith Ginika Otalike, Mike Clarke, Farjana Akhter, Areti Angeliki Veroniki, Ngianga-Bakwin Kandala, Joel J. Gagnier
Pub Date: 2026-02-01 | Epub Date: 2025-11-25 | DOI: 10.1016/j.jclinepi.2025.112089
Objectives
To systematically identify and synthesize methodological guidance for conducting individual participant data meta-analyses (IPD-MAs) of randomized trials and observational studies, to inform the development of a critical appraisal tool for reports of IPD-MAs.
Study Design and Setting
We searched nine major electronic databases and gray literature sources through June 2025 using a strategy developed with a health sciences librarian. To be eligible, articles had to report empirical, simulation-based, consensus-based, or narrative research and offer guidance on the methodology of IPD-MA. Study selection and data extraction were performed independently by two reviewers. Quality was assessed using tools tailored to each study design (eg, the Aims, Data-generating mechanisms, Estimands, Methods, and Performance measures [ADEMP] framework; Risk of Bias in Systematic Reviews [ROBIS]; Appraisal of Guidelines for Research & Evaluation [AGREE] for Delphi-based studies; and the Scale for the Assessment of Narrative Review Articles [SANRA]). Extracted guidance was categorized thematically and mapped to appraisal domains.
Results
After screening 14,736 records, we included 141 studies. These encompassed simulation studies (38%), empirical studies (21%), and methodological guidance (12%), among others. Key themes included IPD-MA planning, data access and harmonization, analytical strategies and other statistical issues, and reporting. While there was robust guidance for IPD-MAs of randomized trials, recommendations for observational studies were sparse. Across all study types, 63% were rated high quality.
Conclusion
This review brings previously fragmented guidance into an integrative synthesis, highlighting best practices and critical domains for evaluating IPD-MAs. These findings formed the evidence base for a Delphi consensus process to develop a dedicated IPD-MA critical appraisal tool.
Plain Language Summary
Meta-analyses often pool published summaries from many studies. That approach can miss important details and introduce bias. An IPD-MA instead reanalyses the original, participant-level data across studies. IPD-MAs are powerful but complex, and practical guidance is scattered, especially for observational studies. We wanted to bring these recommendations together in one place and identify candidate items for a tool to assess the quality of a completed IPD-MA. We systematically searched eight databases from their inception to 2025 to identify papers offering practical guidance on conducting IPD-MAs for health interventions. We organized guidance across the full project life cycle, from planning, finding, and accessing data, to preparing and checking data, analyzing results, and reporting. We highlighted where experts broadly agree and where gaps remain. We found 141 relevant papers published between 1995 and 2025. Among these, we identified 25 key topic areas and several smaller subt…
{"title":"Methodological guidance for individual participant data meta-analyses: a systematic review","authors":"Edith Ginika Otalike , Mike Clarke , Farjana Akhter , Areti Angeliki Veroniki , Ngianga-Bakwin Kandala , Joel J. Gagnier","doi":"10.1016/j.jclinepi.2025.112089","DOIUrl":"10.1016/j.jclinepi.2025.112089","url":null,"abstract":"<div><h3>Objectives</h3><div>To systematically identify and synthesize methodological guidance for conducting individual participant data meta-analyses (IPD-MAs) of randomized trials and observational studies, to inform the development of a critical appraisal tool for reports of IPD-MAs.</div></div><div><h3>Study Design and Setting</h3><div>We searched nine major electronic databases and gray literature sources through June 2025 using a strategy developed with a health sciences librarian. To be eligible, articles had to report empirical, simulation-based, consensus-based, or narrative research and offer guidance on the methodology of IPD-MA. Study selection and data extraction were performed independently by two reviewers. Quality was assessed using tools tailored to study design (eg, Aims, Data generating mechanism, Estimands, Methods, and Performance measures, Risk of Bias in Systematic Reviews, Appraisal of Guidelines for Research & Evaluation using Delphi, Scale for the Assessment of Narrative Review Articles). Extracted guidance was categorized thematically and mapped to appraisal domains.</div></div><div><h3>Results</h3><div>After screening 14,736 records, we included 141 studies. These encompassed simulation (38%), empirical (21%), and methodological guidance (12%), among others. Key themes included IPD-MA planning, data access and harmonization, analytical strategies, and other statistical issues, as well as reporting. While there was robust guidance for IPD-MA of randomized trials, recommendations for observational studies are sparse. Across all study types, 63% were rated high quality.</div></div><div><h3>Conclusion</h3><div>This review synthesizes previously fragmented guidance into an integrative synthesis, highlighting best practices and critical domains for evaluating IPD-MAs. These findings formed the evidence base for a Delphi consensus process to develop a dedicated IPD-MA critical appraisal tool.</div></div><div><h3>Plain Language Summary</h3><div>Meta-analyses often pool published summaries from many studies. That approach can miss important details and introduce bias. An IPD-MA instead reanalyses the original, participant-level data across studies. IPD-MAs are powerful but complex, and practical guidance is scattered, especially for observational studies. We wanted to bring these recommendations together in one place and identify candidate items for a tool to assess the quality of a completed IPD-MA. We systematically searched eight databases from their inception to 2025 to identify papers offering practical guidance on conducting IPD-MAs for health interventions. We organized guidance across the full project life cycle, from planning, finding and accessing data, to preparing and checking data, analyzing results, and reporting. We highlighted where experts broadly agree and where gaps remain. We found 141 relevant papers published between 1995 and 2025. 
Among these, we identified 25 key topic areas and several smaller subt","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112089"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145642598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The measurement properties reliability and measurement error explained – a COSMIN perspective
Lidwine B. Mokkink, Iris Eekhout
Pub Date: 2026-02-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.jclinepi.2025.112058
Reliability and measurement error are related but distinct measurement properties. They are connected because both can be evaluated using the same data, typically collected in studies with repeated measurements of individuals who are stable on the outcome of interest. However, they are calculated using different statistical methods and refer to different quality aspects of measurement instruments. Measurement error refers to the precision of a measurement, that is, how similar or close the scores are across repeated measurements in a stable individual (variation within individuals). In contrast, reliability indicates an instrument's ability to distinguish between individuals, which depends both on the variation between individuals (ie, heterogeneity in the outcome being measured in the population) and on the precision of the scores, ie, the measurement error. Evaluating reliability helps to understand whether a particular source of variation (eg, occasion, type of machine, or rater) influences the score, and whether the measurement can be improved by better standardizing this source. Intraclass correlation coefficients, the standard error of measurement, and variance components are explained and illustrated with an example.
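In variance-component form, the usual definitions can be written as follows (a sketch assuming a simple decomposition into between-person and error variance; the paper's worked example may use a richer model with separate rater or occasion components):

```latex
\[
\mathrm{ICC} \;=\; \frac{\sigma^{2}_{\mathrm{between}}}{\sigma^{2}_{\mathrm{between}} + \sigma^{2}_{\mathrm{error}}},
\qquad
\mathrm{SEM} \;=\; \sqrt{\sigma^{2}_{\mathrm{error}}}
\]
```

For instance, with a between-person variance of 80 and an error variance of 20, the ICC is 80/100 = 0.80 and the SEM is √20 ≈ 4.5 score points; in a more homogeneous population with a between-person variance of 20, the identical error variance yields an ICC of only 0.50. This is exactly why reliability depends on the population studied while measurement error does not.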
{"title":"The measurement properties reliability and measurement error explained – a COSMIN perspective","authors":"Lidwine B. Mokkink , Iris Eekhout","doi":"10.1016/j.jclinepi.2025.112058","DOIUrl":"10.1016/j.jclinepi.2025.112058","url":null,"abstract":"<div><div>Reliability and measurement error are related but distinct measurement properties. They are connected because both can be evaluated using the same data, typically collected from studies involving repeated measurements in individuals who are stable on the outcome of interest. However, they are calculated using different statistical methods and refer to different quality aspects of measurement instruments. We explain that a measurement error refers to the precision of a measurement, that is, how similar or close the scores are across repeated measurements in a stable individual (variation within individuals). In contrast, reliability indicates an instrument's ability to distinguish between individuals, which depends both on the variation between individuals (ie, heterogeneity in the outcome being measured in the population) and the precision of the score, ie, the measurement error. Evaluating reliability helps to understand if a particular source of variation (eg, occasion, type of machine, or rater) influences the score, and whether the measurement can be improved by better standardizing this source. Intraclass-correlation coefficients, standards error of measurement, and variance components are explained and illustrated with an example.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112058"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145566035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Developing a framework for assessing the applicability of the target condition in diagnostic research
Eve Tomlinson, Jude Holmes, Anne W.S. Rutjes, Clare Davenport, Mariska Leeflang, Bada Yang, Sue Mallett, Penny Whiting
Pub Date: 2026-02-01 | Epub Date: 2025-11-15 | DOI: 10.1016/j.jclinepi.2025.112059
Objectives
Assessment of the applicability of primary studies is an essential but often challenging aspect of systematic reviews of diagnostic test accuracy studies (DTA reviews). We explored review authors’ applicability assessments for the QUADAS-2 reference standard domain within Cochrane DTA reviews. We highlight applicability concerns, identify potential issues with assessment, and develop a framework for assessing the applicability of the target condition as defined by the reference standard.
Study Design and Setting
Methodological review. DTA reviews in the Cochrane Library that used QUADAS-2 and judged applicability for the reference standard domain as “high concern” for at least one study were eligible. One reviewer extracted the rationale for the “high concern” judgment, and this was checked by a second reviewer. Two reviewers categorized the rationales inductively into themes, and a third reviewer verified these. Discussions of the extracted information informed framework development.
Results
We identified 50 eligible reviews. Five themes emerged: the study uses a different reference standard threshold to define the target condition (six reviews); misclassification by the reference standard such that the target condition in the study does not match the review question (11 reviews); the reference standard could not be applied to all participants, resulting in a different target condition (five reviews); misunderstanding of QUADAS-2 applicability (seven reviews); and insufficient information (21 reviews). Our framework for researchers outlines four potential applicability concerns for the assessment of the target condition as defined by the reference standard: different subcategories of the target condition, a different threshold used to define the target condition, the reference standard not applied to the full study group, and misclassification of the target condition by the reference standard.
Conclusion
Clear sources of applicability concerns are identifiable, but several Cochrane review authors struggle to adequately identify and report them. We have developed an applicability framework to guide review authors in their assessment of applicability concerns for the QUADAS-2 reference standard domain.
Plain Language Summary
What is the problem? Doctors use tests to help to decide if a person has a certain condition. They want to know how accurate the test is before they use it. This means how well it can tell people who have the condition from people who do not have it. This information can be found in “diagnostic systematic reviews”. Diagnostic systematic reviews start with a research question. They bring together findings from studies that have already been done to try to answer this question. It is important for researchers to check that the studies match the review question. This is called an “applicability assess…
{"title":"Developing a framework for assessing the applicability of the target condition in diagnostic research","authors":"Eve Tomlinson , Jude Holmes , Anne W.S. Rutjes , Clare Davenport , Mariska Leeflang , Bada Yang , Sue Mallett , Penny Whiting","doi":"10.1016/j.jclinepi.2025.112059","DOIUrl":"10.1016/j.jclinepi.2025.112059","url":null,"abstract":"<div><h3>Objectives</h3><div>Assessment of the applicability of primary studies is an essential but often a challenging aspect of systematic reviews of diagnostic test accuracy studies (DTA reviews). We explored review authors’ applicability assessments for the QUADAS-2 reference standard domain within Cochrane DTA reviews. We highlight applicability concerns, identify potential issues with assessment, and develop a framework for assessing the applicability of the target condition as defined by the reference standard.</div></div><div><h3>Study Design and Setting</h3><div>Methodological review. DTA reviews in the Cochrane Library that used QUADAS-2 and judged applicability for the reference standard domain as “high concern” for at least one study were eligible. One reviewer extracted the rationale for the “high concern” and this was checked by a second reviewer. Two reviewers categorized the rationale inductively into themes, and a third reviewer verified these. Discussions regarding the extracted information informed framework development.</div></div><div><h3>Results</h3><div>We identified 50 eligible reviews. Five themes emerged: study uses different reference standard threshold to define the target condition (six reviews), misclassification by the reference standard in the study such that the target condition in the study does not match the review question (11 reviews), reference standard could not be applied to all participants resulting in a different target condition (five reviews), misunderstanding QUADAS-2 applicability (seven reviews), and insufficient information (21 reviews). Our framework for researchers outlines four potential applicability concerns for the assessment of the target condition as defined by the reference standard: different sub-categories of the target condition, different threshold used to define the target condition, reference standard not applied to full study group, and misclassification of the target condition by the reference standard.</div></div><div><h3>Conclusion</h3><div>Clear sources of applicability concerns are identifiable, but several Cochrane review authors struggle to adequately identify and report them. We have developed an applicability framework to guide review authors in their assessment of applicability concerns for the QUADAS reference standard domain.</div></div><div><h3>Plain Language Summary</h3><div>What is the problem? Doctors use tests to help to decide if a person has a certain condition. They want to know how accurate the test is before they use it. This means how well it can tell people who have the condition from people who do not have it. This information can be found in “diagnostic systematic reviews”. Diagnostic systematic reviews start with a research question. They bring together findings from studies that have already been done to try to answer this question. It is important for researchers to check that the studies match the review question. 
This is called an “applicability assess","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112059"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145543777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A scoping review of critical appraisal tools and user guides for systematic reviews with network meta-analysis: methodological gaps and directions for tool development
K.M. Mondragon, C.S. Tan-Lim, R. Velasco Jr., C.P. Cordero, H.M. Strebel, L. Palileo-Villanueva, J.V. Mantaring
Pub Date: 2026-02-01 | Epub Date: 2025-11-20 | DOI: 10.1016/j.jclinepi.2025.112056
Background
Systematic reviews (SRs) with network meta-analyses (NMAs) are increasingly used to inform guidelines, health technology assessments (HTAs), and policy decisions. Their methodological complexity, the difficulty of assessing the exchangeability assumption, and the large volume of results make appraisal more challenging than for SRs with pairwise meta-analyses. Numerous SR- and NMA-specific appraisal tools exist, but they vary in scope, intended users, and methodological guidance, and few have been validated.
Objectives
To identify and describe appraisal instruments and interpretive guides for SRs and NMAs specifically, summarizing their characteristics, domain coverage, development methods, and measurement-property evaluations.
Methods
We conducted a methodological scoping review that included structured appraisal instruments or interpretive guides for SRs, with or without NMA-specific domains, aimed at review authors, clinicians, guideline developers, or HTA assessors, drawn from published or gray literature in English. Searches (inception–August 2025) covered major databases, registries, organizational websites, and reference lists. Two reviewers independently screened records; data were extracted by one and checked by a second. We synthesized the findings narratively. First, we classified tools as either structured instruments or interpretive guides. Second, we grouped them according to their intended audience and scope. Third, we assessed available measurement-property data using relevant COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) items.
Results
Thirty-four articles described 22 instruments (11 NMA-specific, nine specific to SRs with meta-analysis, and two encompassing both). NMA tools added domains such as network geometry, transitivity, and coherence, but guidance on transitivity evaluation, publication bias, and ranking was either limited or ineffective. Reviewer-focused tools were structured with explicit response options, whereas clinician-oriented guides posed appraisal questions with explanations but no prescribed responses. Nine instruments reported measurement-property data, with validity and reliability varying widely.
Conclusion
This first comprehensive map of appraisal resources for SRs with meta-analysis and NMA highlights the need for clearer operational criteria, structured decision rules, and integrated rater training to improve reliability and align foundational SR domains with NMA-specific content.
Plain Language Summary
NMA is a way to compare many treatments at once by combining results from multiple studies—even when some treatments have not been directly compared head-to-head. Because NMAs are complex, users need clear tools to judge whether an analysis is tru…
{"title":"A scoping review of critical appraisal tools and user guides for systematic reviews with network meta-analysis: methodological gaps and directions for tool development","authors":"K.M. Mondragon , C.S. Tan-Lim , R. Velasco Jr. , C.P. Cordero , H.M. Strebel , L. Palileo-Villanueva , J.V. Mantaring","doi":"10.1016/j.jclinepi.2025.112056","DOIUrl":"10.1016/j.jclinepi.2025.112056","url":null,"abstract":"<div><h3>Background</h3><div>Systematic reviews (SRs) with network meta-analyses (NMAs) are increasingly used to inform guidelines, health technology assessments (HTAs), and policy decisions. Their methodological complexity, as well as the difficulty in assessing the exchangeability assumption and the large amount of results, makes appraisal more challenging than for SRs with pairwise NMAs. Numerous SR- and NMA-specific appraisal tools exist, but they vary in scope, intended users, and methodological guidance, and few have been validated.</div></div><div><h3>Objectives</h3><div>To identify and describe appraisal instruments and interpretive guides for SRs and NMAs specifically, summarizing their characteristics, domain coverage, development methods, and measurement-property evaluations.</div></div><div><h3>Methods</h3><div>We conducted a methodological scoping review which included structured appraisal instruments or interpretive guides for SRs with or without NMA-specific domains, aimed at review authors, clinicians, guideline developers, or HTA assessors from published or gray literature in English. Searches (inception–August 2025) covered major databases, registries, organizational websites, and reference lists. Two reviewers independently screened records; data were extracted by one and checked by a second. We synthesized the findings narratively. First, we classified tools as either structured instruments or interpretive guides. Second, we grouped them according to their intended audience and scope. Third, we assessed available measurement-property data using relevant COnsensus-based Standards for the selection of health Measurement INstruments items.</div></div><div><h3>Results</h3><div>Thirty-four articles described 22 instruments (11 NMA-specific, nine systematic reviews with meta-analysis-specific, 2 encompassing both systematic reviews with meta-analysis and NMA). NMA tools added domains such as network geometry, transitivity, and coherence, but guidance on transitivity evaluation, publication bias, and ranking was either limited or ineffective. Reviewer-focused tools were structured with explicit response options, whereas clinician-oriented guides posed appraisal questions with explanations but no prescribed response. Nine instruments reported measurement-property data, with validity and reliability varying widely.</div></div><div><h3>Conclusion</h3><div>This first comprehensive map of systematic reviews with meta-analysis and NMA appraisal resources highlights the need for clearer operational criteria, structured decision rules, and integrated rater training to improve reliability and align foundational SR domains with NMA-specific content.</div></div><div><h3>Plain Language Summary</h3><div>NMA is a way to compare many treatments at once by combining results from multiple studies—even when some treatments have not been directly compared head-to-head. 
Because NMAs are complex, users need clear tools to judge whether an analysis is tru","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112056"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145582728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Omission of main effects from regression models with a ratio variable as the focal exposure can result in bias and inflated type I error rates
Matthew J. Valente, Biwei Cao, Daniëlle D.B. Holthuijsen, Martijn J.L. Bours, Simone J.P.M. Eussen, Matty P. Weijenberg, Judith J.M. Rijnhart
Pub Date: 2026-02-01 | Epub Date: 2025-12-03 | DOI: 10.1016/j.jclinepi.2025.112092
Objectives
Ratio variables (eg, body mass index [BMI], cholesterol ratios, and metabolite ratios) are widely used as exposure variables in epidemiologic studies of cause and effect. Although statisticians have emphasized the importance of including the main effects of the variables that make up a ratio in regression models, main effects are still often omitted in practice. The objective of this study is to demonstrate how omitting main effects from regression models with a ratio variable as the focal exposure affects bias in the effect estimates and type I error rates.
Study Design and Setting
We demonstrated the impact of omitting main effects in three steps. First, we showed the connection between regression models with ratio variables and regression models with product terms, which are well-understood by epidemiologists. Second, we estimated models with and without main effects of a ratio variable using a real-life data example. Third, we performed a simulation study to demonstrate the impact of omitting main effects on bias and type I error rates.
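The following simulation sketch illustrates this logic (a hypothetical example, not the authors' code: the outcome model, coefficients, and variable names are invented). The true model contains no ratio effect, yet the ratio term appears significant when the main effects of its components are omitted:

```python
# Simulation sketch: a ratio exposure X1/X2 looks "significant" when the
# main effects of X1 and X2 are omitted, even though the true outcome
# model contains no ratio effect at all.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000
x1 = rng.normal(10, 2, n)                      # numerator component
x2 = rng.normal(5, 1, n)                       # denominator component
y = 0.5 * x1 - 0.8 * x2 + rng.normal(0, 1, n)  # no true ratio effect
ratio = x1 / x2

# Model A: ratio term only (main effects omitted)
fit_a = sm.OLS(y, sm.add_constant(ratio)).fit()
# Model B: ratio term plus the main effects of both components
fit_b = sm.OLS(y, sm.add_constant(np.column_stack([ratio, x1, x2]))).fit()

print(fit_a.pvalues[1])  # tiny p-value: spurious "effect" of the ratio
print(fit_b.pvalues[1])  # ratio term attenuates once main effects enter
```

Here model A rejects the null for the ratio term almost surely, a direct analogue of the inflated type I error rates the simulation study reports.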
Results
We showed the impact of omitting main effects in regression models with ratio terms. In the real-life example, the ratio term was statistically significantly associated with the outcome only when the main effects were omitted. The simulation study indicated that omitting main effects often leads to biased effect estimates and inflated type I error rates.
Conclusion
Regression models with a ratio term as an exposure variable need to include main effects to avoid bias in the effect estimates and inflated type I error rates.
{"title":"Omission of main effects from regression models with a ratio variable as the focal exposure can result in bias and inflated type I error rates","authors":"Matthew J. Valente , Biwei Cao , Daniëlle D.B. Holthuijsen , Martijn J.L. Bours , Simone J.P.M. Eussen , Matty P. Weijenberg , Judith J.M. Rijnhart","doi":"10.1016/j.jclinepi.2025.112092","DOIUrl":"10.1016/j.jclinepi.2025.112092","url":null,"abstract":"<div><h3>Objectives</h3><div>Ratio variables (eg, body mass index (BMI), cholesterol ratios, and metabolite ratios) are widely used as exposure variables in epidemiologic studies on cause-and-effect. While statisticians have emphasized the importance of including main effects of the variables that make up a ratio variable in regression models, main effects are still often omitted in practice. The objective of this study is to demonstrate the impact of omitting main effects from regression models with a ratio variable as the focal exposure on bias in the effect estimates and type I error rates.</div></div><div><h3>Study Design and Setting</h3><div>We demonstrated the impact of omitting main effects in three steps. First, we showed the connection between regression models with ratio variables and regression models with product terms, which are well-understood by epidemiologists. Second, we estimated models with and without main effects of a ratio variable using a real-life data example. Third, we performed a simulation study to demonstrate the impact of omitting main effects on bias and type I error rates.</div></div><div><h3>Results</h3><div>We showed the impact of omitting main effects in regression models with ratio terms. In the real-life example, the ratio term was only statistically significantly associated with the outcome when omitting main effects. The simulation study results indicated that the omission of main effects often leads to biased effect estimates and inflated type I error rates.</div></div><div><h3>Conclusion</h3><div>Regression models with a ratio term as an exposure variable need to include main effects to avoid bias in the effect estimates and inflated type I error rates.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112092"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145688599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrating and standardizing functioning outcomes in rheumatoid arthritis pharmacological trials: a scoping review informed by the International Classification of Functioning, Disability and Health (ICF)
Adrian Martinez-De la Torre, Polina Leshetkina, Ogie Ahanor, Roxanne Maritz
Pub Date: 2026-02-01 | Epub Date: 2025-12-03 | DOI: 10.1016/j.jclinepi.2025.112093
Background
To examine how functioning-related outcomes in Phase III pharmacological clinical trials for rheumatoid arthritis (RA) align with the International Classification of Functioning, Disability and Health (ICF) Brief Core Set, and to identify which domains of functioning are most frequently represented.
Study Design and Setting
RA is a chronic autoimmune disease and a major cause of disability worldwide. While Phase III randomized controlled trials (RCTs) remain the gold standard for evaluating pharmacological treatments, they often rely on clinical and laboratory endpoints and overlook how therapies affect patients’ functioning. The ICF provides a standardized, patient-centered framework to assess functioning across key domains. A scoping review was conducted in accordance with the JBI methodology for scoping reviews and reported following the PRISMA-ScR guidelines. Literature was searched in MEDLINE, EMBASE, and ClinicalTrials.gov from 2010 to 2025. Phase III RCTs evaluating pharmacological interventions in adult patients with RA were included. Functioning-related outcomes were extracted and mapped to ICF categories using standardized linking rules.
Results
Of 852 records screened, 91 met the inclusion criteria. Functioning was frequently assessed through patient-reported outcomes and composite clinical measures. The most commonly linked ICF categories were related to pain and joint mobility within the body functions domain, walking and carrying out daily activities within the activities and participation domain, and joint structures of the shoulder, upper, and lower limbs within body structures. Despite the broad representation, none of the studies explicitly used the ICF framework.
Conclusion
Functioning is often assessed in RA phase III RCTs, but only implicitly and without reference to the ICF framework. Explicitly integrating the ICF could bring greater standardization, comparability, and patient-centeredness in outcome measurement in pharmacological trials, not only in RA but across chronic conditions.
{"title":"Integrating and standardizing functioning outcomes in rheumatoid arthritis pharmacological trials: a scoping review informed by the International Classification of Functioning, Disability and Health (ICF)","authors":"Adrian Martinez-De la Torre , Polina Leshetkina , Ogie Ahanor , Roxanne Maritz","doi":"10.1016/j.jclinepi.2025.112093","DOIUrl":"10.1016/j.jclinepi.2025.112093","url":null,"abstract":"<div><h3>Background</h3><div>To examine how functioning-related outcomes in Phase III pharmacological clinical trials for rheumatoid arthritis (RA) align with the International Classification of Functioning, Disability and Health (ICF) Brief Core Set, and to identify which domains of functioning are most frequently represented.</div></div><div><h3>Study Design and Setting</h3><div>RA is a chronic autoimmune disease and a major cause of disability worldwide. While Phase III randomized controlled trials (RCTs) remain the gold standard for evaluating pharmacological treatments, they often rely on clinical and laboratory endpoints and overlook how therapies affect patients’ functioning. The International Classification of Functioning, Disability and Health (ICF) provides a standardized, patient-centered framework to assess functioning across key domains. A scoping review was conducted in accordance with the JBI methodology for scoping reviews and reported following PRISMA-ScR guidelines. Literature was searched in MEDLINE, EMBASE, and ClinicalTrials.gov from 2010 to 2025. Phase III RCTs evaluating pharmacological interventions in adult patients with RA were included. Functioning-related outcomes were extracted and mapped to ICF categories using standardized linking rules.</div></div><div><h3>Results</h3><div>Of 852 records screened, 91 met the inclusion criteria. Functioning was frequently assessed through patient-reported outcomes and composite clinical measures. The most commonly linked ICF categories were related to pain and joint mobility within the body functions domain, walking and carrying out daily activities within the activities and participation domain, and joint structures of the shoulder, upper, and lower limbs within body structures. Despite the broad representation, none of the studies explicitly used the ICF framework.</div></div><div><h3>Conclusion</h3><div>Functioning is often assessed in RA phase III RCTs, but only implicitly and without reference to the ICF framework. Explicitly integrating the ICF could bring greater standardization, comparability, and patient-centeredness in outcome measurement in pharmacological trials, not only in RA but across chronic conditions.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112093"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145688644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discrimination, calibration, and variable importance in statistical and machine learning models for predicting overall survival in advanced non–small cell lung cancer patients treated with immune checkpoint inhibitors
Lee X. Li, Ashley M. Hopkins, Richard Woodman, Ahmad Y. Abuhelwa, Yuan Gao, Natalie Parent, Andrew Rowland, Michael J. Sorich
Pub Date: 2026-02-01 | Epub Date: 2025-11-21 | DOI: 10.1016/j.jclinepi.2025.112082
Background and Objectives
Prognostic models can enhance clinician-patient communication and guide treatment decisions. Numerous machine learning (ML) algorithms are available and offer a novel approach to predicting survival in patients treated with immune checkpoint inhibitors. However, their performance—particularly calibration—has not been benchmarked at large scale across multiple independent cohorts. This study aimed to develop, evaluate, and compare statistical and ML models with respect to discrimination, calibration, and variable importance for predicting overall survival across seven clinical trial cohorts of patients with advanced non–small cell lung cancer (NSCLC) undergoing immune checkpoint inhibitor treatment.
Methods
This study included atezolizumab-treated patients with advanced NSCLC from seven clinical trials. We compared two statistical models, the Cox proportional hazards (Coxph) model and the accelerated failure time model, and six ML models: CoxBoost, extreme gradient boosting (XGBoost), gradient-boosting machines (GBMs), random survival forest, regularized Coxph (least absolute shrinkage and selection operator [LASSO]), and support vector machines (SVMs). Models were evaluated on discrimination and calibration using a leave-one-study-out nested cross-validation (nCV) framework. Discrimination was assessed using Harrell's concordance index (C-index), and calibration was assessed using the integrated calibration index (ICI) and calibration plots. Variable importance was assessed using Shapley Additive exPlanations (SHAP) values.
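A minimal sketch of the outer loop of such a leave-one-study-out evaluation (a hypothetical illustration, not the authors' pipeline: the column names "study", "time", and "event" are invented, and the nested inner loop for hyperparameter tuning is omitted), using a Cox model and Harrell's C-index:

```python
# Outer loop of leave-one-study-out cross-validation for a survival model:
# train on all trials except one, then score discrimination on the held-out
# trial with Harrell's C-index.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

def leave_one_study_out_cindex(df: pd.DataFrame) -> dict:
    scores = {}
    for study in df["study"].unique():
        train = df[df["study"] != study].drop(columns="study")
        test = df[df["study"] == study].drop(columns="study")
        cph = CoxPHFitter().fit(train, duration_col="time", event_col="event")
        # lifelines' concordance_index expects higher scores to mean longer
        # survival, so negate the predicted partial hazard.
        risk = cph.predict_partial_hazard(test)
        scores[study] = concordance_index(test["time"], -risk, test["event"])
    return scores
```

Reporting one C-index per held-out trial, as this loop does, is what allows performance to be compared across independent cohorts rather than averaged away.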
Results
In a cohort of 3203 patients, the two statistical models and five of the six ML models demonstrated comparable, moderate discrimination (aggregated C-index: 0.69–0.70), while SVM showed poor discrimination (aggregated C-index: 0.57). Regarding calibration, the models appeared largely comparable in aggregated plots, except for LASSO, although the XGBoost models were numerically best calibrated. Across the evaluation cohorts, individual performance measures varied, and no single model consistently outperformed the others. Pretreatment neutrophil-to-lymphocyte ratio (NLR) and Eastern Cooperative Oncology Group Performance Status (ECOGPS) ranked among the top five most important predictors across all models.
Conclusion
There was no clear best-performing model for either discrimination or calibration, although the XGBoost models showed numerically superior calibration. The performance of a given model varied across the evaluation cohorts, highlighting the importance of assessing models in multiple independent datasets. All models identified pretreatment NLR and ECOGPS as key prognostic factors.
{"title":"Discrimination, calibration, and variable importance in statistical and machine learning models for predicting overall survival in advanced non–small cell lung cancer patients treated with immune checkpoint inhibitors","authors":"Lee X. Li , Ashley M. Hopkins , Richard Woodman , Ahmad Y. Abuhelwa , Yuan Gao , Natalie Parent , Andrew Rowland , Michael J. Sorich","doi":"10.1016/j.jclinepi.2025.112082","DOIUrl":"10.1016/j.jclinepi.2025.112082","url":null,"abstract":"<div><h3>Background and Objectives</h3><div>Prognostic models can enhance clinician-patient communication and guide treatment decisions. Numerous machine learning (ML) algorithms are available and offer a novel approach to predicting survival in patients treated with immune checkpoint inhibitors. However, large-scale benchmarking of their performances—particularly in terms of calibration—has not been evaluated across multiple independent cohorts. This study aimed to develop, evaluate, and compare statistical and ML models regarding discrimination, calibration, and variable importance for predicting overall survival across seven clinical trial cohorts of advanced non–small cell lung cancer (NSCLC) undergoing immune checkpoint inhibitor treatment.</div></div><div><h3>Methods</h3><div>This study included atezolizumab-treated patients with advanced NSCLC from seven clinical trials. We compared two statistical models: Cox proportional-hazard (Coxph) and accelerated failure time models, and 6 ML models: CoxBoost, extreme gradient-boosting (XGBoost), gradient-boosting machines (GBMs), random survival forest, regularized Coxph models (least absolute shrinkage and selection operator [LASSO]), and support vector machines (SVMs). Models were evaluated on discrimination and calibration using a leave-one-study-out nested cross-validation (nCV) framework. Discrimination was assessed using Harrell's concordance index (Cindex), while calibration was assessed using integrated calibration index (ICI) and plot. Variable importance was assessed using Shapley Additive exPlanations (SHAP) values.</div></div><div><h3>Results</h3><div>In a cohort of 3203 patients, the two statistical models and 5 of the 6 ML models demonstrated comparable and moderate discrimination performances (aggregated Cindex: 0.69–0.70), while SVM exhibited poor discrimination (aggregated Cindex: 0.57). Regarding calibration, the models appeared largely comparable in aggregated plots, except for LASSO, although the XGBoost models demonstrated superior calibration numerically. Across the evaluation cohorts, individual performance measures varied and no single model consistently outperforming the others. Pretreatment neutrophil-to-lymphocyte ratios (NLRs) and Eastern Cooperative Oncology Group Performance Status (ECOGPS) were ranked among the top five most important predictors across all models.</div></div><div><h3>Conclusion</h3><div>There was no clear best-performing model for either discrimination or calibration, although XGBoost models showed possible superior calibration numerically. Performance of a given model varied across evaluation cohorts, highlighting the importance of model assessment using multiple independent datasets. 
All models identified pretreatment NLR and ECOGPS as the key prognostic factors.</div></div>","PeriodicalId":51079,"journal":{"name":"Journal of Clinical Epidemiology","volume":"190 ","pages":"Article 112082"},"PeriodicalIF":5.2,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145589810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}