Pub Date: 2025-12-18 · DOI: 10.1016/j.jclinepi.2025.112109
Title: Paper 1: a semi-automated approach facilitated the assessment of the certainty of evidence in a network meta-analysis: part 1 – Direct comparisons
Mohammed Mujaab Kamso , Samuel L. Whittle , Jordi Pardo Pardo , Rachelle Buchbinder , George Wells , Rob Deardon , Tolulope Sajobi , George Tomlinson , Jesse Elliott , Jocelyn Thomas , Shannon E. Kelly , Romina Brignardello-Petersen , Glen S. Hazlewood
Objectives
To implement and evaluate a semi-automated approach to facilitate rating the Grading of Recommendations Assessment, Development and Evaluation (GRADE) certainty of evidence (CoE) for direct comparisons within two living network meta-analyses.
Methods
For each of three GRADE domains (study limitations, indirectness, and inconsistency), decision rules were developed and used to generate automated judgments for each domain and the overall certainty. Inputs included risk of bias and indirectness ratings for each study and measures of heterogeneity. Indirectness ratings were made by two independent reviewers and resolved through consensus. With the help of an online tool (customized to our project), two independent raters viewed forest plots and additional data and could confirm or modify the suggested rating. Disagreements were resolved by consensus. We evaluated inter-rater reliability and accuracy.
Results
Across 374 direct comparisons, there was perfect agreement (100%) between the automated judgment and reviewer consensus when only a single study was available (n = 292), and near-perfect agreement when more than one study was available (99%–100% for the three GRADE domains and 96% for the overall rating). Inter-rater reliability was near perfect (Gwet's AC1 agreement coefficient ranging from 96% to 100%).
Conclusion
Automated judgments using established decision rules agreed with expert judgment for the vast majority of GRADE CoE ratings.
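The decision rules themselves are not reproduced in the abstract, but the inputs described (study-level risk-of-bias and indirectness ratings plus a heterogeneity measure) map naturally onto simple look-up logic. Below is a minimal Python sketch of what such rules could look like; the thresholds and downgrade steps are illustrative assumptions, not the authors' published rules:

```python
# Illustrative sketch of automated GRADE domain judgments from study-level
# inputs. The specific rules and thresholds below are assumptions for
# demonstration; the paper's actual decision rules are not reproduced here.

def rate_study_limitations(rob_ratings):
    """Downgrade based on the risk-of-bias ratings of contributing studies."""
    if all(r == "low" for r in rob_ratings):
        return 0                       # no downgrade
    if any(r == "high" for r in rob_ratings):
        return 2                       # downgrade two levels (assumed rule)
    return 1                           # some concerns -> downgrade one level

def rate_inconsistency(i_squared, n_studies):
    """Single studies cannot be rated down for inconsistency."""
    if n_studies < 2:
        return 0
    return 1 if i_squared > 50 else 0  # assumed I^2 threshold

def rate_indirectness(indirectness_ratings):
    return 0 if all(r == "direct" for r in indirectness_ratings) else 1

def overall_certainty(downgrades, start="high"):
    levels = ["very low", "low", "moderate", "high"]
    idx = max(levels.index(start) - sum(downgrades), 0)
    return levels[idx]

# Example: two studies, one with some risk-of-bias concerns, low heterogeneity.
d = [rate_study_limitations(["low", "some concerns"]),
     rate_inconsistency(i_squared=12.0, n_studies=2),
     rate_indirectness(["direct", "direct"])]
print(overall_certainty(d))            # -> "moderate"
```

In the paper's workflow these automated suggestions were not final: two independent raters reviewed each one against the forest plots and could confirm or modify it.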
Pub Date: 2025-12-17 · DOI: 10.1016/j.jclinepi.2025.112111
Title: The opacity and exemption of artificial intelligence or the epic of explainable artificial intelligence, reply to commentary by Rattanapitoon et al
Manuel Marques-Cruz, Rafael José Vieira, Sara Gil Mata, Bernardo Sousa-Pinto
Pub Date: 2025-12-16 · DOI: 10.1016/j.jclinepi.2025.112112
Title: Health experiences and inequalities across intersecting social identities in health research: a scoping review
Azar Alexander-Sefre , Frances Sherratt , Heidi Green , Shaun Treweek , Victoria Shepherd
Background and Objective
Intersectionality provides a framework for critical thinking about how sociodemographic factors interact. There is currently limited evidence on whether the overlap of multiple sociodemographic identities typically associated with underrepresentation and being underserved in research (eg, minority ethnicity, lower socioeconomic status (SES)) affects health conditions and outcomes. Given the essential role of clinical trials in developing effective treatments, this gap makes it challenging to determine whether intersectionality should be considered in trials. This scoping review aimed to map the existing literature on the impact of intersectionality on health experiences and inequalities in developed economies, and to identify whether the overlap of two or more sociodemographic factors (eg, race/ethnicity, sex/gender, and/or SES) is associated with poorer health.
Methods
Following the Arksey and O'Malley framework and Joanna Briggs Institute methodology, the review adhered to the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines. Databases searched included Medline, Embase, Web of Science, International Bibliography of the Social Sciences, and Sociological Abstracts. Selection criteria were based on the Population–Concept–Context mnemonic, targeting studies that explicitly referenced intersecting sociodemographic factors and their impact on health experiences. Data were extracted from the Discussion sections of the included studies, specifically any reports of the effects of intersectional sociodemographic factors, such as ethnicity, sex, gender, and SES, on health conditions and outcomes.
Results
Thirty-three studies met the inclusion criteria. The review found that people who belong to more than one sociodemographic group typically underserved in research (eg, minoritized ethnicity combined with experience of socioeconomic disadvantage) tend to have poorer health. The review also found that context is an important component, with some traditionally more privileged groups (eg, White, male, and from a high socioeconomic background) having relatively poorer health outcomes depending on the context.
Conclusion
Overall, holding intersectional underserved identities is likely to lead to poorer health; however, there is no simple relationship, and context plays a role. These findings emphasize the need for inclusive clinical trials that account for intersectionality and for research designs that reflect diverse populations.
Pub Date: 2025-12-16 · DOI: 10.1016/j.jclinepi.2025.112113
Title: Beautiful weights, misinterpreted effects: the use and misuse of overlap weighting in major medical journals, 2020–2025
John G. Rizk , Giuseppe Lippi , Carl J. Lavie
Objectives
To evaluate the implementation and reporting practices of overlap weighting in major medical journals.
Study Design and Setting
We reviewed observational studies published from January 2020 to September 2025 in five major medical journals (Annals of Internal Medicine, The British Medical Journal [BMJ], Journal of the American Medical Association [JAMA], JAMA Internal Medicine, and The New England Journal of Medicine [NEJM]) that used overlap weighting as a primary or sensitivity adjustment method. Reporting quality was assessed for estimand specification, definition of the overlap population, justification of the method, acknowledgment of advantages, and discussion of interpretability.
Results
Seventeen eligible studies were identified. Four studies (24%) correctly named the estimand as the average treatment effect in the overlap population (ATO); two (12%) misreported it as the average treatment effect (ATE), and the remainder did not specify an estimand. Ten studies (59%) described the overlap population at least partially. Sixteen studies (94%) highlighted at least one statistical advantage of overlap weighting, yet none acknowledged that results apply only to the overlap population. These results point to a notable gap in estimand reporting.
Conclusion
Clearer specification of the estimand and its target population is essential to prevent misinterpretation. Strengthening reporting standards will support more transparent and appropriate use of overlap weighting in medical research.
Plain Language Summary
This study examined how major medical journals (2020–2025) report studies using overlap weighting, a method that focuses on patients who could receive either treatment. Most studies acknowledged advantages of using overlap weighting but did not clearly state that results apply only to this “overlap” group. Clear reporting of the target population and effect estimate is needed to avoid misleading interpretations.
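For readers unfamiliar with the method: overlap weighting assigns each treated patient a weight of one minus the estimated propensity score and each control a weight equal to the propensity score, so the estimate targets the ATO rather than the ATE. A minimal sketch on simulated data; the covariates, propensity model, and effect size are illustrative assumptions, not drawn from any reviewed study:

```python
# Minimal illustration of overlap weighting: treated units are weighted by
# 1 - e(x), controls by e(x), where e(x) is the estimated propensity score.
# Data and model are simulated for demonstration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=(n, 3))                        # baseline covariates
ps_true = 1 / (1 + np.exp(-x @ np.array([0.8, -0.5, 0.3])))
z = rng.binomial(1, ps_true)                       # treatment assignment
y = 1.0 * z + x @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=n)

e = LogisticRegression().fit(x, z).predict_proba(x)[:, 1]
w = np.where(z == 1, 1 - e, e)                     # overlap weights

# Weighted difference in means estimates the ATO, i.e., the effect in the
# subpopulation with clinical equipoise (propensities near 0.5).
ato = (np.average(y[z == 1], weights=w[z == 1])
       - np.average(y[z == 0], weights=w[z == 0]))
print(f"estimated ATO: {ato:.2f}")                 # close to the true effect 1.0
```

Because the simulated effect is homogeneous, the ATO and ATE coincide here; in real data with treatment-effect heterogeneity they generally do not, which is exactly the interpretability point this review raises.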
Pub Date: 2025-12-16 · DOI: 10.1016/j.jclinepi.2025.112110
Title: Paper 2: a semi-automated approach facilitated the assessment of the certainty of evidence in a network meta-analysis: part 2 – indirect and mixed comparisons
Mohammed Mujaab Kamso , Samuel L. Whittle , Jordi Pardo Pardo , Rachelle Buchbinder , George Wells , Rob Deardon , Tolulope Sajobi , George Tomlinson , Jesse Elliott , Jocelyn Thomas , Shannon E. Kelly , Romina Brignardello-Petersen , Glen S. Hazlewood
Objectives
To implement a semi-automated approach to facilitate rating the Grading of Recommendations Assessment, Development and Evaluation (GRADE) certainty of evidence (CoE) for indirect and network meta-analysis (NMA) estimates.
Methods
We developed and implemented algorithms for generating automated ratings for the CoE for indirect and network estimates in two living NMAs of rheumatoid arthritis treatment. At the indirect stage, inputs included CoE ratings for direct estimates and the contribution matrix. Intransitivity ratings were assigned based on the indirectness ratings of the two direct estimates with the highest percent contribution. An online tool (customized to our project) facilitated assessment of imprecision of the network estimate. Automated ratings were reviewed by two independent experts.
Results
Across 1306 indirect comparisons, the contribution matrix identified the dominant branches of evidence regardless of whether a single first-order loop was present (80%) or not. The reviewers agreed with all automated CoE ratings for incoherence (n = 34), network estimates (n = 34), and imprecision (n = 1447). They agreed with the automated intransitivity algorithm except when the total contribution of the top-two direct estimates was low (eg, <50%, which occurred in 38% of the estimates).
Conclusion
Automated approaches facilitated CoE ratings for indirect and network estimates. Further work is required to define appropriate algorithms for intransitivity.
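The intransitivity step lends itself to a compact sketch: for each network estimate, identify the two direct comparisons with the highest percentage contribution and base the rating on their indirectness ratings, flagging cases where those two contributions sum to less than half of the evidence. The rule for combining the two ratings (taking the worse one) is an assumption for illustration; the abstract states only that ratings were based on the top-two contributors, and the <50% flag mirrors the situations where reviewers disagreed with the algorithm:

```python
# Sketch of the intransitivity step: pick the two direct comparisons that
# contribute most to a network estimate and base the rating on their
# indirectness ratings. The combining rule below is an assumption.

ORDER = {"no serious": 0, "serious": 1, "very serious": 2}

def intransitivity(contributions, indirectness, low_contribution=0.5):
    """contributions: {direct_comparison: fractional contribution (0-1)}
    indirectness: {direct_comparison: rating string}"""
    top_two = sorted(contributions, key=contributions.get, reverse=True)[:2]
    rating = max((indirectness[c] for c in top_two), key=ORDER.get)
    # Flag for manual review when the top-two contribution is low, the
    # situation in which reviewers disagreed with the automated rating.
    needs_review = sum(contributions[c] for c in top_two) < low_contribution
    return rating, needs_review

contrib = {"A-B": 0.45, "A-C": 0.35, "B-C": 0.20}
indir = {"A-B": "no serious", "A-C": "serious", "B-C": "no serious"}
print(intransitivity(contrib, indir))   # -> ('serious', False)
```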
Pub Date: 2025-12-15 · DOI: 10.1016/j.jclinepi.2025.112102
Title: Comparison of AI-assisted and human-generated plain language summaries for Cochrane reviews: a randomised non-inferiority trial (HIET-1) [Registered Report - stage II]
Declan Devane , Johanna Pope , Paula Byrne , Evan Forde , Isabel O'Byrne , Steven Woloshin , Eileen Culloty , Darren Dahly , Ingeborg Hess Elgersma , Heather Munthe-Kaas , Conor Judge , Martin O'Donnell , Finn Krewer , Sandra Galvin , Nikita N. Burke , Theresa Tierney , KM Saif-Ur-Rahman , Tom Conway , James Thomas
Objectives
To compare the comprehension, readability, quality, safety, and trustworthiness of artificial intelligence (AI)-assisted vs human-generated plain language summaries (PLSs) for Cochrane systematic reviews.
Study Design
Randomized, parallel-group, two-arm, noninferiority trial (ISRCTN85699985).
Setting
Online survey platform, September 2025.
Participants
Adults aged 18 years or older with a minimum English reading proficiency of 7 out of 10, recruited via Prolific. Of the 500 individuals screened, 465 were randomized and 453 were included in the per-protocol analysis.
Interventions
Participants were randomly assigned to three AI-assisted PLSs developed with ChatGPT and human-in-the-loop verification, or to three published human-generated Cochrane PLSs for the same reviews.
Outcomes
Primary: comprehension (10-item questionnaire, noninferiority margin 10%). Secondary: readability, quality and safety, trustworthiness, and authorship perception.
Results
Mean comprehension scores were 88.9% (n = 228) in the AI-assisted group and 89.0% (n = 225) in the human-generated group (mean difference −0.03 percentage points, 95% CI: −1.9% to 2.0%); the upper CI bound (2.0 percentage points) did not exceed the +10 percentage-point noninferiority margin, demonstrating noninferiority. Flesch-Kincaid Grade Level showed no significant difference (8.20 vs 8.38, P = .722), although formal noninferiority was missed (the upper 95% CI bound of 1.72 exceeded the 1.0 grade-level margin). AI-assisted summaries scored higher on Flesch Reading Ease (63.33 vs 50.00, P = .008) and lower on the Coleman-Liau Index. All summaries met prespecified quality and safety standards (100% in both groups). Trustworthiness scores were comparable (3.98 vs 3.91, difference 0.068, 95% CI: −0.043 to 0.179; meeting noninferiority). Participants demonstrated limited ability to distinguish authorship, correctly identifying AI-assisted summaries in 56.3% of cases and human-generated summaries in 34.7% (approximately chance level for a three-option question), with 55.4% of human-generated summaries misattributed as AI-assisted. Exploratory subgroup analysis showed an age interaction (P = .023), though based on a small subgroup (n = 14, 3%).
Conclusion
AI-assisted PLSs with human oversight achieved comprehension levels noninferior to those of human-generated Cochrane summaries, with comparable quality, safety, and trust ratings. AI summaries were largely indistinguishable from those generated by humans. Pretrial verification identified and corrected numerical errors, confirming the need for human oversight. These findings support human-in-the-loop AI workflows for PLS production, though formal evaluation of the time and resource implications is needed to establish efficiency gains over traditional manual methods.
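The noninferiority logic reported above reduces to comparing one bound of the confidence interval for the between-group difference against the prespecified margin. A small sketch reproducing that comparison with the summary figures from the abstract (the helper function is illustrative, not the trial's analysis code):

```python
# Noninferiority here reduces to one comparison: the upper bound of the 95% CI
# for the between-group difference (oriented so that larger values disfavor
# the AI arm, per the abstract's convention) must not exceed the margin.

def noninferior(ci_upper, margin):
    """True if the worst plausible deficit stays within the margin."""
    return ci_upper <= margin

# Comprehension: difference -0.03 pp, 95% CI -1.9 to 2.0 pp, margin +10 pp.
print(noninferior(ci_upper=2.0, margin=10.0))   # True -> noninferiority shown
# Flesch-Kincaid grade level: upper 95% CI bound 1.72, margin 1.0 grade level.
print(noninferior(ci_upper=1.72, margin=1.0))   # False -> noninferiority missed
```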
Pub Date: 2025-12-13 · DOI: 10.1016/j.jclinepi.2025.112101
Title: Empirical simulation of internal validation methods for prediction models: comparing k-fold cross-validation with bootstrap-based optimism correction
Chao Zhang , Ruohua Yan , Xiaohang Liu, Xiaolu Nie, Yaguang Peng, Xiaoxia Peng
Background and Objective
To systematically evaluate the performance of k-fold cross-validation and bootstrap-based optimism correction methods for internal validation of statistical and machine learning models.
Methods
A total of 239,415 inpatients were extracted from the open-access Medical Information Mart for Intensive Care IV database, of which 39,145 were randomly sampled as a predefined reference dataset. From the remaining simulation dataset of 200,000 inpatients, training sets with sample sizes ranging from 595 to 5946 were randomly selected, and multiple prediction models were developed in each training set using various modeling strategies, including logistic regression, least absolute shrinkage and selection operator regression, Naive Bayes, Support Vector Machine, K-Nearest Neighbors, Light Gradient Boosting Machine, and Random Forest. The dependent variable was acute kidney injury (AKI), a binary outcome with an incidence of 18.5%, and the independent variables included 22 common predictors of AKI. For each model, 2-fold, 5-fold, and 10-fold cross-validation were used for internal validation to calculate the area under the receiver-operating characteristic curve (AUC), a common metric for quantifying a model's overall ability to discriminate between positive and negative classifications. In addition, the Harrell, .632, and .632+ AUC estimators were calculated for internal validation based on bootstrapping. This simulation process was repeated 1000 times to obtain 1000 AUC estimates for each internal validation method of each model. Model performance was simultaneously evaluated in the reference dataset to obtain an empirical AUC (analogous to a “gold standard”). The accuracy of the internal validation methods for the different models was then assessed by comparing the 1000 AUC estimates with the empirical AUC.
Results
For parametric models, the .632+ estimator provided the most accurate estimates of AUC, followed by 10-fold cross-validation with only slight bias. In contrast, for nonparametric models, all bootstrap-based optimism correction methods significantly overestimated AUC, and the overestimation was not reduced by increasing the sample size. Most strikingly, 10-fold cross-validation demonstrated stable and good performance across all scenarios considered, regardless of the modeling strategy or sample size.
Conclusion
The performance of bootstrap-based optimism correction methods can be affected by model complexity, although the .632+ estimator performs best in parametric models trained on small samples. In comparison, 10-fold cross-validation is more robust and easier to implement. It is therefore recommended to prioritize 10-fold cross-validation as the internal validation method for prediction models.
Plain Language Summary
With the exponen…
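Returning to the methods compared above: the two families of internal validation are easy to confuse, so here is a compact side-by-side sketch of Harrell's bootstrap optimism correction and 10-fold cross-validation for a logistic model's AUC. The synthetic data echo the study's setup (22 predictors, roughly 18.5% event rate, a sample size within the study's training range), but the data generator, bootstrap count, and model are otherwise illustrative assumptions; the .632 and .632+ variants reweight the same bootstrap ingredients:

```python
# Harrell's bootstrap optimism correction vs 10-fold cross-validation for a
# logistic model's AUC, on synthetic data (for illustration only; the study
# used MIMIC-IV inpatients).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

X, y = make_classification(n_samples=600, n_features=22, weights=[0.815],
                           random_state=0)   # ~18.5% events, 22 predictors

model = LogisticRegression(max_iter=1000)
apparent = roc_auc_score(y, model.fit(X, y).predict_proba(X)[:, 1])

# Harrell's method: optimism = mean over bootstrap replicates of
# (AUC on the bootstrap sample) - (same model's AUC on the original data).
optimisms = []
for b in range(200):
    Xb, yb = resample(X, y, random_state=b)          # sample with replacement
    fit = LogisticRegression(max_iter=1000).fit(Xb, yb)
    auc_boot = roc_auc_score(yb, fit.predict_proba(Xb)[:, 1])
    auc_orig = roc_auc_score(y, fit.predict_proba(X)[:, 1])
    optimisms.append(auc_boot - auc_orig)
corrected = apparent - np.mean(optimisms)

cv_auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
print(f"apparent {apparent:.3f}  optimism-corrected {corrected:.3f}  "
      f"10-fold CV {cv_auc:.3f}")
```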
Pub Date: 2025-12-13 · DOI: 10.1016/j.jclinepi.2025.112108
Title: Comment on “Most methodological characteristics do not exaggerate effect estimates in nutrition RCTs: findings from a metaepidemiological study”
Shyam Sundar Sah, Abhishek Kumbhalwar
Pub Date: 2025-12-13 · DOI: 10.1016/j.jclinepi.2025.112106
Title: Frameworks for assessing diagnostic interventions are useful for HTA work, but context-dependent
Ulrike Paschen, Stefan Sauerland
Pub Date: 2025-12-13 · DOI: 10.1016/j.jclinepi.2025.112107
Title: Comment on “In humble defense of unexplainable black box prediction models in healthcare”
Werner Vach