Careful evaluation of research methodology is fundamental to scientific progress but places a significant burden on human experts. The complexity of functional MRI (fMRI) methods makes transparent reporting, as recommended by the OHBM COBIDAS guidelines, particularly critical. Large Language Models (LLMs) present a potential solution for rapid, scalable methodological assessment. We evaluated three state-of-the-art LLMs (Gemini 2.5 Pro, Claude 4 Sonnet, ChatGPT-o3-pro) against human expert ratings. Fifty fMRI articles (published between 2016 and 2025) were independently evaluated by ten human experts and three LLMs using an 82-item COBIDAS-based rubric. Human raters demonstrated excellent inter-rater reliability (ICC = 0.801), while LLMs showed poor internal agreement (ICC = 0.254). When comparing total scores across papers, Gemini showed a strong positive correlation with the human consensus (r = 0.693, p < 0.0001), Claude showed a moderate positive correlation (r = 0.394, p = 0.004), and ChatGPT showed a negative correlation (r = −0.172, p = 0.233). Gemini maintained high reliability when added to the human raters (combined ICC = 0.811), achieving 85.3 % exact agreement and 98.8 % within-1-point agreement. Domain-specific analysis revealed Gemini's consistently high agreement across all six COBIDAS sections (e.g., experimental design: 0.915; statistical modeling: 0.880), whereas ChatGPT and Claude showed weaker, more variable performance. Marked differences emerged in the identification of non-applicable items: humans marked 40.5 % of items as not applicable versus 32.3 % for Gemini, 9.2 % for ChatGPT, and 21.1 % for Claude. ChatGPT exhibited extreme score volatility, with paper totals ranging from 0 to 121 points compared with the human range of 44.2–77.7. LLM scoring required 1–7 min versus 30–35 min for humans. This proof-of-concept study demonstrates that LLM-assisted methodological evaluation is feasible for complex neuroimaging research and could plausibly be extended to other research fields.
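The agreement statistics reported above (intraclass correlation among raters and Pearson correlations between each LLM and the human consensus) could be reproduced with standard tools. The following is a minimal sketch, not the authors' actual pipeline: the file name `ratings.csv`, the column names, the rater labels, and the choice of ICC form (ICC2k) are all illustrative assumptions.

```python
# Hypothetical sketch of the agreement analysis, assuming a long-format table
# ratings.csv with columns: paper_id, rater, total_score (one row per paper-rater pair).
import pandas as pd
import pingouin as pg
from scipy.stats import pearsonr

df = pd.read_csv("ratings.csv")

# Inter-rater reliability among the ten human experts
# (ICC form is an assumption; the abstract does not specify which was used).
humans = df[df["rater"].str.startswith("human")]
icc_humans = pg.intraclass_corr(
    data=humans, targets="paper_id", raters="rater", ratings="total_score"
)
print(icc_humans[icc_humans["Type"] == "ICC2k"])

# Pearson correlation of one LLM's total scores against the human consensus,
# taken here as the mean of the ten human totals per paper.
consensus = humans.groupby("paper_id")["total_score"].mean()
gemini = (
    df[df["rater"] == "gemini"]
    .set_index("paper_id")["total_score"]
    .reindex(consensus.index)
)
r, p = pearsonr(gemini, consensus)
print(f"Gemini vs. human consensus: r = {r:.3f}, p = {p:.4f}")
```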
