Can ChatGPT evaluate research quality?

Impact Factor 1.5 | CAS Zone 3 (Management Science) | JCR Q2, Information Science & Library Science | Journal of Data and Information Science | Pub Date: 2024-04-30 | DOI: 10.2478/jdis-2024-0013
Mike Thelwall
{"title":"Can ChatGPT evaluate research quality?","authors":"Mike Thelwall","doi":"10.2478/jdis-2024-0013","DOIUrl":null,"url":null,"abstract":"Purpose Assess whether ChatGPT 4.0 is accurate enough to perform research evaluations on journal articles to automate this time-consuming task. Design/methodology/approach Test the extent to which ChatGPT-4 can assess the quality of journal articles using a case study of the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements. Findings ChatGPT-4 can produce plausible document summaries and quality evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores of the same documents (averaging r=0.281 over 15 iterations, with 8 being statistically significantly different from 0). In contrast, the average scores from the 15 iterations produced a statistically significant positive correlation of 0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than individual scores. The positive correlation may be due to ChatGPT being able to extract the author’s significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, then the correlation with average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations. Research limitations The data is self-evaluations of a convenience sample of articles from one academic in one field. Practical implications Overall, ChatGPT does not yet seem to be accurate enough to be trusted for any formal or informal research quality evaluation tasks. Research evaluators, including journal editors, should therefore take steps to control its use. Originality/value This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.","PeriodicalId":44622,"journal":{"name":"Journal of Data and Information Science","volume":null,"pages":null},"PeriodicalIF":1.5000,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Science","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.2478/jdis-2024-0013","RegionNum":3,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"INFORMATION SCIENCE & LIBRARY SCIENCE","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: To assess whether ChatGPT 4.0 is accurate enough at evaluating journal articles to automate this time-consuming research evaluation task.

Design/methodology/approach: The extent to which ChatGPT-4 can assess the quality of journal articles was tested in a case study that used the published scoring guidelines of the UK Research Excellence Framework (REF) 2021 to create a research-evaluation ChatGPT. This was applied to 51 of my own articles and compared against my own quality judgements.

Findings: ChatGPT-4 can produce plausible document summaries and quality-evaluation rationales that match the REF criteria. Its overall scores have weak correlations with my self-evaluation scores for the same documents (averaging r=0.281 over 15 iterations, with 8 of the correlations statistically significantly different from 0). In contrast, the scores averaged across the 15 iterations produced a statistically significant positive correlation of r=0.509. Thus, averaging scores from multiple ChatGPT-4 rounds seems more effective than using individual scores. The positive correlation may be due to ChatGPT being able to extract the author's significance, rigour, and originality claims from inside each paper. If my weakest articles are removed, the correlation with the average scores (r=0.200) falls below statistical significance, suggesting that ChatGPT struggles to make fine-grained evaluations.

Research limitations: The data consists of self-evaluations of a convenience sample of articles from one academic in one field.

Practical implications: Overall, ChatGPT does not yet seem accurate enough to be trusted for any formal or informal research quality evaluation task. Research evaluators, including journal editors, should therefore take steps to control its use.

Originality/value: This is the first published attempt at post-publication expert review accuracy testing for ChatGPT.
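To make the study's averaging design concrete, here is a minimal Python sketch of the kind of analysis the abstract describes: correlating each individual ChatGPT-4 scoring round with an author's self-evaluation scores, then correlating the per-article averages taken across rounds. This is an illustration only; all scores below are invented placeholders, not data from the article.

```python
# Illustrative sketch (not the paper's code): compare single scoring rounds
# against the cross-round average, as in the study's design. Scores are
# hypothetical REF-style quality ratings (1*-4*) for five articles.
from statistics import mean
from scipy.stats import pearsonr

self_scores = [4.0, 3.0, 2.5, 3.5, 2.0]   # author's own quality judgements
chatgpt_rounds = [                          # one list per ChatGPT-4 iteration
    [3.0, 3.0, 3.0, 4.0, 2.0],
    [4.0, 2.0, 3.0, 3.0, 3.0],
    [3.0, 3.0, 2.0, 4.0, 2.0],
]

# Correlation of each individual round with the self-evaluations.
for i, round_scores in enumerate(chatgpt_rounds, start=1):
    r, p = pearsonr(round_scores, self_scores)
    print(f"round {i}: r={r:.3f} (p={p:.3f})")

# Average each article's score across rounds before correlating; the study
# reports that this average tracked self-evaluations better than any single
# round did.
avg_scores = [mean(article) for article in zip(*chatgpt_rounds)]
r, p = pearsonr(avg_scores, self_scores)
print(f"averaged over rounds: r={r:.3f} (p={p:.3f})")
```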
Source journal

Journal of Data and Information Science (Information Science & Library Science)
CiteScore: 3.50
Self-citation rate: 6.70%
Articles published: 495
Journal introduction: JDIS is devoted to the study and application of theories, methods, techniques, services, and infrastructure that use big data to support knowledge discovery for decision and policy making. The basic emphasis is big-data based, analytics centred, knowledge-discovery driven, and decision-making supporting. Special effort goes to knowledge discovery that detects and predicts structures, trends, behaviours, relations, evolutions, and disruptions in research, innovation, business, politics, security, media and communications, and social development, where the big data may include metadata or full-content data, textual or non-textual data, structured or unstructured data, domain-specific or cross-domain data, and dynamic or interactive data. The main areas of interest are:
(1) New theories, methods, and techniques of big-data-based data mining, knowledge discovery, and informatics, including but not limited to scientometrics, communication analysis, social network analysis, tech & industry analysis, competitive intelligence, knowledge mapping, evidence-based policy analysis, and predictive analysis.
(2) New methods, architectures, and facilities to develop or improve knowledge infrastructure capable of supporting knowledge organization and sophisticated analytics, including but not limited to ontology construction, knowledge organization, semantic linked data, knowledge integration and fusion, semantic retrieval, domain-specific knowledge infrastructure, and semantic sciences.
(3) New mechanisms, methods, and tools to embed knowledge analytics and knowledge discovery into actual operation, service, or managerial processes, including but not limited to knowledge-assisted scientific discovery and data-mining-driven intelligent workflows in learning, communications, and management.
Specific topic areas may include:
- Knowledge organization
- Knowledge discovery and data mining
- Knowledge integration and fusion
- Semantic Web metrics
- Scientometrics
- Analytic and diagnostic informetrics
- Competitive intelligence
- Predictive analysis
- Social network analysis and metrics
- Semantic and interactively analytic retrieval
- Evidence-based policy analysis
- Intelligent knowledge production
- Knowledge-driven workflow management and decision-making
- Knowledge-driven collaboration and its management
- Domain knowledge infrastructure with knowledge fusion and analytics
- Development of data and information services