Grading exams using large language models: A comparison between human and AI grading of exams in higher education using ChatGPT

British Educational Research Journal · Impact factor: 2.1 · Zone 3 (Education) · JCR Q1 (Education & Educational Research) · Pub Date: 2024-09-16 · DOI: 10.1002/berj.4069

Jonas Flodén

Published in British Educational Research Journal, vol. 51, no. 1, pp. 201-224. Full text: https://bera-journals.onlinelibrary.wiley.com/doi/10.1002/berj.4069

Citations: 0

Abstract

This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three Master's-level exams were scored using ChatGPT 3.5, and the results were compared with the teachers' scoring and the grading teachers were interviewed. In total, 463 exam responses were graded. With each response being graded at least three times, a total of 1389 gradings were conducted. For the final exam scores, 70% of ChatGPT's gradings were within 10% of the teachers' gradings and 31% within 5%. ChatGPT tended to give marginally higher scores. The agreement on grades is 30%, but 45% of the exams received an adjacent grade. On individual questions, ChatGPT is more inclined to avoid very high or very low scores. ChatGPT struggles to correctly score questions closely related to the course lectures but performs better on more general questions. The AI can generate plausible scores on university exams that, at first glance, look similar to a human grader. There are differences but it is not unlikely that two different human graders could result in similar discrepancies. During the interviews, teachers expressed their surprise at how well ChatGPT's grading matched their own. Increased use of AI can lead to ethical challenges as exams are entrusted to a machine whose decision-making criteria are not fully understood, especially concerning potential bias in training data.
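The agreement figures quoted in the abstract (share of final scores within 10% or 5% of the teacher's score, exact grade agreement, adjacent-grade agreement) can be illustrated with a minimal Python sketch. It is not the author's analysis code. It assumes that "within 10%" means within 10 percentage points of the maximum exam score, and it uses a hypothetical six-step grade scale; the abstract specifies neither. The function names and the five example exams are made up for illustration and are not data from the study.

```python
from typing import Sequence, Tuple

# Hypothetical ordered grade scale, lowest to highest (an assumption, not from the study).
GRADES = ["F", "E", "D", "C", "B", "A"]


def share_within(teacher: Sequence[float], ai: Sequence[float],
                 max_score: float, tolerance_pct: float) -> float:
    """Fraction of exams whose AI score lies within tolerance_pct
    percentage points (of max_score) of the teacher's score."""
    limit = max_score * tolerance_pct / 100
    hits = sum(abs(t - a) <= limit for t, a in zip(teacher, ai))
    return hits / len(teacher)


def grade_agreement(teacher_grades: Sequence[str],
                    ai_grades: Sequence[str]) -> Tuple[float, float]:
    """Return (exact, adjacent): the fraction of exams with the same grade
    and the fraction whose grades are exactly one step apart."""
    steps = [abs(GRADES.index(t) - GRADES.index(a))
             for t, a in zip(teacher_grades, ai_grades)]
    exact = sum(s == 0 for s in steps) / len(steps)
    adjacent = sum(s == 1 for s in steps) / len(steps)
    return exact, adjacent


# Made-up final scores for five exams out of 100 points; not data from the study.
teacher_scores = [62, 75, 48, 90, 55]
ai_scores = [66, 71, 60, 84, 57]
teacher_grades = ["C", "B", "E", "A", "D"]
ai_grades = ["C", "C", "D", "B", "D"]

print(share_within(teacher_scores, ai_scores, 100, 10))  # 0.8
print(share_within(teacher_scores, ai_scores, 100, 5))   # 0.6
print(grade_agreement(teacher_grades, ai_grades))        # (0.4, 0.6)

# The abstract's count of gradings: 463 responses, each graded three times.
print(463 * 3)  # 1389
```

Under a different reading of "within 10%", for example relative to the teacher's score rather than to the maximum score, only the `limit` line would change.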




Source journal

British Educational Research Journal (Education & Educational Research)

CiteScore: 4.70

Self-citation rate: 8.70%


About the journal: The British Educational Research Journal is an international peer-reviewed medium for the publication of articles of interest to researchers in education and has rapidly become a major focal point for the publication of educational research from throughout the world. For further information on the association, please visit the British Educational Research Association website. The journal is interdisciplinary in approach, and includes reports of case studies, experiments and surveys, discussions of conceptual and methodological issues and of underlying assumptions in educational research, accounts of research in progress, and book reviews.
