Accuracy and reliability of large language models in assessing learning outcomes achievement across cognitive domains.

Impact factor: 1.7 | Zone 4 (Education) | JCR Q2 (Education, Scientific Disciplines) | Advances in Physiology Education | Pub Date: 2024-12-01 | DOI: 10.1152/advan.00137.2024

Swapna Haresh Teckwani, Amanda Huee-Ping Wong, Nathasha Vihangi Luke, Ivan Cherh Chiet Low

Advances in Physiology Education 48(4): 904-914. Journal Article.

Citations: 0

Abstract

The advent of artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT and Gemini, has significantly impacted the educational landscape, offering unique opportunities for learning and assessment. Written assessment grading is traditionally viewed as a laborious and subjective process. This study sought to evaluate the accuracy and reliability of these LLMs in assessing the achievement of learning outcomes across different cognitive domains in a scientific inquiry course on sports physiology. Human graders and three LLMs (GPT-3.5, GPT-4o, and Gemini) were tasked with scoring submitted student assignments according to a set of rubrics aligned with various cognitive domains, namely "Understand," "Analyze," and "Evaluate" from the revised Bloom's taxonomy and "Scientific Inquiry Competency." Our findings revealed that while LLMs demonstrated some level of competency, they do not yet meet the assessment standards of human graders. Specifically, interrater reliability (percentage agreement and correlation analysis) between human graders was superior to that between the two grading rounds of each LLM. Furthermore, concordance and correlation between human and LLM graders were mostly moderate to poor, both in overall scores and across the prespecified cognitive domains. The results suggest a future where AI could complement human expertise in educational assessment but underscore the importance of adaptive learning by educators and continuous improvement in current AI technologies to fully realize this potential.

NEW & NOTEWORTHY The advent of large language models (LLMs) such as ChatGPT and Gemini has offered new learning and assessment opportunities to integrate artificial intelligence (AI) with education. This study evaluated the accuracy of LLMs in assessing an assignment from a course on sports physiology. Concordance and correlation between human graders and LLMs were mostly moderate to poor. The findings suggest AI's potential to complement human expertise in educational assessment alongside the need for adaptive learning by educators.
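The abstract names two reliability measures, percentage agreement and correlation analysis, without giving formulas. The following is a minimal Python sketch of how such measures are commonly computed for two sets of rubric scores. It is not the authors' analysis: the choice of Pearson's coefficient, the `tolerance` parameter, the function names, and the sample scores are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def percentage_agreement(scores_a, scores_b, tolerance=0):
    """Percent of items on which two graders agree within `tolerance` points.

    `tolerance` is a hypothetical parameter; 0 means exact agreement.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return 100.0 * matches / len(scores_a)


def pearson_r(scores_a, scores_b):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    var_a = sum((a - mean_a) ** 2 for a in scores_a)
    var_b = sum((b - mean_b) ** 2 for b in scores_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when one grader gives constant scores")
    return cov / sqrt(var_a * var_b)


# Made-up scores for five assignments (illustrative only, not study data).
human_scores = [4, 3, 5, 2, 4]
llm_scores = [4, 2, 5, 3, 3]

print(percentage_agreement(human_scores, llm_scores))               # exact agreement
print(percentage_agreement(human_scores, llm_scores, tolerance=1))  # within one point
print(round(pearson_r(human_scores, llm_scores), 3))
```

The same two functions apply to either comparison described in the abstract: two human graders, two grading rounds of one LLM, or a human grader against an LLM. Percentage agreement is sensitive to how strictly a match is defined, which is why the sketch exposes a tolerance rather than fixing one.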




Source journal

Advances in Physiology Education (Medicine: Physiology)

CiteScore: 3.40
Self-citation rate: 19.00%
Articles published: 100
Review time: >12 weeks

About the journal: Advances in Physiology Education promotes and disseminates educational scholarship in order to enhance teaching and learning of physiology, neuroscience and pathophysiology. The journal publishes peer-reviewed descriptions of innovations that improve teaching in the classroom and laboratory, essays on education, and review articles based on our current understanding of physiological mechanisms. Submissions that evaluate new technologies for teaching and research, and educational pedagogy, are especially welcome. The audience for the journal includes educators at all levels: K–12, undergraduate, graduate, and professional programs.