Hands-on analysis of using large language models for the auto evaluation of programming assignments

Information Systems · Impact Factor: 3.0 · CAS Tier 2 (Computer Science) · JCR Q2 (Computer Science, Information Systems) · Pub Date: 2024-10-15 · DOI: 10.1016/j.is.2024.102473
Kareem Mohamed, Mina Yousef, Walaa Medhat, Ensaf Hussein Mohamed, Ghada Khoriba, Tamer Arafa
{"title":"Hands-on analysis of using large language models for the auto evaluation of programming assignments","authors":"Kareem Mohamed ,&nbsp;Mina Yousef ,&nbsp;Walaa Medhat ,&nbsp;Ensaf Hussein Mohamed ,&nbsp;Ghada Khoriba ,&nbsp;Tamer Arafa","doi":"10.1016/j.is.2024.102473","DOIUrl":null,"url":null,"abstract":"<div><div>The increasing adoption of programming education necessitates efficient and accurate methods for evaluating students’ coding assignments. Traditional manual grading is time-consuming, often inconsistent, and prone to subjective biases. This paper explores the application of large language models (LLMs) for the automated evaluation of programming assignments. LLMs can use advanced natural language processing capabilities to assess code quality, functionality, and adherence to best practices, providing detailed feedback and grades. We demonstrate the effectiveness of LLMs through experiments comparing their performance with human evaluators across various programming tasks. Our study evaluates the performance of several LLMs for automated grading. Gemini 1.5 Pro achieves an exact match accuracy of 86% and a <span><math><mrow><mo>±</mo><mn>1</mn></mrow></math></span> accuracy of 98%. GPT-4o also demonstrates strong performance, with exact match and <span><math><mrow><mo>±</mo><mn>1</mn></mrow></math></span> accuracies of 69% and 97%, respectively. Both models correlate highly with human evaluations, indicating their potential for reliable automated grading. However, models such as Llama 3 70B and Mixtral 8 <span><math><mo>×</mo></math></span> 7B exhibit low accuracy and alignment with human grading, particularly in problem-solving tasks. These findings suggest that advanced LLMs are instrumental in scalable, automated educational assessment. Additionally, LLMs enhance the learning experience by offering personalized, instant feedback, fostering an iterative learning process. The findings suggest that LLMs could play a pivotal role in the future of programming education, ensuring scalability and consistency in evaluation.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"128 ","pages":"Article 102473"},"PeriodicalIF":3.0000,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306437924001315","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

The increasing adoption of programming education necessitates efficient and accurate methods for evaluating students' coding assignments. Traditional manual grading is time-consuming, often inconsistent, and prone to subjective biases. This paper explores the application of large language models (LLMs) for the automated evaluation of programming assignments. LLMs can use advanced natural language processing capabilities to assess code quality, functionality, and adherence to best practices, providing detailed feedback and grades. We demonstrate the effectiveness of LLMs through experiments comparing their performance with human evaluators across various programming tasks. Our study evaluates the performance of several LLMs for automated grading. Gemini 1.5 Pro achieves an exact-match accuracy of 86% and a ±1 accuracy of 98%. GPT-4o also demonstrates strong performance, with exact-match and ±1 accuracies of 69% and 97%, respectively. Both models correlate highly with human evaluations, indicating their potential for reliable automated grading. However, models such as Llama 3 70B and Mixtral 8×7B exhibit low accuracy and alignment with human grading, particularly in problem-solving tasks. These findings suggest that advanced LLMs are instrumental in scalable, automated educational assessment. Additionally, LLMs enhance the learning experience by offering personalized, instant feedback, fostering an iterative learning process. The findings suggest that LLMs could play a pivotal role in the future of programming education, ensuring scalability and consistency in evaluation.
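The two headline metrics are simple to reproduce from paired grade lists. The following is a minimal sketch in Python, not the authors' code: exact-match accuracy counts grades that agree exactly, ±1 accuracy counts grades within one point of the human grade, and a plain Pearson correlation gauges model-human agreement. The 0-10 grade scale, the sample data, and all function names are assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of the paper's evaluation metrics.
# Assumes integer grades on a hypothetical 0-10 scale; all names are illustrative.

def exact_match_accuracy(model_grades, human_grades):
    """Fraction of submissions where the model grade equals the human grade."""
    hits = sum(m == h for m, h in zip(model_grades, human_grades))
    return hits / len(human_grades)

def within_one_accuracy(model_grades, human_grades):
    """Fraction of submissions where the model grade is within +/-1 point."""
    hits = sum(abs(m - h) <= 1 for m, h in zip(model_grades, human_grades))
    return hits / len(human_grades)

def pearson_correlation(xs, ys):
    """Plain Pearson correlation, used here to gauge model-human agreement."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical grades for five submissions (not data from the study).
human = [8, 5, 10, 7, 3]
model = [8, 6, 10, 7, 2]
print(exact_match_accuracy(model, human))            # 0.6
print(within_one_accuracy(model, human))             # 1.0
print(round(pearson_correlation(model, human), 3))   # ~0.973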
Source journal: Information Systems
Category: Engineering & Technology / Computer Science: Information Systems
CiteScore: 9.40
Self-citation rate: 2.70%
Annual publications: 112
Review time: 53 days
Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT), as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special-purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and the Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.
Latest articles in this journal:
Two-level massive string dictionaries
A generative and discriminative model for diversity-promoting recommendation
Soundness unknotted: An efficient soundness checking algorithm for arbitrary cyclic process models by loosening loops
The composition diagram of a complex process: Enhancing understanding of hierarchical business processes
Editorial Board