Hands-on analysis of using large language models for the auto evaluation of programming assignments

Information Systems · Impact Factor: 3.0 · CAS Tier 2 (Computer Science) · JCR Q2 (Computer Science, Information Systems) · Pub Date: 2024-10-15 · DOI: 10.1016/j.is.2024.102473
Kareem Mohamed, Mina Yousef, Walaa Medhat, Ensaf Hussein Mohamed, Ghada Khoriba, Tamer Arafa
{"title":"Hands-on analysis of using large language models for the auto evaluation of programming assignments","authors":"Kareem Mohamed ,&nbsp;Mina Yousef ,&nbsp;Walaa Medhat ,&nbsp;Ensaf Hussein Mohamed ,&nbsp;Ghada Khoriba ,&nbsp;Tamer Arafa","doi":"10.1016/j.is.2024.102473","DOIUrl":null,"url":null,"abstract":"<div><div>The increasing adoption of programming education necessitates efficient and accurate methods for evaluating students’ coding assignments. Traditional manual grading is time-consuming, often inconsistent, and prone to subjective biases. This paper explores the application of large language models (LLMs) for the automated evaluation of programming assignments. LLMs can use advanced natural language processing capabilities to assess code quality, functionality, and adherence to best practices, providing detailed feedback and grades. We demonstrate the effectiveness of LLMs through experiments comparing their performance with human evaluators across various programming tasks. Our study evaluates the performance of several LLMs for automated grading. Gemini 1.5 Pro achieves an exact match accuracy of 86% and a <span><math><mrow><mo>±</mo><mn>1</mn></mrow></math></span> accuracy of 98%. GPT-4o also demonstrates strong performance, with exact match and <span><math><mrow><mo>±</mo><mn>1</mn></mrow></math></span> accuracies of 69% and 97%, respectively. Both models correlate highly with human evaluations, indicating their potential for reliable automated grading. However, models such as Llama 3 70B and Mixtral 8 <span><math><mo>×</mo></math></span> 7B exhibit low accuracy and alignment with human grading, particularly in problem-solving tasks. These findings suggest that advanced LLMs are instrumental in scalable, automated educational assessment. Additionally, LLMs enhance the learning experience by offering personalized, instant feedback, fostering an iterative learning process. The findings suggest that LLMs could play a pivotal role in the future of programming education, ensuring scalability and consistency in evaluation.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"128 ","pages":"Article 102473"},"PeriodicalIF":3.0000,"publicationDate":"2024-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306437924001315","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

The increasing adoption of programming education necessitates efficient and accurate methods for evaluating students' coding assignments. Traditional manual grading is time-consuming, often inconsistent, and prone to subjective biases. This paper explores the application of large language models (LLMs) for the automated evaluation of programming assignments. LLMs can use advanced natural language processing capabilities to assess code quality, functionality, and adherence to best practices, providing detailed feedback and grades. We demonstrate the effectiveness of LLMs through experiments comparing their performance with human evaluators across various programming tasks. Our study evaluates the performance of several LLMs for automated grading. Gemini 1.5 Pro achieves an exact-match accuracy of 86% and a ±1 accuracy of 98%. GPT-4o also demonstrates strong performance, with exact-match and ±1 accuracies of 69% and 97%, respectively. Both models correlate highly with human evaluations, indicating their potential for reliable automated grading. However, models such as Llama 3 70B and Mixtral 8×7B exhibit low accuracy and alignment with human grading, particularly in problem-solving tasks. These findings suggest that advanced LLMs are instrumental in scalable, automated educational assessment. Additionally, LLMs enhance the learning experience by offering personalized, instant feedback, fostering an iterative learning process. The findings suggest that LLMs could play a pivotal role in the future of programming education, ensuring scalability and consistency in evaluation.
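The two headline metrics are simple to reproduce from paired grade lists. The following is a minimal sketch in Python, not the authors' code: exact-match accuracy counts grades that agree exactly, ±1 accuracy counts grades within one point of the human grade, and a plain Pearson correlation gauges model-human agreement. The 0-10 grade scale, the sample data, and all function names are assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of the paper's evaluation metrics.
# Assumes integer grades on a hypothetical 0-10 scale; all names are illustrative.

def exact_match_accuracy(model_grades, human_grades):
    """Fraction of submissions where the model grade equals the human grade."""
    hits = sum(m == h for m, h in zip(model_grades, human_grades))
    return hits / len(human_grades)

def within_one_accuracy(model_grades, human_grades):
    """Fraction of submissions where the model grade is within +/-1 point."""
    hits = sum(abs(m - h) <= 1 for m, h in zip(model_grades, human_grades))
    return hits / len(human_grades)

def pearson_correlation(xs, ys):
    """Plain Pearson correlation, used here to gauge model-human agreement."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical grades for five submissions (not data from the study).
human = [8, 5, 10, 7, 3]
model = [8, 6, 10, 7, 2]
print(exact_match_accuracy(model, human))            # 0.6
print(within_one_accuracy(model, human))             # 1.0
print(round(pearson_correlation(model, human), 3))   # ~0.973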
Source journal: Information Systems
Category: Engineering & Technology / Computer Science: Information Systems
CiteScore: 9.40
Self-citation rate: 2.70%
Annual publications: 112
Review time: 53 days
Journal description: Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT), as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special-purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and the Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.
Latest articles in this journal:
Two-level massive string dictionaries
A generative and discriminative model for diversity-promoting recommendation
Soundness unknotted: An efficient soundness checking algorithm for arbitrary cyclic process models by loosening loops
The composition diagram of a complex process: Enhancing understanding of hierarchical business processes
Editorial Board