Evaluating the Quality of LLM-Generated Explanations for Logical Errors in CS1 Student Programs

Rishabh Balse, Viraj Kumar, Prajish Prasad, J. Warriem
{"title":"评估 CS1 学生课程中由 LLM 生成的逻辑错误解释的质量","authors":"Rishabh Balse, Viraj Kumar, Prajish Prasad, J. Warriem","doi":"10.1145/3627217.3627233","DOIUrl":null,"url":null,"abstract":"When students in CS1 (Introductory Programming) write erroneous code, course staff can use automated tools to provide various types of helpful feedback. In this paper, we focus on syntactically correct student code containing logical errors. Tools that explain logical errors typically require course staff to invest greater effort than tools that detect such errors. To reduce this effort, prior work has investigated the use of Large Language Models (LLMs) such as GPT-3 to generate explanations. Unfortunately, these explanations can be incomplete or incorrect, and therefore unhelpful if presented to students directly. Nevertheless, LLM-generated explanations may be of adequate quality for Teaching Assistants (TAs) to efficiently craft helpful explanations on their basis. We evaluate the quality of explanations generated by an LLM (GPT-3.5-turbo) in two ways, for 30 buggy student solutions across 6 code-writing problems. First, in a study with 5 undergraduate TAs, we compare TA perception of LLM-generated and peer-generated explanation quality. TAs were unaware which explanations were LLM-generated, but they found them to be comparable in quality to peer-generated explanations. Second, we performed a detailed manual analysis of LLM-generated explanations for all 30 buggy solutions. We found at least one incorrect statement in 15/30 explanations (50%). However, in 28/30 cases (93%), the LLM-generated explanation correctly identified at least one logical error. Our results suggest that for large CS1 courses, TAs with adequate training to detect erroneous statements may be able to extract value from such explanations.","PeriodicalId":508655,"journal":{"name":"Proceedings of the 16th Annual ACM India Compute Conference","volume":"33 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating the Quality of LLM-Generated Explanations for Logical Errors in CS1 Student Programs\",\"authors\":\"Rishabh Balse, Viraj Kumar, Prajish Prasad, J. Warriem\",\"doi\":\"10.1145/3627217.3627233\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"When students in CS1 (Introductory Programming) write erroneous code, course staff can use automated tools to provide various types of helpful feedback. In this paper, we focus on syntactically correct student code containing logical errors. Tools that explain logical errors typically require course staff to invest greater effort than tools that detect such errors. To reduce this effort, prior work has investigated the use of Large Language Models (LLMs) such as GPT-3 to generate explanations. Unfortunately, these explanations can be incomplete or incorrect, and therefore unhelpful if presented to students directly. Nevertheless, LLM-generated explanations may be of adequate quality for Teaching Assistants (TAs) to efficiently craft helpful explanations on their basis. We evaluate the quality of explanations generated by an LLM (GPT-3.5-turbo) in two ways, for 30 buggy student solutions across 6 code-writing problems. First, in a study with 5 undergraduate TAs, we compare TA perception of LLM-generated and peer-generated explanation quality. 
TAs were unaware which explanations were LLM-generated, but they found them to be comparable in quality to peer-generated explanations. Second, we performed a detailed manual analysis of LLM-generated explanations for all 30 buggy solutions. We found at least one incorrect statement in 15/30 explanations (50%). However, in 28/30 cases (93%), the LLM-generated explanation correctly identified at least one logical error. Our results suggest that for large CS1 courses, TAs with adequate training to detect erroneous statements may be able to extract value from such explanations.\",\"PeriodicalId\":508655,\"journal\":{\"name\":\"Proceedings of the 16th Annual ACM India Compute Conference\",\"volume\":\"33 4\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 16th Annual ACM India Compute Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3627217.3627233\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th Annual ACM India Compute Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3627217.3627233","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

When students in CS1 (Introductory Programming) write erroneous code, course staff can use automated tools to provide various types of helpful feedback. In this paper, we focus on syntactically correct student code containing logical errors. Tools that explain logical errors typically require course staff to invest greater effort than tools that detect such errors. To reduce this effort, prior work has investigated the use of Large Language Models (LLMs) such as GPT-3 to generate explanations. Unfortunately, these explanations can be incomplete or incorrect, and therefore unhelpful if presented to students directly. Nevertheless, LLM-generated explanations may be of adequate quality for Teaching Assistants (TAs) to efficiently craft helpful explanations on their basis. We evaluate the quality of explanations generated by an LLM (GPT-3.5-turbo) in two ways, for 30 buggy student solutions across 6 code-writing problems. First, in a study with 5 undergraduate TAs, we compare TA perception of LLM-generated and peer-generated explanation quality. TAs were unaware which explanations were LLM-generated, but they found them to be comparable in quality to peer-generated explanations. Second, we performed a detailed manual analysis of LLM-generated explanations for all 30 buggy solutions. We found at least one incorrect statement in 15/30 explanations (50%). However, in 28/30 cases (93%), the LLM-generated explanation correctly identified at least one logical error. Our results suggest that for large CS1 courses, TAs with adequate training to detect erroneous statements may be able to extract value from such explanations.
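
The abstract does not reproduce the paper's prompts, but the setup it describes (asking GPT-3.5-turbo to explain logical errors in syntactically correct student code) can be illustrated with a minimal sketch. The sketch below assumes the OpenAI Python client; the problem statement, prompt wording, and the sum_to_n off-by-one bug are invented for illustration and are not taken from the study.

    # Minimal sketch: prompting GPT-3.5-turbo to explain a logical error.
    # Assumes the OpenAI Python client and OPENAI_API_KEY in the environment;
    # the prompt and the buggy example are hypothetical, not the paper's.
    from openai import OpenAI

    client = OpenAI()

    problem = "Write a function sum_to_n(n) that returns 1 + 2 + ... + n."

    # Syntactically correct, but logically wrong: range(1, n) excludes n.
    buggy_code = """def sum_to_n(n):
        total = 0
        for i in range(1, n):
            total += i
        return total
    """

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You explain logical errors in beginner Python code."},
            {"role": "user",
             "content": f"Problem: {problem}\n\nStudent code:\n{buggy_code}\n"
                        "Explain any logical errors in this solution."},
        ],
    )
    print(response.choices[0].message.content)

As the abstract cautions, the text returned by such a call may itself contain incorrect statements, which is why the authors position these explanations as raw material for trained TAs rather than as feedback shown directly to students.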