{"title":"Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering","authors":"Zhongzhou Chen, Tong Wan","doi":"arxiv-2407.15251","DOIUrl":null,"url":null,"abstract":"Large language modules (LLMs) have great potential for auto-grading student\nwritten responses to physics problems due to their capacity to process and\ngenerate natural language. In this explorative study, we use a prompt\nengineering technique, which we name \"scaffolded chain of thought (COT)\", to\ninstruct GPT-3.5 to grade student written responses to a physics conceptual\nquestion. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to\nexplicitly compare student responses to a detailed, well-explained rubric\nbefore generating the grading outcome. We show that when compared to human\nraters, the grading accuracy of GPT-3.5 using scaffolded COT is 20% - 30%\nhigher than conventional COT. The level of agreement between AI and human\nraters can reach 70% - 80%, comparable to the level between two human raters.\nThis shows promise that an LLM-based AI grader can achieve human-level grading\naccuracy on a physics conceptual problem using prompt engineering techniques\nalone.","PeriodicalId":501565,"journal":{"name":"arXiv - PHYS - Physics Education","volume":"245 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Physics Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.15251","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Large language models (LLMs) have great potential for auto-grading student
written responses to physics problems due to their capacity to process and
generate natural language. In this exploratory study, we use a prompt
engineering technique, which we name "scaffolded chain of thought (COT)", to
instruct GPT-3.5 to grade student written responses to a physics conceptual
question. Compared to common COT prompting, scaffolded COT prompts GPT-3.5 to
explicitly compare student responses to a detailed, well-explained rubric
before generating the grading outcome. We show that when compared to human
raters, the grading accuracy of GPT-3.5 using scaffolded COT is 20% - 30%
higher than that of conventional COT. The level of agreement between AI and human
raters can reach 70% - 80%, comparable to the level between two human raters.
These results suggest that an LLM-based AI grader can achieve human-level
grading accuracy on a physics conceptual problem using prompt engineering
techniques alone.
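
The abstract does not reproduce the authors' prompt, but the sketch below illustrates what a scaffolded-COT grading call could look like: the model is told to work through each rubric item explicitly (quote the relevant part of the student response, judge it against the item, award points) before reporting a total score. It assumes the OpenAI Python client (v1.x) and the gpt-3.5-turbo model; the question, rubric, scoring scale, and prompt wording are hypothetical placeholders, not the ones used in the study.

```python
# Minimal sketch of a scaffolded chain-of-thought grading call (illustrative,
# not the authors' exact prompt). Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical conceptual question and partial-credit rubric.
QUESTION = "A ball is thrown straight up. What is its acceleration at the highest point?"

RUBRIC = """\
Item 1 (1 pt): States that the acceleration is not zero at the highest point.
Item 2 (1 pt): Identifies gravity as the only force acting on the ball.
Item 3 (1 pt): Gives the acceleration as g (about 9.8 m/s^2), directed downward.
"""

def grade_response(student_response: str) -> str:
    """Ask the model to compare the response to each rubric item before scoring."""
    messages = [
        {
            "role": "system",
            "content": (
                "You are grading a student's written answer to a physics "
                "conceptual question. For EACH rubric item: first quote the "
                "part of the student response relevant to that item, then "
                "explain whether it satisfies the item, then award 0 or 1 "
                "point. Only after covering every item, report the total "
                "score out of 3."
            ),
        },
        {
            "role": "user",
            "content": f"Question:\n{QUESTION}\n\nRubric:\n{RUBRIC}\n"
                       f"Student response:\n{student_response}",
        },
    ]
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,  # reduce run-to-run variation in grading
    )
    return completion.choices[0].message.content

if __name__ == "__main__":
    print(grade_response("The acceleration is zero because the ball stops moving."))
```

The key difference from a plain chain-of-thought prompt is that the reasoning is scaffolded by the rubric itself: the model must address every rubric item in turn rather than free-associate before producing a score.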