A Comparative Analysis of the Rating of College Students’ Essays by ChatGPT versus Human Raters

Potchong M. Jackaria, Bonjovi Hassan Hajan, Al-Rashiff H. Mastul
{"title":"A Comparative Analysis of the Rating of College Students’ Essays by ChatGPT versus Human Raters","authors":"Potchong M. Jackaria, Bonjovi Hassan Hajan, Al-Rashiff H. Mastul","doi":"10.26803/ijlter.23.2.23","DOIUrl":null,"url":null,"abstract":"The use of generative artificial intelligence (AI) in education has engendered mixed reactions due to its ability to generate human-like responses to questions. For education to benefit from this modern technology, there is a need to determine how such capability can be used to improve teaching and learning. Hence, using a comparative−descriptive research design, this study aimed to perform a comparative analysis between Chat Generative Pre-Trained Transformer (ChatGPT) version 3.5 and human raters in scoring students’ essays. Twenty essays were used of college students in a professional education course at the Mindanao State University – Tawi-Tawi College of Technology and Oceanography, a public university in southern Philippines. The essays were rated independently by three human raters using a scoring rubric from Carrol and West (1989) as adapted by Tuyen et al. (2019). For the AI ratings, the essays were encoded and inputted into ChatGPT 3.5 using prompts and the rubric. The responses were then screenshotted and recorded along with the human ratings for statistical analysis. Using the intraclass correlation coefficient (ICC), results show that among the human raters, the consistency was good, indicating the reliability of the rubric, while a moderate consistency was found in the ChatGPT 3.5 ratings. Comparison of the human and ChatGPT 3.5 ratings show poor consistency, implying the that the ratings of human raters and ChatGPT 3.5 were not linearly related. The finding implies that teachers should be cautious when using ChatGPT in rating students’ written works, suggesting further that using ChatGPT 3.5, in its current version, still needs human assistance to ensure the accuracy of its generated information. 
Rating of other types of student works using ChatGPT 3.5 or other generative AI tools may be investigated in future research.","PeriodicalId":37101,"journal":{"name":"International Journal of Learning, Teaching and Educational Research","volume":"73 4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Learning, Teaching and Educational Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.26803/ijlter.23.2.23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Social Sciences","Score":null,"Total":0}
引用次数: 0

Abstract

The use of generative artificial intelligence (AI) in education has engendered mixed reactions due to its ability to generate human-like responses to questions. For education to benefit from this modern technology, there is a need to determine how such capability can be used to improve teaching and learning. Hence, using a comparative-descriptive research design, this study aimed to perform a comparative analysis between Chat Generative Pre-Trained Transformer (ChatGPT) version 3.5 and human raters in scoring students’ essays. Twenty essays by college students in a professional education course at the Mindanao State University – Tawi-Tawi College of Technology and Oceanography, a public university in the southern Philippines, were used. The essays were rated independently by three human raters using a scoring rubric from Carrol and West (1989) as adapted by Tuyen et al. (2019). For the AI ratings, the essays were encoded and inputted into ChatGPT 3.5 along with prompts and the rubric. The responses were then screenshotted and recorded alongside the human ratings for statistical analysis. Using the intraclass correlation coefficient (ICC), results show good consistency among the human raters, indicating the reliability of the rubric, while moderate consistency was found in the ChatGPT 3.5 ratings. Comparison of the human and ChatGPT 3.5 ratings shows poor consistency, implying that the ratings of human raters and ChatGPT 3.5 were not linearly related. The finding implies that teachers should be cautious when using ChatGPT to rate students’ written works, suggesting further that ChatGPT 3.5, in its current version, still needs human assistance to ensure the accuracy of its generated information. Rating of other types of student works using ChatGPT 3.5 or other generative AI tools may be investigated in future research.
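The study does not publish its analysis code, and the abstract does not state which ICC form was used. As a minimal sketch, assuming a two-way consistency ICC with a single-rater unit (ICC(3,1)) and a hypothetical 10-point rubric, the reliability statistic described above could be computed from a subjects-by-raters score matrix as follows; the `scores` data below are illustrative, not the study's:

```python
import numpy as np

def icc_consistency(ratings: np.ndarray) -> float:
    """ICC(3,1): two-way model, consistency definition, single rater.

    ratings: (n_subjects, k_raters) matrix of scores.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-essay means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-essay variance
    ss_cols = n * ((col_means - grand) ** 2).sum()   # systematic rater severity
    ss_error = ss_total - ss_rows - ss_cols          # residual (inconsistency)

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Hypothetical example: 5 essays, 3 raters, scores on a 10-point rubric.
scores = np.array([
    [7, 8, 7],
    [5, 5, 6],
    [9, 9, 8],
    [4, 5, 4],
    [8, 8, 9],
], dtype=float)
print(round(icc_consistency(scores), 3))
```

Because the consistency definition removes the rater-mean (column) effect, a rater who is uniformly harsher than the others does not lower the coefficient; only rank-order disagreement does, which matches the abstract's interpretation that poor human-versus-ChatGPT consistency means the two sets of ratings were not linearly related.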