{"title":"A Comparison of Machine-Graded (ChatGPT) and Human-Graded Essay Scores in Veterinary Admissions","authors":"Raphael Vanderstichel, Henrik Stryhn","doi":"10.3138/jvme-2023-0162","DOIUrl":null,"url":null,"abstract":"Admissions committees have historically emphasized cognitive measures, but a paradigm shift toward holistic reviews now places greater importance on non-cognitive skills. These holistic reviews may include personal statements, experiences, references, interviews, multiple mini-interviews, and situational judgment tests, often requiring substantial faculty resources. Leveraging advances in artificial intelligence, particularly in natural language processing, this study was conducted to assess the agreement of essay scores graded by both humans and machines (OpenAI's ChatGPT). Correlations were calculated among these scores and cognitive and non-cognitive measures in the admissions process. Human-derived scores from 778 applicants in 2021 and 552 in 2022 had item-specific inter-rater reliabilities ranging from 0.07 to 0.41, while machine-derived inter-replicate reliabilities ranged from 0.41 to 0.61. Pairwise correlations between human- and machine-derived essay scores and other admissions criteria revealed moderate correlations between the two scoring methods (0.41) and fair correlations between the essays and the multiple mini-interview (0.20 and 0.22 for human and machine scores, respectively). Despite having very low correlations, machine-graded scores exhibited slightly stronger correlations with cognitive measures (0.10 to 0.15) compared to human-graded scores (0.01 to 0.02). Importantly, machine scores demonstrated higher precision, approximately two to three times greater than human scores in both years. This study emphasizes the importance of careful item design, rubric development, and prompt formulation when using machine-based essay grading. It also underscores the importance of employing replicates and robust statistical analyses to ensure equitable applicant ranking when integrating machine grading into the admissions process.","PeriodicalId":17575,"journal":{"name":"Journal of veterinary medical education","volume":null,"pages":null},"PeriodicalIF":1.1000,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of veterinary medical education","FirstCategoryId":"97","ListUrlMain":"https://doi.org/10.3138/jvme-2023-0162","RegionNum":3,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Abstract
Admissions committees have historically emphasized cognitive measures, but a paradigm shift toward holistic reviews now places greater importance on non-cognitive skills. These holistic reviews may include personal statements, experiences, references, interviews, multiple mini-interviews, and situational judgment tests, often requiring substantial faculty resources. Leveraging advances in artificial intelligence, particularly in natural language processing, this study was conducted to assess the agreement of essay scores graded by both humans and machines (OpenAI's ChatGPT). Correlations were calculated between these scores and the cognitive and non-cognitive measures used in the admissions process. Human-derived scores from 778 applicants in 2021 and 552 in 2022 had item-specific inter-rater reliabilities ranging from 0.07 to 0.41, while machine-derived inter-replicate reliabilities ranged from 0.41 to 0.61. Pairwise correlations between human- and machine-derived essay scores and other admissions criteria revealed moderate correlations between the two scoring methods (0.41) and fair correlations between the essays and the multiple mini-interview (0.20 and 0.22 for human and machine scores, respectively). Although both sets of correlations were very low, machine-graded scores showed slightly stronger correlations with cognitive measures (0.10 to 0.15) than human-graded scores did (0.01 to 0.02). Importantly, machine scores demonstrated higher precision, approximately two to three times greater than human scores in both years. This study emphasizes the importance of careful item design, rubric development, and prompt formulation when using machine-based essay grading. It also underscores the need for replicates and robust statistical analyses to ensure equitable applicant ranking when integrating machine grading into the admissions process.
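As a rough illustration of the kind of agreement analysis the abstract describes, the sketch below simulates human ratings and replicate machine (ChatGPT) ratings for a set of essays and computes Pearson correlations as a simple stand-in for the inter-rater, inter-replicate, and human-versus-machine statistics reported above. All column names, simulated effect sizes, the cohort size, and the choice of Pearson's r are assumptions for demonstration; the authors' actual data and statistical models are not reproduced here.

```python
# Illustrative sketch only: simulated scores, hypothetical column names,
# and Pearson's r as a simple agreement statistic (not the paper's method).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_applicants = 200  # hypothetical cohort size, not the paper's 778/552

# Simulated essay scores: two human raters and three machine replicates per essay.
scores = pd.DataFrame({"human_1": rng.normal(70, 10, n_applicants)})
scores["human_2"] = 0.3 * scores["human_1"] + rng.normal(49, 10, n_applicants)
scores["machine_1"] = 0.4 * scores["human_1"] + rng.normal(42, 8, n_applicants)
scores["machine_2"] = 0.6 * scores["machine_1"] + rng.normal(28, 6, n_applicants)
scores["machine_3"] = 0.6 * scores["machine_1"] + rng.normal(28, 6, n_applicants)

# Inter-rater (human) and inter-replicate (machine) agreement via Pearson's r.
r_human, _ = pearsonr(scores["human_1"], scores["human_2"])
r_machine, _ = pearsonr(scores["machine_2"], scores["machine_3"])

# Pairwise correlation between averaged human and machine essay scores,
# analogous to the human-vs-machine correlation reported in the abstract.
human_mean = scores[["human_1", "human_2"]].mean(axis=1)
machine_mean = scores[["machine_1", "machine_2", "machine_3"]].mean(axis=1)
r_methods, _ = pearsonr(human_mean, machine_mean)

print(f"inter-rater (human):       r = {r_human:.2f}")
print(f"inter-replicate (machine): r = {r_machine:.2f}")
print(f"human vs machine means:    r = {r_methods:.2f}")
```

One design point this kind of analysis makes concrete: averaging several machine replicates per essay reduces the variance of the machine-derived score, which is consistent with the abstract's observation that machine scores were more precise and its emphasis on using replicates when ranking applicants.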
About the Journal
The Journal of Veterinary Medical Education (JVME) is the peer-reviewed scholarly journal of the Association of American Veterinary Medical Colleges (AAVMC). As an internationally distributed journal, JVME provides a forum for the exchange of ideas, research, and discoveries about veterinary medical education. This exchange benefits veterinary faculty, students, and the veterinary profession as a whole by preparing veterinarians to better perform their professional activities and to meet the needs of society.
The journal’s areas of focus include best practices and educational methods in veterinary education; recruitment, training, and mentoring of students at all levels of education, including undergraduate, graduate, veterinary technology, and continuing education; clinical instruction and assessment; institutional policy; and other challenges and issues faced by veterinary educators domestically and internationally. Veterinary faculty of all countries are encouraged to participate as contributors, reviewers, and institutional representatives.