{"title":"A Comparison of Machine-Graded (ChatGPT) and Human-Graded Essay Scores in Veterinary Admissions","authors":"Raphael Vanderstichel, Henrik Stryhn","doi":"10.3138/jvme-2023-0162","DOIUrl":null,"url":null,"abstract":"Admissions committees have historically emphasized cognitive measures, but a paradigm shift toward holistic reviews now places greater importance on non-cognitive skills. These holistic reviews may include personal statements, experiences, references, interviews, multiple mini-interviews, and situational judgment tests, often requiring substantial faculty resources. Leveraging advances in artificial intelligence, particularly in natural language processing, this study was conducted to assess the agreement of essay scores graded by both humans and machines (OpenAI's ChatGPT). Correlations were calculated among these scores and cognitive and non-cognitive measures in the admissions process. Human-derived scores from 778 applicants in 2021 and 552 in 2022 had item-specific inter-rater reliabilities ranging from 0.07 to 0.41, while machine-derived inter-replicate reliabilities ranged from 0.41 to 0.61. Pairwise correlations between human- and machine-derived essay scores and other admissions criteria revealed moderate correlations between the two scoring methods (0.41) and fair correlations between the essays and the multiple mini-interview (0.20 and 0.22 for human and machine scores, respectively). Despite having very low correlations, machine-graded scores exhibited slightly stronger correlations with cognitive measures (0.10 to 0.15) compared to human-graded scores (0.01 to 0.02). Importantly, machine scores demonstrated higher precision, approximately two to three times greater than human scores in both years. This study emphasizes the importance of careful item design, rubric development, and prompt formulation when using machine-based essay grading. It also underscores the importance of employing replicates and robust statistical analyses to ensure equitable applicant ranking when integrating machine grading into the admissions process.","PeriodicalId":17575,"journal":{"name":"Journal of veterinary medical education","volume":null,"pages":null},"PeriodicalIF":1.1000,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of veterinary medical education","FirstCategoryId":"97","ListUrlMain":"https://doi.org/10.3138/jvme-2023-0162","RegionNum":3,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Abstract
Admissions committees have historically emphasized cognitive measures, but a paradigm shift toward holistic reviews now places greater importance on non-cognitive skills. These holistic reviews may include personal statements, experiences, references, interviews, multiple mini-interviews, and situational judgment tests, often requiring substantial faculty resources. Leveraging advances in artificial intelligence, particularly in natural language processing, this study was conducted to assess the agreement of essay scores graded by both humans and machines (OpenAI's ChatGPT). Correlations were calculated between these scores and the cognitive and non-cognitive measures used in the admissions process. Human-derived scores from 778 applicants in 2021 and 552 in 2022 had item-specific inter-rater reliabilities ranging from 0.07 to 0.41, while machine-derived inter-replicate reliabilities ranged from 0.41 to 0.61. Pairwise correlations between human- and machine-derived essay scores and other admissions criteria revealed moderate correlations between the two scoring methods (0.41) and fair correlations between the essays and the multiple mini-interview (0.20 and 0.22 for human and machine scores, respectively). Although both sets of correlations were very low, machine-graded scores showed slightly stronger correlations with cognitive measures (0.10 to 0.15) than human-graded scores did (0.01 to 0.02). Importantly, machine scores demonstrated higher precision, approximately two to three times greater than human scores in both years. This study emphasizes the importance of careful item design, rubric development, and prompt formulation when using machine-based essay grading. It also underscores the need for replicates and robust statistical analyses to ensure equitable applicant ranking when integrating machine grading into the admissions process.
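As a rough illustration of the kind of agreement analysis the abstract describes, the sketch below simulates human ratings and replicate machine (ChatGPT) ratings for a set of essays and computes Pearson correlations as a simple stand-in for the inter-rater, inter-replicate, and human-versus-machine statistics reported above. All column names, simulated effect sizes, the cohort size, and the choice of Pearson's r are assumptions for demonstration; the authors' actual data and statistical models are not reproduced here.

```python
# Illustrative sketch only: simulated scores, hypothetical column names,
# and Pearson's r as a simple agreement statistic (not the paper's method).
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_applicants = 200  # hypothetical cohort size, not the paper's 778/552

# Simulated essay scores: two human raters and three machine replicates per essay.
scores = pd.DataFrame({"human_1": rng.normal(70, 10, n_applicants)})
scores["human_2"] = 0.3 * scores["human_1"] + rng.normal(49, 10, n_applicants)
scores["machine_1"] = 0.4 * scores["human_1"] + rng.normal(42, 8, n_applicants)
scores["machine_2"] = 0.6 * scores["machine_1"] + rng.normal(28, 6, n_applicants)
scores["machine_3"] = 0.6 * scores["machine_1"] + rng.normal(28, 6, n_applicants)

# Inter-rater (human) and inter-replicate (machine) agreement via Pearson's r.
r_human, _ = pearsonr(scores["human_1"], scores["human_2"])
r_machine, _ = pearsonr(scores["machine_2"], scores["machine_3"])

# Pairwise correlation between averaged human and machine essay scores,
# analogous to the human-vs-machine correlation reported in the abstract.
human_mean = scores[["human_1", "human_2"]].mean(axis=1)
machine_mean = scores[["machine_1", "machine_2", "machine_3"]].mean(axis=1)
r_methods, _ = pearsonr(human_mean, machine_mean)

print(f"inter-rater (human):       r = {r_human:.2f}")
print(f"inter-replicate (machine): r = {r_machine:.2f}")
print(f"human vs machine means:    r = {r_methods:.2f}")
```

One design point this kind of analysis makes concrete: averaging several machine replicates per essay reduces the variance of the machine-derived score, which is consistent with the abstract's observation that machine scores were more precise and its emphasis on using replicates when ranking applicants.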
About the Journal
The Journal of Veterinary Medical Education (JVME) is the peer-reviewed scholarly journal of the Association of American Veterinary Medical Colleges (AAVMC). As an internationally distributed journal, JVME provides a forum for the exchange of ideas, research, and discoveries about veterinary medical education. This exchange benefits veterinary faculty, students, and the veterinary profession as a whole by preparing veterinarians to better perform their professional activities and to meet the needs of society.
The journal’s areas of focus include best practices and educational methods in veterinary education; recruitment, training, and mentoring of students at all levels of education, including undergraduate, graduate, veterinary technology, and continuing education; clinical instruction and assessment; institutional policy; and other challenges and issues faced by veterinary educators domestically and internationally. Veterinary faculty of all countries are encouraged to participate as contributors, reviewers, and institutional representatives.