Evaluating the psychometric properties of ChatGPT-generated questions

Shreya Bhandari, Yunting Liu, Yerin Kwak, Zachary A. Pardos
{"title":"Evaluating the psychometric properties of ChatGPT-generated questions","authors":"Shreya Bhandari ,&nbsp;Yunting Liu ,&nbsp;Yerin Kwak ,&nbsp;Zachary A. Pardos","doi":"10.1016/j.caeai.2024.100284","DOIUrl":null,"url":null,"abstract":"<div><p>Not much is known about how LLM-generated questions compare to gold-standard, traditional formative assessments concerning their difficulty and discrimination parameters, which are valued properties in the psychometric measurement field. We follow a rigorous measurement methodology to compare a set of ChatGPT-generated questions, produced from one lesson summary in a textbook, to existing questions from a published Creative Commons textbook. To do this, we collected and analyzed responses from 207 test respondents who answered questions from both item pools and used a linking methodology to compare IRT properties between the two pools. We find that neither the difficulty nor discrimination parameters of the 15 items in each pool differ statistically significantly, with some evidence that the ChatGPT items were marginally better at differentiating different respondent abilities. The response time also does not differ significantly between the two sources of items. The ChatGPT-generated items showed evidence of unidimensionality and did not affect the unidimensionality of the original set of items when tested together. Finally, through a fine-grained learning objective labeling analysis, we found greater similarity in the learning objective distribution of ChatGPT-generated items and the items from the target OpenStax lesson (0.9666) than between ChatGPT-generated items and adjacent OpenStax lessons (0.6859 for the previous lesson and 0.6153 for the subsequent lesson). These results corroborate our conclusion that generative AI can produce algebra items of similar quality to existing textbook questions that share the same construct or constructs as those questions.</p></div>","PeriodicalId":34469,"journal":{"name":"Computers and Education Artificial Intelligence","volume":"7 ","pages":"Article 100284"},"PeriodicalIF":0.0000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666920X24000870/pdfft?md5=91d7e8564077ef80c2ba5f18fa4e22fb&pid=1-s2.0-S2666920X24000870-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers and Education Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666920X24000870","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Social Sciences","Score":null,"Total":0}

Abstract

Not much is known about how LLM-generated questions compare to gold-standard, traditional formative assessments concerning their difficulty and discrimination parameters, which are valued properties in the psychometric measurement field. We follow a rigorous measurement methodology to compare a set of ChatGPT-generated questions, produced from one lesson summary in a textbook, to existing questions from a published Creative Commons textbook. To do this, we collected and analyzed responses from 207 test respondents who answered questions from both item pools and used a linking methodology to compare IRT properties between the two pools. We find that neither the difficulty nor discrimination parameters of the 15 items in each pool differ statistically significantly, with some evidence that the ChatGPT items were marginally better at differentiating different respondent abilities. The response time also does not differ significantly between the two sources of items. The ChatGPT-generated items showed evidence of unidimensionality and did not affect the unidimensionality of the original set of items when tested together. Finally, through a fine-grained learning objective labeling analysis, we found greater similarity in the learning objective distribution of ChatGPT-generated items and the items from the target OpenStax lesson (0.9666) than between ChatGPT-generated items and adjacent OpenStax lessons (0.6859 for the previous lesson and 0.6153 for the subsequent lesson). These results corroborate our conclusion that generative AI can produce algebra items of similar quality to existing textbook questions that share the same construct or constructs as those questions.
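The comparison rests on item response theory (IRT), in which each item is summarized by a discrimination parameter (how sharply the item separates respondents of different ability) and a difficulty parameter (the ability level at which a correct response becomes more likely than not). The sketch below shows the two-parameter logistic (2PL) item response function that these parameters enter; the parameter values are hypothetical, chosen only to illustrate how discrimination and difficulty shape the probability of a correct answer, and are not estimates from the paper.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response function: probability that a respondent with
    ability theta answers correctly an item with discrimination a and
    difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical parameter values for illustration only (not estimates from the paper).
textbook_item = {"a": 1.0, "b": 0.0}    # moderate discrimination, average difficulty
generated_item = {"a": 1.3, "b": 0.1}   # slightly more discriminating, similar difficulty

for theta in np.linspace(-3, 3, 7):
    p_tb = p_correct(theta, **textbook_item)
    p_gen = p_correct(theta, **generated_item)
    print(f"theta={theta:+.1f}  P(textbook)={p_tb:.2f}  P(generated)={p_gen:.2f}")
```

A steeper curve (larger a) means the item distinguishes respondents just below and just above its difficulty more sharply, which is the sense in which the abstract says the ChatGPT items were marginally better at differentiating respondent abilities.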

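The abstract also reports similarity scores between learning-objective distributions (0.9666 for the target OpenStax lesson versus 0.6859 and 0.6153 for adjacent lessons) but does not name the metric. The sketch below assumes cosine similarity over hypothetical learning-objective label counts purely to illustrate how such scores are computed; the vectors are not data from the study.

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine similarity between two non-negative label-count vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

# Hypothetical label counts over the same five learning objectives.
# These numbers are illustrative only; they are not data from the study.
chatgpt_items   = [6, 5, 3, 1, 0]   # ChatGPT-generated item pool
target_lesson   = [7, 4, 3, 1, 0]   # items from the target OpenStax lesson
previous_lesson = [1, 2, 0, 5, 7]   # items from the preceding OpenStax lesson

print(cosine_similarity(chatgpt_items, target_lesson))    # high overlap in objectives
print(cosine_similarity(chatgpt_items, previous_lesson))  # lower overlap
```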
Source journal: Computers and Education: Artificial Intelligence
CiteScore: 16.80
Self-citation rate: 0.00%
Articles published: 66
Review time: 50 days
Latest articles from this journal
- Enhancing data analysis and programming skills through structured prompt training: The impact of generative AI in engineering education
- Understanding the practices, perceptions, and (dis)trust of generative AI among instructors: A mixed-methods study in the U.S. higher education
- Technological self-efficacy and sense of coherence: Key drivers in teachers' AI acceptance and adoption
- The influence of AI literacy on complex problem-solving skills through systematic thinking skills and intuition thinking skills: An empirical study in Thai gen Z accounting students
- Psychometrics of an Elo-based large-scale online learning system