Field-Testing Multiple-Choice Questions With AI Examinees: English Grammar Items.

IF 2.1 3区 心理学 Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Educational and Psychological Measurement Pub Date : 2024-10-03 DOI:10.1177/00131644241281053
Hotaka Maeda
{"title":"Field-Testing Multiple-Choice Questions With AI Examinees: English Grammar Items.","authors":"Hotaka Maeda","doi":"10.1177/00131644241281053","DOIUrl":null,"url":null,"abstract":"<p><p>Field-testing is an essential yet often resource-intensive step in the development of high-quality educational assessments. I introduce an innovative method for field-testing newly written exam items by substituting human examinees with artificially intelligent (AI) examinees. The proposed approach is demonstrated using 466 four-option multiple-choice English grammar questions. Pre-trained transformer language models are fine-tuned based on the 2-parameter logistic (2PL) item response model to respond like human test-takers. Each AI examinee is associated with a latent ability θ, and the item text is used to predict response selection probabilities for each of the four response options. For the best modeling approach identified, the overall correlation between the true and predicted 2PL correct response probabilities was .82 (bias = 0.00, root mean squared error = 0.18). The study results were promising, showing that item response data generated from AI can be used to calculate item proportion correct, item discrimination, conduct item calibration with anchors, distractor analysis, dimensionality analysis, and latent trait scoring. However, the proposed approach did not achieve the level of accuracy obtainable with human examinee response data. If further refined, potential resource savings in transitioning from human to AI field-testing could be enormous. AI could shorten the field-testing timeline, prevent examinees from seeing low-quality field-test items in real exams, shorten test lengths, eliminate test security, item exposure, and sample size concerns, reduce overall cost, and help expand the item bank. Example Python code from this study is available on Github: https://github.com/hotakamaeda/ai_field_testing1.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644241281053"},"PeriodicalIF":2.1000,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11562880/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational and Psychological Measurement","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1177/00131644241281053","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

Field-testing is an essential yet often resource-intensive step in the development of high-quality educational assessments. I introduce an innovative method for field-testing newly written exam items by substituting human examinees with artificially intelligent (AI) examinees. The proposed approach is demonstrated using 466 four-option multiple-choice English grammar questions. Pre-trained transformer language models are fine-tuned based on the 2-parameter logistic (2PL) item response model to respond like human test-takers. Each AI examinee is associated with a latent ability θ, and the item text is used to predict response selection probabilities for each of the four response options. For the best modeling approach identified, the overall correlation between the true and predicted 2PL correct response probabilities was .82 (bias = 0.00, root mean squared error = 0.18). The study results were promising, showing that item response data generated from AI can be used to calculate item proportion correct, item discrimination, conduct item calibration with anchors, distractor analysis, dimensionality analysis, and latent trait scoring. However, the proposed approach did not achieve the level of accuracy obtainable with human examinee response data. If further refined, potential resource savings in transitioning from human to AI field-testing could be enormous. AI could shorten the field-testing timeline, prevent examinees from seeing low-quality field-test items in real exams, shorten test lengths, eliminate test security, item exposure, and sample size concerns, reduce overall cost, and help expand the item bank. Example Python code from this study is available on Github: https://github.com/hotakamaeda/ai_field_testing1.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
与人工智能考生一起实地测试多项选择题:英语语法项目。
在开发高质量的教育评估过程中,实地测试是必不可少的一步,但往往需要耗费大量资源。我介绍了一种创新方法,即用人工智能(AI)考生代替人类考生,对新编写的考试项目进行实地测试。我们使用 466 道四选一的英语语法选择题对所提出的方法进行了演示。预先训练好的转换器语言模型根据 2 参数逻辑(2PL)项目响应模型进行微调,以做出与人类考生类似的响应。每个人工智能考生都与潜在能力 θ 相关联,题目文本用于预测四个回答选项中每个选项的回答选择概率。在确定的最佳建模方法中,真实的 2PL 正确作答概率与预测的 2PL 正确作答概率之间的总体相关性为 0.82(偏差 = 0.00,均方根误差 = 0.18)。研究结果很有希望,表明人工智能生成的项目反应数据可用于计算项目正确率、项目区分度、使用锚点进行项目校准、干扰项分析、维度分析和潜在特质评分。然而,所提出的方法并没有达到使用人类考生答题数据所能达到的准确度。如果进一步改进,从人类实地测试过渡到人工智能实地测试可能会节省大量资源。人工智能可以缩短现场测试的时间,防止考生在真实考试中看到低质量的现场测试项目,缩短测试长度,消除测试安全、项目暴露和样本大小方面的顾虑,降低总体成本,并有助于扩大项目库。本研究的 Python 代码示例可在 Github 上获取:https://github.com/hotakamaeda/ai_field_testing1。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Educational and Psychological Measurement
Educational and Psychological Measurement 医学-数学跨学科应用
CiteScore
5.50
自引率
7.40%
发文量
49
审稿时长
6-12 weeks
期刊介绍: Educational and Psychological Measurement (EPM) publishes referred scholarly work from all academic disciplines interested in the study of measurement theory, problems, and issues. Theoretical articles address new developments and techniques, and applied articles deal with innovation applications.
期刊最新文献
Optimal Number of Replications for Obtaining Stable Dynamic Fit Index Cutoffs. Invariance: What Does Measurement Invariance Allow Us to Claim? Detecting Differential Item Functioning Using Response Time. Assessing the Speed-Accuracy Tradeoff in Psychological Testing Using Experimental Manipulations. On Latent Structure Examination of Behavioral Measuring Instruments in Complex Empirical Settings.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1