
Journal of Educational Measurement: Latest Publications

Evaluating General-Purpose Multimodal AI for Q-Matrix Generation from Math Items: A Cognitive Diagnostic Modeling Exploration
IF 1.6 | CAS Q4 (Psychology) | JCR Q3 (Psychology, Applied) | Pub Date: 2026-01-29 | DOI: 10.1111/jedm.70028
Kang Xue, James J. Appleton

Cognitive Diagnostic Models (CDMs) provide fine-grained diagnostic feedback, but their central component—the Q-matrix—remains costly and labor-intensive to construct. This study explores the automated generation of Q-matrices using general-purpose AI, including ChatGPT-4o, Gemini-2.5-pro, and Claude-sonnet-4. We evaluated two prompting strategies (all-at-once and one-by-one) across TIMSS 2007, TIMSS 2011, and PISA 2012 mathematics assessments. Results show that AI-generated Q-matrices approximate human baselines with competitive model fitting performance (AIC, BIC, log-likelihood, and SRMSR) and acceptable classification discrepancies. While AI predictions for larger and more complicated assessments (TIMSS 07 and 11) were generally sparser than human-generated Q-matrices, they still achieved equal or better fit statistics under most CDMs. In contrast, for the smaller and less complicated PISA 2012 assessment, AI-generated Q-matrices matched human density and fitting quality. Importantly, chatbot-human matching accuracy remained high across models, with Gemini benefiting from all-at-once prompting, ChatGPT-4o maintaining stable performance under both strategies, and Claude showing sensitivity to prompt structure. These findings highlight both the promise and current limitations of automated Q-matrix generation, underscoring opportunities for integrating LLMs into scalable diagnostic assessment practices.
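For readers unfamiliar with the data structure involved, the sketch below shows one simple way to represent a Q-matrix and compare an AI-generated version against a human baseline. The entry-wise agreement and density measures here are illustrative assumptions, not necessarily the exact metrics used in the study.

```python
# Minimal sketch (assumption: Q-matrices are binary item-by-attribute arrays;
# entry-wise agreement is one simple notion of "matching accuracy").
import numpy as np

def qmatrix_agreement(q_human: np.ndarray, q_ai: np.ndarray) -> dict:
    """Compare two binary Q-matrices of shape (n_items, n_attributes)."""
    assert q_human.shape == q_ai.shape
    return {"accuracy": (q_human == q_ai).mean(),  # entry-wise agreement
            "density_human": q_human.mean(),       # proportion of 1s
            "density_ai": q_ai.mean()}             # sparser => lower density

# Toy example: 4 items, 3 attributes
q_human = np.array([[1, 0, 0],
                    [1, 1, 0],
                    [0, 1, 1],
                    [0, 0, 1]])
q_ai = np.array([[1, 0, 0],
                 [1, 0, 0],   # AI drops one attribute -> sparser row
                 [0, 1, 1],
                 [0, 0, 1]])
print(qmatrix_agreement(q_human, q_ai))
```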

Citations: 0
Correction to “Using GPT-4 to Augment Imbalanced Data for Automatic Scoring”
IF 1.6 | CAS Q4 (Psychology) | JCR Q3 (Psychology, Applied) | Pub Date: 2026-01-28 | DOI: 10.1111/jedm.70033

Fang, L., Lee, G., & Zhai, X. (2025), Using GPT-4 to augment imbalanced data for automatic scoring. Journal of Educational Measurement, 62, 959–995. https://doi.org/10.1111/jedm.70020

Gyeonggeon Lee carried out his work for this article while at the University of Georgia and continued it after moving to Nanyang Technological University, Singapore, where he is now affiliated. The original article erroneously indicated that the work was conducted solely at the University of Georgia.

We apologize for this error.

Citations: 0
AI and Measurement Concerns: Dealing with Imbalanced Data in Autoscoring
IF 1.6 | CAS Q4 (Psychology) | JCR Q3 (Psychology, Applied) | Pub Date: 2026-01-28 | DOI: 10.1111/jedm.70031
Yunting Liu, Yijun Xiang, Xutao Feng, Mark Wilson

Unbiasedness of proficiency estimates is important for autoscoring engines, since the outcome might be used for future learning or placement. Imbalanced training data may lead to certain biases and lower the prediction accuracy for classification algorithms. In this article, we investigated several data augmentation methods to lower the negative effect of imbalanced data in measurement settings. Four approaches were examined: (1) resampling methods, either oversampling or undersampling; (2) active resampling methods, where the resampling weight is based on representativeness in the training set; (3) data expansion methods using synonym replacement, slightly changing the meaning or semantics of the original answers; and (4) a content recreation method using generative AI (e.g., ChatGPT) to create responses for less populated score levels. We compared performance (e.g., accuracy, QWK, F1) as well as a distance metric for different combinations of the methods. Two datasets with different imbalanced distributions were used. Results show that all four methods can help to mitigate the bias issue, and the efficacy was influenced by the imbalance level, the representativeness of the original data, and the level of increment in the variety of the responses (i.e., lexical diversity). In general, resampling and GenAI with active resampling showed the best overall performance.
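As a concrete illustration of approach (1), the following minimal sketch performs naive random oversampling of minority score categories. The data and helper names are hypothetical, and the authors' actual pipeline is more elaborate.

```python
# Minimal sketch of random oversampling for imbalanced score categories
# (assumption: responses is a list of (text, score) pairs).
import random
from collections import defaultdict

def oversample(responses, seed=0):
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for text, score in responses:
        by_score[score].append((text, score))
    target = max(len(v) for v in by_score.values())  # match the majority class
    balanced = []
    for score, items in by_score.items():
        balanced.extend(items)
        # resample with replacement until this score level reaches the target
        balanced.extend(rng.choices(items, k=target - len(items)))
    rng.shuffle(balanced)
    return balanced

data = [("good answer", 2)] * 50 + [("partial", 1)] * 10 + [("off-topic", 0)] * 5
print({s: sum(1 for _, t in oversample(data) if t == s) for s in (0, 1, 2)})
```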

Citations: 0
Generalizability Theory for Randomly Parallel Testing
IF 1.6 | CAS Q4 (Psychology) | JCR Q3 (Psychology, Applied) | Pub Date: 2026-01-21 | DOI: 10.1111/jedm.70029
Won-Chan Lee, Stella Y. Kim, Seungwon Shin

Advancements in artificial intelligence (AI) have brought significant changes to testing practices, including the emergence of randomly parallel testing (RPT), in which examinees receive different but psychometrically similar sets of items generated from templates or AI-based systems. This paper presents a generalizability theory (GT) framework for estimating conditional standard errors of measurement (CSEMs) and related reliability indices, with a particular focus on design structures commonly encountered in RPT within domain-referenced testing contexts. The proposed framework supports the evaluation of score precision across a variety of operational designs, including crossed, nested, and multivariate configurations. Several illustrative examples are provided to demonstrate the methodology in practical settings. The paper also addresses key psychometric and interpretive challenges associated with RPT and outlines promising directions for future research.
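For context, the sketch below computes one standard GT quantity: a Brennan-style conditional absolute SEM for a simple crossed p × I design. The RPT designs treated in the paper (e.g., items nested within persons, multivariate configurations) generalize beyond this simple case.

```python
# Minimal sketch: conditional absolute SEM for one person in a crossed
# p x I design, CSEM_p = sqrt( sum_i (X_pi - mean_p)^2 / (n * (n - 1)) ).
# For dichotomous items this reduces to sqrt( p_hat * (1 - p_hat) / (n - 1) ).
import numpy as np

def csem_absolute(item_scores) -> float:
    """Conditional absolute SEM for one person's vector of n item scores."""
    x = np.asarray(item_scores, dtype=float)
    n = x.size
    return float(np.sqrt(np.sum((x - x.mean()) ** 2) / (n * (n - 1))))

# Dichotomous example: 40 items, 30 correct -> proportion score 0.75
scores = np.array([1] * 30 + [0] * 10)
print(round(csem_absolute(scores), 4))  # sqrt(0.75 * 0.25 / 39) ~= 0.0693
```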

Citations: 0
Simultaneous Detection of Compromised Items and Examinees with Item Preknowledge in Online Assessments Using Response Time Data
IF 1.6 | CAS Q4 (Psychology) | JCR Q3 (Psychology, Applied) | Pub Date: 2026-01-21 | DOI: 10.1111/jedm.70030
Cengiz Zopluoglu

The rapid transition from traditional paper-and-pencil tests to computer-based testing systems has significantly altered the educational landscape, particularly during the COVID-19 pandemic. While online assessments offer numerous advantages, they also present unique challenges, with test security being paramount. This article addresses the critical issue of test fraud in digital assessments, specifically focusing on item preknowledge, where examinees have prior access to test items. Using response-time data, we propose a statistical framework for simultaneously identifying compromised items and examinees with item preknowledge in a single-step analysis. Unlike existing methods, our model does not require prior knowledge about the compromised status of items. Using a large-scale online certification exam dataset, we demonstrate the model's application in detecting significant signals in response times, identifying potentially compromised items, and examinees with potential item preknowledge.
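To make the idea of response-time signals concrete, here is a minimal sketch that flags unusually fast responses under a van der Linden-style lognormal response-time model. Note the assumption that item and person parameters are already estimated; the paper's contribution is precisely to identify compromised items and examinees jointly, without such prior knowledge.

```python
# Minimal sketch: flagging suspiciously fast responses under a lognormal RT
# model, log T_ij ~ Normal(beta_i - tau_j, 1 / alpha_i^2).
import numpy as np

def rt_residuals(log_times, beta, tau, alpha):
    """Standardized log-RT residuals; strongly negative => suspiciously fast."""
    expected = beta[None, :] - tau[:, None]   # shape (n_persons, n_items)
    return alpha[None, :] * (log_times - expected)

rng = np.random.default_rng(1)
n_persons, n_items = 100, 20
beta = rng.normal(4.0, 0.3, n_items)       # item time intensities
alpha = rng.uniform(1.5, 2.5, n_items)     # item time discriminations
tau = rng.normal(0.0, 0.3, n_persons)      # person speed
log_t = rng.normal(beta[None, :] - tau[:, None], 1.0 / alpha[None, :])
log_t[0, :5] -= 1.5                        # person 0: 5 suspiciously fast items
z = rt_residuals(log_t, beta, tau, alpha)
print(np.where(z.mean(axis=1) < -0.5)[0])  # crude person-level flag
```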

Citations: 0
Improving Ability Estimation Accuracy for Automated Item Generated Forms under Multistage Testing
IF 1.6 | CAS Q4 (Psychology) | JCR Q3 (Psychology, Applied) | Pub Date: 2025-12-22 | DOI: 10.1111/jedm.70027
Stella Y. Kim, Won-Chan Lee

The emergence of automated item generation (AIG) techniques has intensified discussions around their application in assessment development. Some testing companies have already begun developing software to construct exams using AIG. However, the current literature offers limited insights into the characteristics of items generated through AIG, particularly in the realm of multistage testing (MST). This study proposes a novel approach for adjusting template item parameters to enhance ability estimation accuracy under the MST context. A simulation study was conducted using two MST designs with varying numbers of stages and modules. Results demonstrated that the proposed method significantly improved the accuracy of person parameter estimates compared to a more practical, yet less precise, approach that assumes all item clones share identical parameters.
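The sketch below illustrates the underlying problem rather than the authors' adjustment method: when clones generated from a template vary around the template difficulty but scoring treats every clone as identical to the template, ability estimates lose accuracy. All parameter values are arbitrary.

```python
# Minimal sketch of the clone-variation problem (not the paper's adjustment):
# simulate responses to clones whose difficulties deviate from the template,
# then score with the template values and check bias/RMSE of theta-hat.
import numpy as np

rng = np.random.default_rng(7)

def rasch_mle(u, b, iters=30):
    """Newton-Raphson MLE of ability theta under the Rasch model."""
    theta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        theta += np.sum(u - p) / np.sum(p * (1 - p))
    return theta

n_items, n_reps, true_theta = 40, 2000, 0.5
b_template = rng.normal(0.0, 1.0, n_items)                # template difficulties
est = np.empty(n_reps)
for r in range(n_reps):
    b_clone = b_template + rng.normal(0.0, 0.4, n_items)  # clone variation
    p_true = 1.0 / (1.0 + np.exp(-(true_theta - b_clone)))
    u = (rng.random(n_items) < p_true).astype(float)
    est[r] = rasch_mle(u, b_template)                     # scored at template
print("bias:", round(est.mean() - true_theta, 3),
      "RMSE:", round(float(np.sqrt(((est - true_theta) ** 2).mean())), 3))
```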

Citations: 0
Parameter Estimation in Comparative Judgment Under Random and Adaptive Scheduling Schemes
IF 1.6 | CAS Q4 (Psychology) | JCR Q3 (Psychology, Applied) | Pub Date: 2025-12-22 | DOI: 10.1111/jedm.70022
Ian Hamilton, Nick Tawn

Comparative judgment is an assessment method where item ratings are estimated based on rankings of subsets of the items. These rankings are typically pairwise, with ratings taken to be the estimated parameters from fitting a Bradley-Terry model. Likelihood penalization is often employed to ensure finiteness of estimates. Adaptive scheduling of the comparisons can increase the efficiency of the assessment. We show that the most commonly used penalty in Comparative Judgment is not the best-performing penalty under adaptive scheduling and can lead to substantial bias in parameter estimation. We demonstrate this using simulated and real data and provide a theoretical explanation for the relative performance of the penalties considered, including identifying a preferred alternative. Further, we propose a novel approach based on a parametric bootstrap. It is found to produce better parameter estimates for adaptive schedules and to be robust to variations in underlying strength distributions. The work allows for more efficient implementations of comparative judgment.
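As background, this minimal sketch fits a Bradley-Terry model to pairwise judgments by penalized maximum likelihood, using a small ridge penalty to keep estimates finite. The ridge is one illustrative choice of penalty; the paper compares the specific penalties used in comparative judgment practice and proposes a parametric-bootstrap alternative.

```python
# Minimal sketch: penalized Bradley-Terry fit by gradient ascent.
# Model: P(i beats j) = exp(lam_i) / (exp(lam_i) + exp(lam_j)).
import numpy as np

def fit_bradley_terry(pairs, n_items, ridge=0.1, lr=0.1, iters=2000):
    """pairs: list of (winner, loser) index pairs. Returns logit strengths.
    The ridge term keeps estimates finite even for unbeaten items."""
    lam = np.zeros(n_items)
    for _ in range(iters):
        grad = -2.0 * ridge * lam                        # penalty gradient
        for w, l in pairs:
            p_w = 1.0 / (1.0 + np.exp(lam[l] - lam[w]))  # P(w beats l)
            grad[w] += 1.0 - p_w
            grad[l] -= 1.0 - p_w
        lam += lr * grad
    return lam

# Toy data: item 2 > item 1 > item 0 in most judgments; item 2 never loses
# to item 0, which without the penalty would push its estimate to infinity.
pairs = [(1, 0)] * 8 + [(0, 1)] * 2 + [(2, 1)] * 9 + [(1, 2)] * 1 + [(2, 0)] * 10
print(np.round(fit_bradley_terry(pairs, 3), 2))
```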

Citations: 0
A Group Fit Statistic for the Multilevel Item Response Model
IF 1.6 | CAS Q4 (Psychology) | JCR Q3 (Psychology, Applied) | Pub Date: 2025-12-14 | DOI: 10.1111/jedm.70024
Yishan Ding, Ji Seung Yang, Youngjin Han

Aberrant behaviors among test-takers in large-scale assessments are often more prevalent within specific groups or testing sites. While various techniques have been developed to detect individual-level test-takers' aberrant behaviors, research in detecting those behaviors at the group level is rare. We propose a group fit statistic $l_{z2}$ by extending the $l_z$ statistic to a multilevel item response model. This new statistic demonstrates adequate power and effectively controls the Type I error rate, particularly when true latent variable values are used or when group sizes are large, such as 500. When latent variable estimates are employed, an adjustment to $l_{z2}$ based on the posterior predictive checking approach can offer improved control over the Type I error rate.
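For reference, the classical person-level $l_z$ statistic that the paper generalizes can be computed as below; the group-level $l_{z2}$ under a multilevel IRT model is the paper's contribution and is not reproduced here.

```python
# Minimal sketch: the classical standardized person-fit statistic l_z,
# l_z = (l0 - E[l0]) / sqrt(Var[l0]), for responses u and model probs p.
import numpy as np

def lz(u, p):
    """Standardized log-likelihood person-fit statistic (Drasgow et al.)."""
    u, p = np.asarray(u, float), np.asarray(p, float)
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    e = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    v = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - e) / np.sqrt(v)

rng = np.random.default_rng(3)
p = 1 / (1 + np.exp(-(0.0 - rng.normal(0, 1, 30))))   # Rasch probs, theta = 0
u_ok = (rng.random(30) < p).astype(float)             # model-consistent person
u_odd = 1.0 - u_ok                                    # wildly misfitting person
print(round(lz(u_ok, p), 2), round(lz(u_odd, p), 2))  # misfit => large negative
```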

Citations: 0