
Journal of Educational Measurement: Latest Publications

From Item Estimates to Test Operations: The Cascading Effect of Rapid Guessing
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-09-28 | DOI: 10.1111/jedm.70010
Sarah Alahmadi, Christine E. DeMars

Inadequate test-taking effort poses a significant challenge, particularly when low-stakes test results inform high-stakes policy and psychometric decisions. We examined how rapid guessing (RG), a common form of low test-taking effort, biases item parameter estimates, particularly the discrimination and difficulty parameters. Previous research reported conflicting findings on the direction of bias and what contributes to it. Using simulated data that replicate real-world, low-stakes testing conditions, this study reconciles the inconsistencies by identifying the conditions under which item parameters are over- or underestimated. Bias is influenced by item-related factors (true parameter values and the number of RG responses the items receive) and examinee-related factors (proficiency differences between rapid guessers and non-rapid guessers, the variability in RG behavior among rapid guessers, and the pattern of RG responses throughout the test). The findings highlight that ignoring RG not only distorts proficiency estimates but may also impact broader test operations, including adaptive testing, equating, and standard setting. By demonstrating the potential far-reaching effects of RG, we underline the need for testing professionals to implement methods that mitigate RG's impact (such as motivation filtering) to protect the integrity of their psychometric work.
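
The attenuation mechanism is easy to see in a toy simulation. The sketch below is illustrative only (it is not the authors' simulation design): it generates 2PL responses, overwrites the last five items for 20% of examinees with chance-level rapid guesses, and compares classical proxies for difficulty (proportion correct) and discrimination (corrected item-total correlation) before and after contamination. On the affected items the proportion correct typically drifts toward the chance rate and the discrimination proxy drops, the pattern the article traces into IRT parameter estimates.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 2000, 20
theta = rng.normal(0, 1, n)
a = rng.uniform(0.8, 2.0, k)                      # true discriminations
b = rng.normal(0, 1, k)                           # true difficulties

p = 1 / (1 + np.exp(-a * (theta[:, None] - b)))   # 2PL response probabilities
x = (rng.random((n, k)) < p).astype(int)          # effortful responses

# Contaminate: 20% of examinees rapid-guess the last 5 items at chance (.25)
rg = rng.random(n) < 0.20
x_rg = x.copy()
x_rg[np.ix_(rg, np.arange(15, 20))] = (rng.random((rg.sum(), 5)) < 0.25).astype(int)

def item_total_corr(resp):
    total = resp.sum(axis=1)
    return np.array([np.corrcoef(resp[:, j], total - resp[:, j])[0, 1]
                     for j in range(resp.shape[1])])

print("p-value shift:", (x_rg[:, 15:].mean(0) - x[:, 15:].mean(0)).round(3))
print("discrimination-proxy shift:",
      (item_total_corr(x_rg) - item_total_corr(x))[15:].round(3))
```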

Citations: 0
Special Issue: Adaptive Testing in Large-Scale Assessments
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-09-25 | DOI: 10.1111/jedm.70009
Peter van Rijn, Francesco Avvisati
Citations: 0
The Precision and Bias of Cut Score Estimates from the Beuk Standard Setting Method
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-09-09 | DOI: 10.1111/jedm.70007
Joseph H. Grochowalski, Lei Wan, Lauren Molin, Amy H. Hendrickson

The Beuk standard setting method derives cut scores through expert judgment that balances content and normative perspectives. This study developed a method for estimating confidence intervals for Beuk cut scores and assessed their accuracy via simulation. The simulations varied subject-matter expert (SME) panel size, expert agreement, cut score location, score distribution, and decision alignment. Panels of 20 or more participants provided precise and accurate cut score estimates when experts agreed strongly; larger panels did not improve precision significantly. Cut score location influenced confidence interval widths, highlighting its importance in planning. Real data showed that SME disagreement increased the bias and variance of Beuk estimates. Beuk cut scores should therefore be used cautiously with small panels, flat score distributions, or substantial expert disagreement.
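
For readers unfamiliar with the method: each SME supplies a recommended cut score and an expected pass rate, and the Beuk compromise reconciles these judgments with the observed score distribution. The sketch below uses one common operationalization (intersecting the empirical pass-rate curve with a line through the mean judgments whose slope is the ratio of the two standard deviations) and a panelist-level bootstrap for the interval; the data and the interval construction are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(11)
scores = rng.binomial(60, 0.62, size=5000)            # hypothetical test scores
cuts  = np.array([34, 36, 38, 35, 37, 40, 33, 36])    # SME cut recommendations
rates = np.array([.70, .62, .55, .68, .60, .50, .72, .63])  # SME expected pass rates

def beuk_cut(scores, cuts, rates):
    """Intersect the empirical pass-rate curve with the line through
    (mean cut, mean rate) whose slope is -sd(rates)/sd(cuts)."""
    grid = np.arange(scores.min(), scores.max() + 1)
    curve = np.array([(scores >= c).mean() for c in grid])
    line = rates.mean() - (rates.std(ddof=1) / cuts.std(ddof=1)) * (grid - cuts.mean())
    return grid[np.argmin(np.abs(curve - line))]

# Panelist-level bootstrap interval (an assumed construction, for illustration)
boot = []
for _ in range(2000):
    i = rng.integers(0, len(cuts), len(cuts))
    boot.append(beuk_cut(scores, cuts[i], rates[i]))
print(beuk_cut(scores, cuts, rates), np.percentile(boot, [2.5, 97.5]))
```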

Citations: 0
Simultaneous Detection of Cheaters and Compromised Items Using a Biclustering Approach
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-09-08 | DOI: 10.1111/jedm.70004
Hyeryung Lee, Walter P. Vispoel

Traditional methods for detecting cheating on assessments tend to focus on identifying either cheaters or compromised items in isolation, overlooking their interconnection. In this study, we present a novel biclustering approach that detects both simultaneously by identifying coherent subgroups of examinees and items exhibiting suspicious response patterns. To identify these patterns, our method leverages response accuracy, response time, and distractor choice data. We evaluated the approach on real datasets and compared its performance with existing detection approaches. Additionally, a comprehensive simulation study was conducted, modeling a variety of realistic cheating scenarios such as answer copying, pre-knowledge of test items, and distinct forms of rapid guessing. Our findings revealed that the biclustering method outperformed previous methods in simultaneously distinguishing cheating and non-cheating behaviors within the empirical study. The simulation analyses further revealed the conditions under which the biclustering approach was most effective in both regards. Overall, the findings underscore the flexibility of biclustering and its adaptability in enhancing test security within diverse testing environments.
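
The general idea can be illustrated with off-the-shelf co-clustering, though the authors' algorithm and its fusion of accuracy, response-time, and distractor signals are more elaborate. The sketch below builds a binary suspicion matrix (responses that are both fast and correct), plants a cheater-by-compromised-item block, and recovers it with scikit-learn's SpectralCoclustering; the data, fusion rule, and settings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(3)
n, k = 200, 30
fast = rng.random((n, k)) < 0.05           # unusually fast responses
correct = rng.random((n, k)) < 0.55        # response accuracy

# Plant a bicluster: 15 cheaters answer 6 compromised items fast and correctly
cheaters, leaked = np.arange(15), np.arange(6)
fast[np.ix_(cheaters, leaked)] = True
correct[np.ix_(cheaters, leaked)] = True

# One simple fusion of the signals: flag responses that are fast AND correct;
# the small offset keeps every row/column sum positive for the spectral step
s = (fast & correct).astype(float) + 1e-6

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(s)
flagged = model.row_labels_[0]             # cluster containing a known cheater
print("flagged examinees:", np.where(model.row_labels_ == flagged)[0])
print("flagged items:", np.where(model.column_labels_ == flagged)[0])
```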

Citations: 0
Classification Consistency and Accuracy Indices for Simple Structure MIRT Model
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-09-04 | DOI: 10.1111/jedm.70006
Huan Liu, Won-Chan Lee

This study investigates the estimation of classification consistency and accuracy indices for composite summed scores and theta scores within the simple-structure multidimensional IRT (SS-MIRT) framework, using five popular approaches: the Lee, Rudner, Guo, Bayesian EAP, and Bayesian MCMC approaches. The procedures are illustrated through analysis of two real datasets and further evaluated via a simulation study under various conditions. Overall, results indicated that all five approaches performed well, producing classification index estimates that were highly consistent in both magnitude and pattern. However, the results also indicated that factors such as the ability estimator, score metric, and cut score location can significantly influence estimation outcomes. Consequently, these considerations should guide practitioners in selecting the most appropriate estimation approach for their specific assessment context.
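
As one concrete example of the five approaches, a Rudner-style index can be computed directly from theta estimates and their standard errors under a normal approximation. The sketch below is a generic single-score illustration, not the SS-MIRT composite-score machinery of the article: category probabilities come from normal CDF differences at the cut points, accuracy is the mean probability assigned to each examinee's observed category, and consistency is the mean probability that two parallel classifications agree.

```python
import numpy as np
from scipy.stats import norm

def rudner_indices(theta_hat, se, cuts):
    """Rudner-style normal approximation: P(true theta in category k) given
    each examinee's theta_hat and conditional standard error."""
    edges = np.concatenate(([-np.inf], cuts, [np.inf]))
    z = (edges[None, :] - theta_hat[:, None]) / se[:, None]
    p = np.diff(norm.cdf(z), axis=1)                  # (n, n_categories)
    observed = np.searchsorted(cuts, theta_hat)       # category of the estimate
    accuracy = p[np.arange(len(theta_hat)), observed].mean()
    consistency = (p ** 2).sum(axis=1).mean()         # two parallel forms agree
    return accuracy, consistency

theta_hat = np.random.default_rng(5).normal(size=1000)
se = np.full(1000, 0.3)
print(rudner_indices(theta_hat, se, cuts=np.array([-0.5, 0.8])))
```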

Citations: 0
Multiple Sets of Initial Values Method for MLE-EM and Its Variants in Cognitive Diagnosis Models
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-09-01 | DOI: 10.1111/jedm.70005
Yue Zhao, Yuerong Wu, Yanlou Liu, Tao Xin, Yiming Wang

Cognitive diagnosis models (CDMs) are widely used to assess individuals' latent characteristics, offering detailed diagnostic insights for tailored instructional development. Maximum likelihood estimation using the expectation-maximization algorithm (MLE-EM), or its variants such as the EM algorithm with monotonic constraints and Bayes modal estimation, typically uses a single set of initial values (SIV). The MLE-EM method is sensitive to initial values, especially when dealing with non-convex likelihood functions. This sensitivity implies that different initial values may converge to different local maximum likelihood solutions, and SIV does not guarantee a satisfactory local optimum. We therefore introduce the multiple sets of initial values (MIV) method to reduce sensitivity to the choice of initial values. We compared MIV and SIV in terms of convergence, log-likelihood values of the converged solutions, parameter recovery, and computation time under varying conditions of item quality, sample size, attribute correlation, number of initial sets, and convergence settings. The results showed that MIV outperformed SIV in terms of convergence, and applying the MIV method increased the probability of obtaining solutions with higher log-likelihood values. We also discuss in detail the small-sample conditions under which MIV performed worse than SIV.
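
The MIV idea itself is a generic pattern: run EM from several random starts and keep the solution with the highest log-likelihood. The sketch below applies it to a toy two-class Bernoulli mixture standing in for a one-attribute CDM; the fitter, data, and settings are illustrative assumptions, not the estimation code evaluated in the study.

```python
import numpy as np

def toy_mixture_em(x, rng, iters=200):
    """Toy two-class Bernoulli-mixture EM, a stand-in for a one-attribute CDM."""
    n, k = x.shape
    pi = rng.uniform(0.3, 0.7)                   # random initial class proportion
    p = rng.uniform(0.2, 0.8, (2, k))            # random initial item parameters
    for _ in range(iters):
        l0 = np.exp(x @ np.log(p[0]) + (1 - x) @ np.log(1 - p[0])) * (1 - pi)
        l1 = np.exp(x @ np.log(p[1]) + (1 - x) @ np.log(1 - p[1])) * pi
        w = l1 / (l0 + l1)                       # E-step: posterior class weights
        pi = w.mean()                            # M-step updates
        p[1] = (w[:, None] * x).sum(0) / w.sum()
        p[0] = ((1 - w)[:, None] * x).sum(0) / (1 - w).sum()
        p = p.clip(1e-4, 1 - 1e-4)
    return (pi, p), np.log(l0 + l1).sum()

def fit_best_of_m(x, em_fit, m=20, seed=0):
    """MIV pattern: m random starts, keep the highest log-likelihood solution."""
    rng = np.random.default_rng(seed)
    fits = [em_fit(x, rng) for _ in range(m)]    # each call draws new initials
    return max(fits, key=lambda f: f[1])

cls = np.random.default_rng(1).random(500) < 0.5
probs = np.where(cls[:, None], 0.8, 0.3)
x = (np.random.default_rng(2).random((500, 10)) < probs).astype(float)
(pi_hat, p_hat), ll = fit_best_of_m(x, toy_mixture_em, m=10)
print(round(float(pi_hat), 2), round(float(ll), 1))
```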

Citations: 0
Comparing Data-Driven Methods for Removing Options in Assessment Items
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-09-01 | DOI: 10.1111/jedm.70003
William Muntean, Joe Betts, Zhuoran Wang, Hao Jia

Test items with problematic options often require revision to improve their psychometric properties. When an option is identified as ambiguous or nonfunctioning, the traditional approach involves removing the option and conducting another field test to gather new response data—a process that, while effective, is resource-intensive. This study compares two methods for handling option removal: the Retesting method (administering modified items to new examinees) versus the Recalculating method (computationally removing options from existing response data). Through a controlled experiment with multiple-response and matrix-format items, we examined whether these methods produce equivalent item characteristics. Results show striking similarities between methods across multiple psychometric item properties. These findings suggest that the Recalculating method may offer an efficient alternative for items with sufficient option choices. We discuss implementation considerations and present our experimental design and analytical approach as a framework that other testing programs can adapt to evaluate whether the Recalculating method is appropriate for their specific contexts.
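
For a multiple-response item, the Recalculating method amounts to dropping the option from both the key and every recorded selection, then rescoring the stored responses. The sketch below shows that pattern under an exact-match scoring rule, which is an assumption for illustration; partial-credit rules would follow the same recipe.

```python
import numpy as np

def rescore_without_option(selections, key, drop):
    """Recalculating sketch: delete option `drop` from the key and from every
    stored selection, then rescore. Exact-match scoring is an assumption;
    partial-credit rules would follow the same pattern."""
    new_key = key - {drop}
    return np.array([int((s - {drop}) == new_key) for s in selections])

selections = [{"A", "C"}, {"A", "C", "E"}, {"A"}, {"C", "E"}]
key = {"A", "C"}
print(rescore_without_option(selections, key, drop="E"))   # -> [1 1 0 0]
```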

Citations: 0
Automatic Prompt Engineering for Automatic Scoring
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-08-17 | DOI: 10.1111/jedm.70002
Mingfeng Xue, Yunting Liu, Xingyao Xiao, Mark Wilson

Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials. APE was then applied to optimize prompts for each item. We found that on average, APE increased scoring accuracy by 9%; few-shot learning (i.e., giving multiple labeled examples related to the goal) increased APE performance by 2%; a high temperature (i.e., a parameter for output randomness) was needed in at least part of the APE to improve the scoring accuracy; Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Moreover, compared with the manual scoring instructions, APE tended to restate and reformat the scoring prompts, which could give rise to concerns about validity. Thus, the creative variability introduced by LLMs raises considerations about the balance between innovation and adherence to scoring rubrics.
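
A minimal APE loop can be sketched as: sample candidate rewrites of the scoring prompt at high temperature, score a labeled response set with each candidate at low temperature, and keep the candidate that agrees best with human scores. In the sketch below, call_llm is a hypothetical stub standing in for whatever LLM client is used, and the prompt templates are assumptions; only the QWK evaluation via scikit-learn's cohen_kappa_score is a real API.

```python
from sklearn.metrics import cohen_kappa_score

def call_llm(prompt: str, temperature: float) -> str:
    """Hypothetical LLM client -- replace with a real provider call."""
    raise NotImplementedError

def llm_scores(prompt, responses):
    # Hypothetical convention: the model returns a single integer score
    return [int(call_llm(f"{prompt}\n\nResponse: {r}\nScore:", temperature=0.0))
            for r in responses]

def ape_round(seed_prompt, responses, human_scores, n_candidates=8):
    """One APE iteration: rewrite the prompt at high temperature, keep the
    candidate that agrees best with human scores (quadratic weighted kappa)."""
    ask = f"Rewrite this scoring prompt to be clearer; keep the rubric:\n{seed_prompt}"
    candidates = [seed_prompt] + [call_llm(ask, temperature=1.0)
                                  for _ in range(n_candidates)]
    return max(candidates,
               key=lambda c: cohen_kappa_score(human_scores, llm_scores(c, responses),
                                               weights="quadratic"))
```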

Citations: 0
How Many Plausible Values?
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-07-03 | DOI: 10.1111/jedm.70000
Paul A. Jewsbury, Daniel F. McCaffrey, Yue Jia, Eugenio J. Gonzalez

Large-scale survey assessments (LSAs) such as NAEP, TIMSS, PIRLS, IELS, and NAPLAN produce plausible values of student proficiency for estimating population statistics. Plausible values are imputed values for latent proficiency variables. While prominently used in LSAs, they are applicable to a wide range of latent variable modelling contexts, such as surveys about psychological dispositions or beliefs. Following the practice of multiple imputation, LSAs produce multiple sets of plausible values for each survey. The criteria used to determine the number of plausible values remain unresolved and are applied inconsistently in practice. We show analytically and via simulation that the number of plausible values determines the amount of Monte Carlo error in point estimates and standard errors as a function of the fraction of missing information. We derive expressions for the number of plausible values required to reach a given level of precision. We analyze real data from an LSA to provide guidelines on the number of plausible values supported by theory, simulation, and real data. Finally, we illustrate the impact with a power analysis. Our results show that there is meaningful benefit to using more plausible values than LSAs currently generate.
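
The Monte Carlo argument rests on Rubin's combining rules: with M plausible values, the point estimate is the mean of the per-PV estimates, and its extra variability shrinks roughly as the fraction of missing information (FMI) divided by M, giving relative efficiency 1/(1 + FMI/M). The sketch below implements the standard combination and back-solves the M needed for a target efficiency; the FMI value of 0.4 is an assumed illustration.

```python
import numpy as np

def rubin_combine(est, var):
    """Rubin's rules: est and var hold one estimate and its sampling variance
    per plausible value."""
    m = len(est)
    qbar = est.mean()                    # combined point estimate
    u = var.mean()                       # within-imputation variance
    b = est.var(ddof=1)                  # between-imputation variance
    t = u + (1 + 1 / m) * b              # total variance
    fmi = (1 + 1 / m) * b / t            # approximate fraction of missing info
    return qbar, t, fmi

# PVs needed so the point estimate reaches 99% relative efficiency,
# RE = 1 / (1 + FMI/m), under an assumed FMI of 0.4:
fmi = 0.4
print(round(fmi / (1 / 0.99 - 1)))       # ~40 plausible values
```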

Citations: 0
Parametric Bootstrap Mantel-Haenszel Statistic for Aggregated Testlet Effects
IF 1.6 | CAS Q4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-06-17 | DOI: 10.1111/jedm.12440
Youn Seon Lim

While testlets have proven useful for assessing complex skills, the stem shared by multiple items often induces correlations between responses, leading to violations of local independence (LI), which can result in biased parameter and ability estimates. Diagnostic procedures for detecting testlet effects typically involve model comparisons testing for the inclusion of extra testlet parameters or, at the item level, testing for pairwise LI. Rosenbaum's adaptation of the Mantel-Haenszel (MH) χ²-statistic belongs to the latter category. The MH χ²-statistic has also been used in cognitive diagnosis for detecting violations of LI and for the identification of testlet effects. However, this approach is not without limitations, as it lacks a rationale for integrating multiple pairwise MH χ²-statistics and any notion of the sampling distribution of such an integrated statistic. In this article, a procedure for integrating multiple pairwise MH χ²-statistics to evaluate testlet effects in cognitive diagnosis is proposed. The unknown sampling distribution issue is addressed by implementing a parametric bootstrap resampling scheme. Results from simulation studies demonstrate the performance of the proposed parametric bootstrap testlet MH χ²-statistic, and its application to the 2015 PISA Collaborative Problem Solving (CPS) data set illustrates the method's practical merits.
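
The parametric bootstrap scheme itself is generic and can be sketched independently of the MH details: fit the null (locally independent) model, simulate datasets from the fitted parameters, recompute the aggregated statistic on each, and read the p-value off the simulated distribution. In the sketch below, fit, simulate, and statistic are hypothetical callables that the article's aggregated testlet MH χ²-statistic would plug into.

```python
import numpy as np

def parametric_bootstrap_p(data, fit, simulate, statistic, b=500, seed=0):
    """Generic parametric bootstrap: fit, simulate, and statistic are
    hypothetical callables (fit the LI null model, draw data from it,
    and compute the aggregated testlet MH chi-square)."""
    rng = np.random.default_rng(seed)
    observed = statistic(data)
    params = fit(data)
    null_dist = np.array([statistic(simulate(params, rng)) for _ in range(b)])
    return observed, float((null_dist >= observed).mean())
```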

Citations: 0