
Journal of Educational Measurement: Latest Publications

Automatic Prompt Engineering for Automatic Scoring
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-08-17 | DOI: 10.1111/jedm.70002
Mingfeng Xue, Yunting Liu, Xingyao Xiao, Mark Wilson

Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials. APE was then applied to optimize prompts for each item. We found that on average, APE increased scoring accuracy by 9%; few-shot learning (i.e., giving multiple labeled examples related to the goal) increased APE performance by 2%; a high temperature (i.e., a parameter for output randomness) was needed in at least part of the APE to improve the scoring accuracy; Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Moreover, compared with the manual scoring instructions, APE tended to restate and reformat the scoring prompts, which could give rise to concerns about validity. Thus, the creative variability introduced by LLMs raises considerations about the balance between innovation and adherence to scoring rubrics.
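As a minimal, hedged sketch of the evaluation step described above (not the authors' APE pipeline), the snippet below scores LLM output against human labels with exact-agreement accuracy and QWK and shows a simple prompt-selection loop; `score_with_prompt` and all score values are illustrative placeholders, not a real API.

```python
# Minimal sketch: evaluate LLM-assigned scores against human ratings with
# exact-agreement accuracy and Quadratic Weighted Kappa (QWK), the two metrics
# discussed above. The score vectors are illustrative.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_scores = [0, 1, 2, 2, 3, 1, 0, 2, 3, 1]   # human ratings (treated as true labels)
llm_scores   = [0, 1, 2, 1, 3, 1, 0, 2, 2, 1]   # LLM ratings under a candidate prompt

print("accuracy =", accuracy_score(human_scores, llm_scores))
print("QWK      =", cohen_kappa_score(human_scores, llm_scores, weights="quadratic"))

# Hypothetical APE-style selection step: keep the candidate prompt whose scores
# agree best with the human labels on a development set. `score_with_prompt`
# stands in for an LLM call and is not a real API.
def select_best_prompt(candidate_prompts, responses, human_scores, score_with_prompt):
    best_prompt, best_qwk = None, -1.0
    for prompt in candidate_prompts:
        preds = [score_with_prompt(prompt, response) for response in responses]
        qwk = cohen_kappa_score(human_scores, preds, weights="quadratic")
        if qwk > best_qwk:
            best_prompt, best_qwk = prompt, qwk
    return best_prompt, best_qwk
```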

Citations: 0
A Topic Testlet Model for Calibrating Testlet Constructed Responses
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-08-07 | DOI: 10.1111/jedm.70001
Jiawei Xiong, Huan (Hailey) Kuang, Cheng Tang, Qidi Liu, Bowen Wang, George Engelhard Jr., Allan S. Cohen, Xinhui (Maggie) Xiong, Rufei Sheng

Constructed responses (CRs) within testlets are widely used to assess complex skills but can pose calibration challenges due to local item dependence. A few current testlet models incorporate testlet-specific effects to address local dependence but struggle with interpreting these effects and may not fully capture the complexities of CR items because they rely only on response or score patterns. We propose a Topic Testlet Model (TTM) that integrates topic modeling within a psychometric framework. It uses latent topics from student written responses to adjust for local dependence, enable simultaneous calibration, and provide insights into evaluating student reasoning and writing in testlet CR items. Using empirical data from both English Language Arts and Science assessments for grades 3-12, we compare the TTM with existing models in terms of ability estimates, item parameter estimates, and overall model fit. Simulation studies further demonstrate parameter recovery under various testing scenarios. Results show that the TTM effectively accounts for local dependence, improves testlet effect interpretability, and demonstrates a better fit than the existing models. The TTM advances CR testlet calibration, leveraging additional information from student written responses to improve the precision of the assessment systems and the validity of the use of test scores.
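The topic-modeling ingredient of the TTM can be illustrated with a short, hedged sketch: latent-topic proportions are extracted from written responses (here with scikit-learn's LDA on toy responses) and could then feed a testlet-effect adjustment; the joint psychometric calibration the article proposes is not reproduced.

```python
# Minimal sketch of the topic-modeling ingredient only: extract latent-topic
# proportions from written responses with LDA. The toy responses and the
# two-topic setting are illustrative; the TTM's joint IRT calibration is not shown.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

responses = [
    "the plant needs sunlight and water to make its food",
    "energy from the sun is converted during photosynthesis",
    "the character changes her mind after the argument",
    "the author uses dialogue to show the conflict",
]

X = CountVectorizer(stop_words="english").fit_transform(responses)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # per-response topic proportions (rows sum to 1)

print(theta.round(2))          # these proportions could enter a testlet-effect adjustment
```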

Citations: 0
How Many Plausible Values?
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-07-03 | DOI: 10.1111/jedm.70000
Paul A. Jewsbury, Daniel F. McCaffrey, Yue Jia, Eugenio J. Gonzalez

Large-scale survey assessments (LSAs) such as NAEP, TIMSS, PIRLS, IELS, and NAPLAN produce plausible values of student proficiency for estimating population statistics. Plausible values are imputed values for latent proficiency variables. While prominently used for LSAs, they are applicable to a wide range of latent variable modelling contexts such as surveys about psychological dispositions or beliefs. Following the practice of multiple imputation, LSAs produce multiple sets of plausible values for each survey. The criteria used to determine the number of plausible values remain unresolved and are applied inconsistently in practice. We show analytically and via simulation that the number of plausible values used determines the amount of Monte Carlo error in point estimates and standard errors, as a function of the fraction of missing information. We derive expressions to determine the number of plausible values required to reach a given level of precision. We analyze real data from an LSA to provide guidelines, supported by theory, simulation, and real data, on the number of plausible values. Finally, we illustrate the impact with a power analysis. Our results show there is meaningful benefit to using more plausible values than LSAs currently generate.
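The role of the number of plausible-value sets M can be illustrated with the standard multiple-imputation combining rules; the sketch below uses made-up per-set estimates and is not the paper's derivation.

```python
# Minimal sketch: pool a statistic computed on each of M plausible-value sets
# with the usual multiple-imputation combining rules. The per-set estimates and
# sampling variances are illustrative numbers only.
import numpy as np

est = np.array([251.3, 250.8, 251.9, 251.1, 250.6])   # statistic per plausible-value set
var = np.array([1.10, 1.05, 1.12, 1.08, 1.07])        # sampling variance per set
M = len(est)

q_bar = est.mean()                  # pooled point estimate
u_bar = var.mean()                  # within-imputation variance
b = est.var(ddof=1)                 # between-imputation variance
t = u_bar + (1 + 1 / M) * b         # total variance
fmi = (1 + 1 / M) * b / t           # fraction of missing information

print(f"estimate = {q_bar:.2f}, SE = {t ** 0.5:.3f}, FMI = {fmi:.2f}")
# The Monte Carlo noise from using only M sets enters through terms of order
# b / M, so a larger M (especially when the FMI is high) stabilizes both the
# point estimate and its standard error.
```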

Citations: 0
Parametric Bootstrap Mantel-Haenszel Statistic for Aggregated Testlet Effects
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-06-17 | DOI: 10.1111/jedm.12440
Youn Seon Lim

While testlets have proven useful for assessing complex skills, the stem shared by multiple items often induces correlations between responses, leading to violations of local independence (LI), which can result in biased parameter and ability estimates. Diagnostic procedures for detecting testlet effects typically involve model comparisons testing for the inclusion of extra testlet parameters or, at the item level, testing for pairwise LI. Rosenbaum's adaptation of the Mantel-Haenszel (MH) χ²-statistic belongs to the latter category. The MH χ²-statistic has also been used in cognitive diagnosis for detecting violations of LI and for the identification of testlet effects. However, this approach is not without limitations, as it lacks a rationale for integrating multiple pairwise MH χ²-statistics and any notion of the sampling distribution of such an integrated statistic. In this article, a procedure for integrating multiple pairwise MH χ²-statistics to evaluate testlet effects in cognitive diagnosis is proposed. The unknown sampling distribution issue is addressed by implementing a parametric bootstrap resampling scheme. Results from simulation studies demonstrate the performance of the proposed parametric bootstrap testlet MH χ²-statistic, and its application to the 2015 PISA Collaborative Problem Solving (CPS) data set illustrates the method's practical merits.
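A hedged sketch of the ingredients named in the abstract follows: a pairwise MH χ²-statistic computed across strata of a matching variable, its sum over item pairs within a testlet, and a parametric bootstrap for the reference distribution. `simulate_under_fitted_model` is a placeholder for simulating data from the fitted cognitive diagnosis model and is not a real API; the article's exact procedure may differ.

```python
# Minimal sketch: pairwise Mantel-Haenszel chi-square (with continuity
# correction) for two dichotomous items stratified by a matching variable
# (e.g., rest score or estimated attribute profile), aggregated over item
# pairs within a testlet, with a parametric bootstrap reference distribution.
import numpy as np

def mh_chi2(item_i, item_j, strata):
    """Mantel-Haenszel chi-square across strata for two 0/1-scored items."""
    a_sum = e_sum = v_sum = 0.0
    for s in np.unique(strata):
        x = item_i[strata == s]
        y = item_j[strata == s]
        a = np.sum((x == 1) & (y == 1)); b = np.sum((x == 1) & (y == 0))
        c = np.sum((x == 0) & (y == 1)); d = np.sum((x == 0) & (y == 0))
        n = a + b + c + d
        if n < 2:
            continue
        a_sum += a
        e_sum += (a + b) * (a + c) / n
        v_sum += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum if v_sum > 0 else 0.0

def aggregated_mh(responses, testlet_items, strata):
    """Sum of pairwise MH chi-squares over all item pairs within one testlet."""
    total = 0.0
    for idx, i in enumerate(testlet_items):
        for j in testlet_items[idx + 1:]:
            total += mh_chi2(responses[:, i], responses[:, j], strata)
    return total

def bootstrap_p_value(observed, fitted_model, testlet_items, strata,
                      simulate_under_fitted_model, B=500):
    """Parametric bootstrap reference distribution for the aggregated statistic."""
    boot = np.array([
        aggregated_mh(simulate_under_fitted_model(fitted_model), testlet_items, strata)
        for _ in range(B)
    ])
    return np.mean(boot >= observed)
```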

Citations: 0
Linking Error on Achievement Levels Accounting for Dependencies and Complex Sampling
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-06-15 | DOI: 10.1111/jedm.12439
Paul A. Jewsbury

Alternate assessments of the same construct, or assessments that have undergone a change in the conditions of measurement, are often linked in an attempt to establish score comparability. As the link must be estimated from the data, linking contributes error variance to estimators. We propose a novel method to account for linking variance in standard error estimation for achievement or proficiency levels, a primary outcome for many international, national, and U.S. state assessments. Achievement levels are proportions of a population within some range of ability, such as the proportion of the population classified as proficient or advanced. The method is validated in a simulation and with real data. Our method allows for sampling weights and complex sampling and involves an easily calculated correction term that may be added to conventional estimates of the error variance, correcting those estimates for neglecting the variance due to linking. Furthermore, the method accounts for dependencies between linking and other sources of variance, allowing it to be applied to a much wider range of score comparisons than traditional methods of linking variance estimation.
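As a generic, hedged illustration of the bookkeeping only (the paper's actual correction term, its dependency adjustments, and the complex-sampling machinery are not reproduced), the sketch below adds an assumed linking-variance component to a conventional variance estimate for a weighted achievement-level proportion; all numbers and the simple weighted estimator are illustrative.

```python
# Generic illustration: add a linking-variance term to a conventional error
# variance for an achievement-level proportion. All quantities are made up.
import numpy as np

weights = np.array([1.2, 0.8, 1.0, 1.5, 0.9, 1.1])   # sampling weights
at_or_above = np.array([1, 0, 1, 1, 0, 1])           # 1 = at/above the cut score

p_hat = np.average(at_or_above, weights=weights)      # weighted proportion at/above

var_sampling = 0.0007   # conventional (e.g., replicate-based) variance estimate
var_linking = 0.0002    # assumed variance contributed by the linking step

se_conventional = np.sqrt(var_sampling)
se_corrected = np.sqrt(var_sampling + var_linking)    # conventional estimate plus correction

print(f"p = {p_hat:.3f}, SE (no linking term) = {se_conventional:.4f}, "
      f"SE (with linking term) = {se_corrected:.4f}")
```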

Citations: 0
Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System
IF 1.4 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-06-01 | DOI: 10.1111/jedm.12438
Wallace N. Pinto Jr, Jinnie Shin

In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates the use of attribution scores in ASAG systems, focusing on their consistency in reflecting model decisions. Specifically, we examined how attribution scores generated by different methods—namely Local Interpretable Model-agnostic Explanations (LIME), Integrated Gradients (IG), Hierarchical Explanation via Divisive Generation (HEDGE), and Leave-One-Out (LOO)—compare in their consistency and ability to illustrate the decision-making processes of transformer-based scoring systems trained on a publicly available response dataset. Additionally, we analyzed how attribution scores varied across different scoring categories in a polytomously scored response dataset and across two transformer-based scoring model architectures: Bidirectional Encoder Representations from Transformers (BERT) and Decoding-enhanced BERT with Disentangled Attention (DeBERTa-v2). Our findings highlight the challenges in evaluating explainability metrics, with important implications for both high-stakes and formative assessment contexts. This study contributes to the development of more reliable and transparent ASAG systems.
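One simple consistency check in this spirit is to rank-correlate token-level attribution scores from two methods for the same scored response; the sketch below uses illustrative vectors and does not reproduce the study's LIME/IG/HEDGE/LOO protocol or its BERT and DeBERTa-v2 models.

```python
# Minimal sketch: quantify agreement between two attribution methods for one
# response by rank-correlating their token-level attribution scores.
# All token and score values are illustrative.
from scipy.stats import spearmanr

tokens      = ["the", "cell", "membrane", "controls", "what", "enters"]
lime_scores = [0.01, 0.42, 0.55, 0.38, 0.02, 0.20]   # e.g., from a LIME-style method
ig_scores   = [0.03, 0.35, 0.61, 0.30, 0.05, 0.25]   # e.g., from an IG-style method

rho, p_value = spearmanr(lime_scores, ig_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```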

Citations: 0
Comparing and Combining IRTree Models and Anchoring Vignettes in Addressing Response Styles
IF 1.4 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-05-06 | DOI: 10.1111/jedm.12437
Mingfeng Xue, Ping Chen

Response styles pose serious threats to psychological measurement. This research compares IRTree models and anchoring vignettes in addressing response styles and estimating the target traits. It also explores the potential of combining them at the item level and the total-score level (ratios of extreme and middle responses to vignettes). Four models were evaluated: three multidimensional IRTree models that use vignette data to different extents, and a nominal response model (NRM) addressing extreme and midpoint response styles with item-level vignette responses. Simulation results indicated that the IRTree model using item-level vignette responses outperformed the others in estimating the target trait and response styles to varying extents, with performance improving as the number of vignettes increased. Empirical findings further demonstrated that models using item-level vignette information yielded higher reliability and closely aligned target trait estimates. These results underscore the value of integrating anchoring vignettes with IRTree models to enhance estimation accuracy and control for response styles.
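A common three-node IRTree decomposition of a 5-point Likert rating (midpoint, direction, extremity) can be sketched as below; the exact tree and the vignette-based variants used in the article may differ, so this only illustrates how one observed rating becomes several pseudo-items.

```python
# Minimal sketch of a common IRTree decomposition for a 5-point Likert response:
# a midpoint node, a direction node, and an extremity node, with structurally
# missing pseudo-items coded as None.
def irtree_nodes(response):
    """Map a rating in 1..5 to (midpoint, direction, extreme) pseudo-items."""
    midpoint = 1 if response == 3 else 0
    if midpoint:
        return midpoint, None, None             # other nodes are not reached
    direction = 1 if response > 3 else 0        # agree side vs. disagree side
    extreme = 1 if response in (1, 5) else 0    # extreme-response-style node
    return midpoint, direction, extreme

for r in [1, 2, 3, 4, 5]:
    print(r, irtree_nodes(r))
```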

Citations: 0
Validation for Personalized Assessments: A Threats-to-Validity Approach
IF 1.4 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-04-23 | DOI: 10.1111/jedm.12434
Sandip Sinharay, Randy E. Bennett, Michael Kane, Jesse R. Sparks

Personalized assessments are of increasing interest because of their potential to lead to more equitable decisions about the examinees. However, one obstacle to the widespread use of personalized assessments is the lack of a measurement toolkit that can be used to analyze data from these assessments. This article takes one step toward building such a toolkit by proposing a validation framework for personalized assessments. The framework is built on the threats-to-validity approach. We demonstrate applications of the suggested framework using the AP 3D Art and Design Portfolio examination and a more restrictive culturally relevant assessment as examples.

Citations: 0
Addressing Bias in Spoken Language Systems Used in the Development and Implementation of Automated Child Language-Based Assessment
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-04-23 | DOI: 10.1111/jedm.12435
Alison L. Bailey, Alexander Johnson, Natarajan Balaji Shankar, Hariram Veeramani, Julie A. Washington, Abeer Alwan

This article addresses bias in Spoken Language Systems (SLS) that involve both Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) and reports experiments to improve the performance of SLS for automated language and literacy-related assessments with students who are underserved in the U.S. educational system. We frame bias in SLS in terms of testing fairness and validity, stemming in part from the exclusion of sufficiently large training datasets in varieties of English other than General American English (GAE). We adopt an Interpretation/Use Argument approach to validity focused on clarity of constructs and scoring accuracy. While SLS use ASR to automatically transcribe students' utterances, and apply NLP algorithms to ASR transcripts to measure students' speech samples, it is well documented in studies with adults that ASR is typically more problematic for African American English (AAE) speakers than for other groups due to differences in prosody, pronunciation, word usage, and grammar. We utilized child speech and text corpora to improve algorithms that score oral task responses for child AAE speakers and, in some experiments, children with oral language and reading difficulties. Favorable results provide impetus and possible solutions for fair and inclusive assessments for diverse student groups in the future.
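One diagnostic this line of work motivates is comparing ASR word error rates across speaker groups; the sketch below implements a standard edit-distance WER and applies it to illustrative transcript pairs, and is not the authors' evaluation pipeline.

```python
# Minimal sketch: word error rate (WER) per speaker group, computed with a
# standard word-level edit distance. Group names and transcripts are illustrative.
def wer(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / max(len(r), 1)

pairs_by_group = {   # (reference transcript, ASR hypothesis) pairs per group
    "group_A": [("she was walking to the store", "she was walking to the store")],
    "group_B": [("he be going to the library", "he been going to library")],
}
for group, pairs in pairs_by_group.items():
    rates = [wer(ref, hyp) for ref, hyp in pairs]
    print(group, round(sum(rates) / len(rates), 2))
```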

Citations: 0
Using Multiple Maximum Exposure Rates in Computerized Adaptive Testing
IF 1.4 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-04-16 | DOI: 10.1111/jedm.12436
Kylie Gorney, Mark D. Reckase

In computerized adaptive testing, item exposure control methods are often used to provide a more balanced usage of the item pool. Many of the most popular methods, including the restricted method (Revuelta and Ponsoda), use a single maximum exposure rate to limit the proportion of times that each item is administered. However, Barrada et al. showed that by using multiple maximum exposure rates, it is possible to obtain an even more balanced usage of the item pool. Therefore, in this paper, we develop four extensions of the restricted method that involve the use of multiple maximum exposure rates. A detailed simulation study reveals that (a) all four of the new methods improve item pool utilization and (b) three of the new methods also improve measurement accuracy. Taken together, these results are highly encouraging, as they reveal that it is possible to improve both types of outcomes simultaneously.
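The general shape of restricted-style exposure control with item-specific caps can be sketched as follows; the item pool, caps, and selection rule are illustrative, and the four extensions studied in the article are not reproduced.

```python
# Minimal sketch: at each selection step, pick the most informative item among
# items whose current exposure rate is below their cap. With multiple maximum
# exposure rates, different items (e.g., different content strata or information
# ranges) can receive different caps. All values below are illustrative.
def select_item(information, exposure_counts, tests_administered, r_max, administered):
    """Pick the eligible item with maximum information at the current theta."""
    best_item, best_info = None, float("-inf")
    for item, info in information.items():
        if item in administered:
            continue
        rate = exposure_counts.get(item, 0) / max(tests_administered, 1)
        if rate >= r_max[item]:          # item-specific maximum exposure rate
            continue
        if info > best_info:
            best_item, best_info = item, info
    return best_item

# Example: two caps, with a stricter cap on the highly informative items.
information = {"i1": 2.1, "i2": 1.7, "i3": 0.9}
r_max = {"i1": 0.20, "i2": 0.20, "i3": 0.35}
print(select_item(information, {"i1": 45, "i2": 20, "i3": 5}, 200, r_max, administered=set()))
```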

Citations: 0