IRT Observed-Score Equating for Rater-Mediated Assessments Using a Hierarchical Rater Model

Journal of Educational Measurement, Vol. 62, No. 1, pp. 145-171
Published: 2025-01-13 | DOI: 10.1111/jedm.12425 | Impact Factor: 1.6 | Q3, Psychology, Applied
Tong Wu, Stella Y. Kim, Carl Westine, Michelle Boyer

Abstract

While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. Differences in standard error between the methods were minimal under most conditions.
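As background for the method named in the abstract (this sketch is not drawn from the paper itself), IRT observed-score equating typically computes each form's model-implied observed-score distribution via the Lord-Wingersky recursion, here generalized to polytomous items scored under the generalized partial credit model (GPCM), and then equates the resulting distributions equipercentile-style. The item parameters, quadrature grid, and toy form below are illustrative assumptions, not values from the study.

```python
import numpy as np

def gpcm_probs(theta, a, deltas):
    """Category probabilities for one GPCM item scored 0..m.

    theta  : latent trait value
    a      : item discrimination
    deltas : step difficulties (length m)
    """
    # Cumulative sums of a*(theta - delta_v); the empty sum gives category 0.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(deltas)))))
    ez = np.exp(z - z.max())  # subtract max for numerical stability
    return ez / ez.sum()

def observed_score_dist(theta, items):
    """Lord-Wingersky recursion for polytomous items: the distribution
    of the summed score at a fixed theta, built item by item."""
    dist = np.array([1.0])  # before any items, P(score = 0) = 1
    for a, deltas in items:
        p = gpcm_probs(theta, a, deltas)
        new = np.zeros(len(dist) + len(deltas))
        for k, pk in enumerate(p):  # convolve with this item's categories
            new[k:k + len(dist)] += pk * dist
        dist = new
    return dist

def marginal_score_dist(items, quad_points, quad_weights):
    """Marginal observed-score distribution: average the conditional
    distributions over a discrete approximation to the ability density."""
    dists = [observed_score_dist(t, items) for t in quad_points]
    return np.average(dists, axis=0, weights=quad_weights)

# Hypothetical toy form: two items scored 0-2, one item scored 0-1 (max score 5).
items = [(1.0, [-0.5, 0.5]), (1.2, [0.0, 1.0]), (0.8, [0.3])]
theta_q = np.linspace(-4, 4, 41)            # quadrature points
w_q = np.exp(-0.5 * theta_q ** 2)           # unnormalized N(0,1) weights
w_q /= w_q.sum()
f = marginal_score_dist(items, theta_q, w_q)
print(f.round(4))  # probabilities over total scores 0..5
```

Running the same computation on both forms yields two observed-score distributions whose percentile ranks can then be matched; the paper's contribution is to replace the GPCM conditional probabilities with those implied by a hierarchical rater model so that rater bias and variability enter the score distributions directly.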
