{"title":"Effects of Misbehaving Common Items on Aggregate Scores and an Application of the Mantel-Haenszel Statistic in Test Equating. CSE Report 688.","authors":"M. Michaelides","doi":"10.1037/e644592011-001","DOIUrl":null,"url":null,"abstract":"Consistent behavior is a desirable characteristic that common items are expected to have when administered to different groups. Findings from the literature have established that items do not always behave in consistent ways; item indices and IRT item parameter estimates of the same items differ when obtained from different administrations. Content effects, such as discrepancies in instructional emphasis, and context effects, such as changes in the presentation, format, and positioning of the item, may result in differential item difficulty for different groups. When common items are differentially difficult for two groups, using them to generate an equating transformation is questionable. The delta-plot method is a simple, graphical procedure that identifies such items by examining their classical test theory difficulty values. After inspection, such items are likely to drop to a noncommon-item status. Two studies are described in this report. Study 1 investigates the influence of common items that behave inconsistently across two administrations on equated score summaries. Study 2 applies an alternative to the delta-plot method for flagging common items for differential behavior across administrations. The first study examines the effects of retaining versus discarding the common items flagged as outliers by the delta-plot method on equated score summary statistics. For four statewide assessments that were administered in two consecutive years under the common-item nonequivalent groups design, the equating functions that transform the Year2 to the Year-1 scale are estimated using four different IRT equating methods (Stocking & Lord, Haebara, mean/sigma, mean/mean) under two IRT models—the threeand the oneparameter logistic models for the dichotomous items with Samejima’s (1969) graded response model for polytomous items. The changes in the Year-2 equated mean scores, mean gains or declines from Year 1 to Year 2, and proportions above a cut-off point are examined when all the common items are used in the equating process versus when the delta-plot outliers are excluded from the common-item pool. Results under the four equating methods 1 The author would like to thank Edward Haertel for his thoughtful guidance on this project and for reviewing this report. Thanks also to Measured Progress Inc. for providing data for this project, as well as John Donoghue, Neil Dorans, Kyoko Ito, Michael Jodoin, Michael Nering, David Rogosa, Richard Shavelson, Wendy Yen, Rebecca Zwick and seminar participants at CTB/McGraw Hill and ETS for suggestions. Any errors and omissions are the responsibility of the author. 
Results from the two studies in this report were presented at the 2003 and 2005 Annual Meetings of the American Educational Research Association.","PeriodicalId":19116,"journal":{"name":"National Center for Research on Evaluation, Standards, and Student Testing","volume":"86 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2006-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"National Center for Research on Evaluation, Standards, and Student Testing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1037/e644592011-001","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Consistent behavior is a desirable characteristic that common items are expected to have when administered to different groups. Findings from the literature have established that items do not always behave in consistent ways; item indices and IRT item parameter estimates of the same items differ when obtained from different administrations. Content effects, such as discrepancies in instructional emphasis, and context effects, such as changes in the presentation, format, and positioning of an item, may result in differential item difficulty for different groups. When common items are differentially difficult for two groups, using them to generate an equating transformation is questionable. The delta-plot method is a simple, graphical procedure that identifies such items by examining their classical test theory difficulty values. After inspection, such items are likely to be dropped to noncommon-item status.

Two studies are described in this report. Study 1 investigates the influence of common items that behave inconsistently across two administrations on equated score summaries. Study 2 applies an alternative to the delta-plot method for flagging common items for differential behavior across administrations.

The first study examines the effects of retaining versus discarding the common items flagged as outliers by the delta-plot method on equated score summary statistics. For four statewide assessments that were administered in two consecutive years under the common-item nonequivalent groups design, the equating functions that transform the Year-2 to the Year-1 scale are estimated using four different IRT equating methods (Stocking & Lord, Haebara, mean/sigma, mean/mean) under two IRT models: the three- and the one-parameter logistic models for the dichotomous items, with Samejima's (1969) graded response model for the polytomous items. The changes in the Year-2 equated mean scores, mean gains or declines from Year 1 to Year 2, and proportions above a cut-off point are examined when all the common items are used in the equating process versus when the delta-plot outliers are excluded from the common-item pool. Results under the four equating methods …

¹ The author would like to thank Edward Haertel for his thoughtful guidance on this project and for reviewing this report. Thanks also to Measured Progress Inc. for providing data for this project, as well as John Donoghue, Neil Dorans, Kyoko Ito, Michael Jodoin, Michael Nering, David Rogosa, Richard Shavelson, Wendy Yen, Rebecca Zwick, and seminar participants at CTB/McGraw Hill and ETS for suggestions. Any errors and omissions are the responsibility of the author. Results from the two studies in this report were presented at the 2003 and 2005 Annual Meetings of the American Educational Research Association.
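For context, the delta-plot screen described in the abstract can be implemented compactly. Below is a minimal sketch in Python, assuming the standard ETS delta transformation (delta = 4 * inverse-normal(1 - p) + 13), a principal-axis line fitted to the paired delta values, and an illustrative perpendicular-distance threshold of 1.5 delta units; the report's actual flagging criterion may differ, and all function names and data here are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def delta_values(p):
    """Convert classical p-values (proportions correct) to the ETS delta scale."""
    return 4.0 * norm.ppf(1.0 - np.asarray(p, dtype=float)) + 13.0

def delta_plot_outliers(p_year1, p_year2, threshold=1.5):
    """Flag common items whose point deviates from the principal-axis line
    fitted to the (Year-1 delta, Year-2 delta) scatter."""
    x, y = delta_values(p_year1), delta_values(p_year2)
    sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    # Principal-axis (major-axis) slope and intercept.
    a = (sy2 - sx2 + np.sqrt((sy2 - sx2) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    b = y.mean() - a * x.mean()
    # Perpendicular distance of each point from the line y = a * x + b.
    dist = np.abs(a * x - y + b) / np.sqrt(a ** 2 + 1.0)
    return dist > threshold

# Ten common items; item 7 (0-indexed) is noticeably harder in Year 2.
p1 = np.array([0.82, 0.75, 0.64, 0.58, 0.71, 0.49, 0.88, 0.66, 0.55, 0.73])
p2 = np.array([0.80, 0.73, 0.62, 0.55, 0.69, 0.47, 0.86, 0.40, 0.53, 0.71])
print(np.where(delta_plot_outliers(p1, p2))[0])  # indices of flagged items
```

Items flagged this way would then be inspected and, as the abstract notes, possibly moved to noncommon-item status before the equating transformation is estimated.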
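The report's title points to the Mantel-Haenszel statistic as the alternative flagging procedure in Study 2. For reference, here is a minimal sketch of the generic Mantel-Haenszel DIF computation (the common odds ratio across score strata, plus Holland and Thayer's MH D-DIF rescaling, -2.35 * ln(alpha)), treating Year-1 examinees as the reference group and Year-2 examinees as the focal group; this is the textbook computation, not necessarily the exact variant applied in the report.

```python
import numpy as np

def mantel_haenszel_dif(scores, correct, group):
    """Mantel-Haenszel DIF for one item.

    scores  -- matching variable (e.g., total test score), one value per examinee
    correct -- 0/1 responses to the studied item
    group   -- 0 = reference (Year 1), 1 = focal (Year 2)
    Returns the MH common odds ratio and the MH D-DIF index.
    """
    scores, correct, group = map(np.asarray, (scores, correct, group))
    num = den = 0.0
    for k in np.unique(scores):            # one 2x2 table per score stratum
        at_k = scores == k
        A = np.sum(at_k & (group == 0) & (correct == 1))  # reference, right
        B = np.sum(at_k & (group == 0) & (correct == 0))  # reference, wrong
        C = np.sum(at_k & (group == 1) & (correct == 1))  # focal, right
        D = np.sum(at_k & (group == 1) & (correct == 0))  # focal, wrong
        T = A + B + C + D
        if T > 0:
            num += A * D / T
            den += B * C / T
    alpha = num / den                      # MH common odds ratio
    return alpha, -2.35 * np.log(alpha)    # MH D-DIF on the ETS delta metric
```

Values of alpha near 1 (MH D-DIF near 0) indicate comparable item difficulty for the two groups after conditioning on the matching score; a large |MH D-DIF| flags an item much as a large delta-plot residual does.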
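Of the four equating methods named above, mean/sigma and mean/mean are simple moment methods: they choose the slope A and intercept B of the scale transformation theta_target = A * theta_new + B directly from the common items' parameter estimates (b-parameters transform as b_target = A * b_new + B, a-parameters as a_target = a_new / A). A minimal sketch under that standard formulation, with hypothetical parameter estimates:

```python
import numpy as np

def mean_sigma(b_target, b_new):
    """Mean/sigma: slope from the SDs of the common items' b-estimates,
    intercept from their means."""
    A = np.std(b_target, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_target) - A * np.mean(b_new)
    return A, B

def mean_mean(a_target, a_new, b_target, b_new):
    """Mean/mean: slope from the ratio of mean a-estimates, intercept from
    the mean b-estimates."""
    A = np.mean(a_new) / np.mean(a_target)
    B = np.mean(b_target) - A * np.mean(b_new)
    return A, B

# Hypothetical common-item estimates from two separate calibrations.
b1 = np.array([-0.8, -0.2, 0.1, 0.6, 1.2])   # Year-1 (target) scale
b2 = np.array([-0.9, -0.3, 0.0, 0.5, 1.0])   # Year-2 (new) scale
a1 = np.array([0.9, 1.1, 1.0, 1.3, 0.8])
a2 = np.array([1.0, 1.2, 1.1, 1.4, 0.9])

A, B = mean_sigma(b1, b2)
A2, B2 = mean_mean(a1, a2, b1, b2)
theta_on_year1_scale = A * 0.5 + B           # rescale a Year-2 theta of 0.5
```

The Stocking & Lord and Haebara methods instead choose A and B by minimizing discrepancies between characteristic curves computed from the two sets of estimates, which requires numerical optimization and is omitted here. Excluding delta-plot outliers from the common-item pool changes the means, SDs, and curves these methods are computed from, which is how the flagged items propagate into the equated score summaries examined in Study 1.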