{"title":"Effects of Misbehaving Common Items on Aggregate Scores and an Application of the Mantel-Haenszel Statistic in Test Equating. CSE Report 688.","authors":"M. Michaelides","doi":"10.1037/e644592011-001","DOIUrl":null,"url":null,"abstract":"Consistent behavior is a desirable characteristic that common items are expected to have when administered to different groups. Findings from the literature have established that items do not always behave in consistent ways; item indices and IRT item parameter estimates of the same items differ when obtained from different administrations. Content effects, such as discrepancies in instructional emphasis, and context effects, such as changes in the presentation, format, and positioning of the item, may result in differential item difficulty for different groups. When common items are differentially difficult for two groups, using them to generate an equating transformation is questionable. The delta-plot method is a simple, graphical procedure that identifies such items by examining their classical test theory difficulty values. After inspection, such items are likely to drop to a noncommon-item status. Two studies are described in this report. Study 1 investigates the influence of common items that behave inconsistently across two administrations on equated score summaries. Study 2 applies an alternative to the delta-plot method for flagging common items for differential behavior across administrations. The first study examines the effects of retaining versus discarding the common items flagged as outliers by the delta-plot method on equated score summary statistics. For four statewide assessments that were administered in two consecutive years under the common-item nonequivalent groups design, the equating functions that transform the Year2 to the Year-1 scale are estimated using four different IRT equating methods (Stocking & Lord, Haebara, mean/sigma, mean/mean) under two IRT models—the threeand the oneparameter logistic models for the dichotomous items with Samejima’s (1969) graded response model for polytomous items. The changes in the Year-2 equated mean scores, mean gains or declines from Year 1 to Year 2, and proportions above a cut-off point are examined when all the common items are used in the equating process versus when the delta-plot outliers are excluded from the common-item pool. Results under the four equating methods 1 The author would like to thank Edward Haertel for his thoughtful guidance on this project and for reviewing this report. Thanks also to Measured Progress Inc. for providing data for this project, as well as John Donoghue, Neil Dorans, Kyoko Ito, Michael Jodoin, Michael Nering, David Rogosa, Richard Shavelson, Wendy Yen, Rebecca Zwick and seminar participants at CTB/McGraw Hill and ETS for suggestions. Any errors and omissions are the responsibility of the author. 
Results from the two studies in this report were presented at the 2003 and 2005 Annual Meetings of the American Educational Research Association.","PeriodicalId":19116,"journal":{"name":"National Center for Research on Evaluation, Standards, and Student Testing","volume":"86 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2006-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"National Center for Research on Evaluation, Standards, and Student Testing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1037/e644592011-001","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
Consistent behavior is a desirable characteristic that common items are expected to have when administered to different groups. Findings from the literature have established that items do not always behave in consistent ways; item indices and IRT item parameter estimates of the same items differ when obtained from different administrations. Content effects, such as discrepancies in instructional emphasis, and context effects, such as changes in the presentation, format, and positioning of an item, may result in differential item difficulty for different groups. When common items are differentially difficult for two groups, using them to generate an equating transformation is questionable. The delta-plot method is a simple, graphical procedure that identifies such items by examining their classical test theory difficulty values. After inspection, such items are likely to be dropped to noncommon-item status.

Two studies are described in this report. Study 1 investigates the influence of common items that behave inconsistently across two administrations on equated score summaries. Study 2 applies an alternative to the delta-plot method for flagging common items for differential behavior across administrations.

The first study examines the effects of retaining versus discarding the common items flagged as outliers by the delta-plot method on equated score summary statistics. For four statewide assessments that were administered in two consecutive years under the common-item nonequivalent groups design, the equating functions that transform the Year-2 to the Year-1 scale are estimated using four different IRT equating methods (Stocking & Lord, Haebara, mean/sigma, mean/mean) under two IRT models: the three- and the one-parameter logistic models for the dichotomous items, with Samejima's (1969) graded response model for the polytomous items. The changes in the Year-2 equated mean scores, mean gains or declines from Year 1 to Year 2, and proportions above a cut-off point are examined when all the common items are used in the equating process versus when the delta-plot outliers are excluded from the common-item pool. Results under the four equating methods …

¹ The author would like to thank Edward Haertel for his thoughtful guidance on this project and for reviewing this report. Thanks also to Measured Progress Inc. for providing data for this project, as well as John Donoghue, Neil Dorans, Kyoko Ito, Michael Jodoin, Michael Nering, David Rogosa, Richard Shavelson, Wendy Yen, Rebecca Zwick, and seminar participants at CTB/McGraw Hill and ETS for suggestions. Any errors and omissions are the responsibility of the author. Results from the two studies in this report were presented at the 2003 and 2005 Annual Meetings of the American Educational Research Association.
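For context, the delta-plot screen described in the abstract can be implemented compactly. Below is a minimal sketch in Python, assuming the standard ETS delta transformation (delta = 4 * inverse-normal(1 - p) + 13), a principal-axis line fitted to the paired delta values, and an illustrative perpendicular-distance threshold of 1.5 delta units; the report's actual flagging criterion may differ, and all function names and data here are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def delta_values(p):
    """Convert classical p-values (proportions correct) to the ETS delta scale."""
    return 4.0 * norm.ppf(1.0 - np.asarray(p, dtype=float)) + 13.0

def delta_plot_outliers(p_year1, p_year2, threshold=1.5):
    """Flag common items whose point deviates from the principal-axis line
    fitted to the (Year-1 delta, Year-2 delta) scatter."""
    x, y = delta_values(p_year1), delta_values(p_year2)
    sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    # Principal-axis (major-axis) slope and intercept.
    a = (sy2 - sx2 + np.sqrt((sy2 - sx2) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    b = y.mean() - a * x.mean()
    # Perpendicular distance of each point from the line y = a * x + b.
    dist = np.abs(a * x - y + b) / np.sqrt(a ** 2 + 1.0)
    return dist > threshold

# Ten common items; item 7 (0-indexed) is noticeably harder in Year 2.
p1 = np.array([0.82, 0.75, 0.64, 0.58, 0.71, 0.49, 0.88, 0.66, 0.55, 0.73])
p2 = np.array([0.80, 0.73, 0.62, 0.55, 0.69, 0.47, 0.86, 0.40, 0.53, 0.71])
print(np.where(delta_plot_outliers(p1, p2))[0])  # indices of flagged items
```

Items flagged this way would then be inspected and, as the abstract notes, possibly moved to noncommon-item status before the equating transformation is estimated.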
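The report's title points to the Mantel-Haenszel statistic as the alternative flagging procedure in Study 2. For reference, here is a minimal sketch of the generic Mantel-Haenszel DIF computation (the common odds ratio across score strata, plus Holland and Thayer's MH D-DIF rescaling, -2.35 * ln(alpha)), treating Year-1 examinees as the reference group and Year-2 examinees as the focal group; this is the textbook computation, not necessarily the exact variant applied in the report.

```python
import numpy as np

def mantel_haenszel_dif(scores, correct, group):
    """Mantel-Haenszel DIF for one item.

    scores  -- matching variable (e.g., total test score), one value per examinee
    correct -- 0/1 responses to the studied item
    group   -- 0 = reference (Year 1), 1 = focal (Year 2)
    Returns the MH common odds ratio and the MH D-DIF index.
    """
    scores, correct, group = map(np.asarray, (scores, correct, group))
    num = den = 0.0
    for k in np.unique(scores):            # one 2x2 table per score stratum
        at_k = scores == k
        A = np.sum(at_k & (group == 0) & (correct == 1))  # reference, right
        B = np.sum(at_k & (group == 0) & (correct == 0))  # reference, wrong
        C = np.sum(at_k & (group == 1) & (correct == 1))  # focal, right
        D = np.sum(at_k & (group == 1) & (correct == 0))  # focal, wrong
        T = A + B + C + D
        if T > 0:
            num += A * D / T
            den += B * C / T
    alpha = num / den                      # MH common odds ratio
    return alpha, -2.35 * np.log(alpha)    # MH D-DIF on the ETS delta metric
```

Values of alpha near 1 (MH D-DIF near 0) indicate comparable item difficulty for the two groups after conditioning on the matching score; a large |MH D-DIF| flags an item much as a large delta-plot residual does.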
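Of the four equating methods named above, mean/sigma and mean/mean are simple moment methods: they choose the slope A and intercept B of the scale transformation theta_target = A * theta_new + B directly from the common items' parameter estimates (b-parameters transform as b_target = A * b_new + B, a-parameters as a_target = a_new / A). A minimal sketch under that standard formulation, with hypothetical parameter estimates:

```python
import numpy as np

def mean_sigma(b_target, b_new):
    """Mean/sigma: slope from the SDs of the common items' b-estimates,
    intercept from their means."""
    A = np.std(b_target, ddof=1) / np.std(b_new, ddof=1)
    B = np.mean(b_target) - A * np.mean(b_new)
    return A, B

def mean_mean(a_target, a_new, b_target, b_new):
    """Mean/mean: slope from the ratio of mean a-estimates, intercept from
    the mean b-estimates."""
    A = np.mean(a_new) / np.mean(a_target)
    B = np.mean(b_target) - A * np.mean(b_new)
    return A, B

# Hypothetical common-item estimates from two separate calibrations.
b1 = np.array([-0.8, -0.2, 0.1, 0.6, 1.2])   # Year-1 (target) scale
b2 = np.array([-0.9, -0.3, 0.0, 0.5, 1.0])   # Year-2 (new) scale
a1 = np.array([0.9, 1.1, 1.0, 1.3, 0.8])
a2 = np.array([1.0, 1.2, 1.1, 1.4, 0.9])

A, B = mean_sigma(b1, b2)
A2, B2 = mean_mean(a1, a2, b1, b2)
theta_on_year1_scale = A * 0.5 + B           # rescale a Year-2 theta of 0.5
```

The Stocking & Lord and Haebara methods instead choose A and B by minimizing discrepancies between characteristic curves computed from the two sets of estimates, which requires numerical optimization and is omitted here. Excluding delta-plot outliers from the common-item pool changes the means, SDs, and curves these methods are computed from, which is how the flagged items propagate into the equated score summaries examined in Study 1.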