John R. Donoghue, Catherine A. McClellan, Melinda R. Hess
ETS Research Report Series, 2022(1), pp. 1-14. Published 2022-10-22. DOI: 10.1002/ets2.12360. https://onlinelibrary.wiley.com/doi/10.1002/ets2.12360
Investigating Constructed-Response Scoring Over Time: The Effects of Study Design on Trend Rescore Statistics
When constructed-response items are administered for a second time, it is necessary to evaluate whether the current Time B administration's raters have drifted from the scoring of the original administration at Time A. To study this, Time A papers are sampled and rescored by Time B scorers. Commonly the scores are compared using the proportion of exact agreement across times and/or t-statistics comparing Time A means to Time B means. It is common to treat these rescores with procedures that assume a multinomial sampling model, which is incorrect. The correct, product-multinomial model reflects the stratification of Time A scores. Using direct computation, the research report demonstrates that both proportion of exact agreement and the t-statistic can deviate substantially from expected behavior, providing misleading results. Reweighting the rescore table gives each statistic the correct expected value but does not guarantee that the usual sampling distributions hold. It is also noted that the results apply to a wider class of situations in which a set of papers is scored by one group of raters or scoring engine and then a sample is selected to be evaluated by a different group of raters or scoring engine.
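To make the stratification point concrete, the following is a minimal sketch, not the authors' computation. It assumes a hypothetical three-level score scale, made-up Time A population proportions (`time_a_props`), and a made-up rescore table in which 100 papers were drawn from each Time A score level. The reweighting shown is the ordinary stratified estimator (each Time A row is weighted by its population proportion), which is one plausible reading of "reweighting the rescore table"; the function names are illustrative only.

```python
# Rows: Time A score (the sampling strata). Columns: Time B rescore.
# All numbers are hypothetical and chosen only for illustration.
table = [
    [90, 10, 0],   # papers scored 0 at Time A
    [20, 70, 10],  # papers scored 1 at Time A
    [0, 50, 50],   # papers scored 2 at Time A
]
# Assumed proportions of each Time A score in the full Time A population.
time_a_props = [0.6, 0.3, 0.1]


def exact_agreement_naive(table):
    """Diagonal count over total count, as if the table were one multinomial sample."""
    total = sum(sum(row) for row in table)
    agree = sum(table[i][i] for i in range(len(table)))
    return agree / total


def exact_agreement_reweighted(table, props):
    """Weight each row's agreement rate by its Time A population proportion."""
    return sum(p * row[i] / sum(row) for i, (p, row) in enumerate(zip(props, table)))


def time_b_mean_naive(table):
    """Mean Time B score over the rescore sample, ignoring the stratification."""
    total = sum(sum(row) for row in table)
    return sum(j * n for row in table for j, n in enumerate(row)) / total


def time_b_mean_reweighted(table, props):
    """Mean Time B score with each Time A row weighted by its population proportion."""
    row_means = [sum(j * n for j, n in enumerate(row)) / sum(row) for row in table]
    return sum(p * m for p, m in zip(props, row_means))


def time_a_mean_population(props):
    """Mean Time A score implied by the assumed population proportions."""
    return sum(score * p for score, p in enumerate(props))


print(f"naive exact agreement:      {exact_agreement_naive(table):.2f}")
print(f"reweighted exact agreement: {exact_agreement_reweighted(table, time_a_props):.2f}")
print(f"naive Time B mean:          {time_b_mean_naive(table):.2f}")
print(f"reweighted Time B mean:     {time_b_mean_reweighted(table, time_a_props):.2f}")
print(f"population Time A mean:     {time_a_mean_population(time_a_props):.2f}")
```

With these invented numbers, the naive agreement is 0.70 while the reweighted value is 0.80, and the naive Time B mean (about 0.83) is far from the reweighted mean (0.48), which sits close to the Time A population mean of 0.50. The gap arises only because the equal-size strata overrepresent the rare high scores. As the abstract cautions, reweighting corrects the expected value but does not by itself justify the usual sampling distributions for these statistics.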