Investigating Constructed-Response Scoring Over Time: The Effects of Study Design on Trend Rescore Statistics

ETS Research Report Series (JCR Q3, Social Sciences) | Pub Date: 2022-10-22 | DOI: 10.1002/ets2.12360

John R. Donoghue, Catherine A. McClellan, Melinda R. Hess

ETS Research Report Series, 2022(1), pp. 1-14. Full text: https://onlinelibrary.wiley.com/doi/10.1002/ets2.12360

Citations: 0

Abstract


Investigating Constructed-Response Scoring Over Time: The Effects of Study Design on Trend Rescore Statistics

When constructed-response items are administered for a second time, it is necessary to evaluate whether the current Time B administration's raters have drifted from the scoring of the original administration at Time A. To study this, Time A papers are sampled and rescored by Time B scorers. Commonly the scores are compared using the proportion of exact agreement across times and/or t-statistics comparing Time A means to Time B means. It is common to treat these rescores with procedures that assume a multinomial sampling model, which is incorrect. The correct, product-multinomial model reflects the stratification of Time A scores. Using direct computation, the research report demonstrates that both proportion of exact agreement and the t-statistic can deviate substantially from expected behavior, providing misleading results. Reweighting the rescore table gives each statistic the correct expected value but does not guarantee that the usual sampling distributions hold. It is also noted that the results apply to a wider class of situations in which a set of papers is scored by one group of raters or scoring engine and then a sample is selected to be evaluated by a different group of raters or scoring engine.
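The contrast between the two sampling models can be made concrete with a small sketch. It assumes one plausible reading of "reweighting the rescore table": each Time A row of the table is rescaled so that its share matches the proportion of that score in the full Time A population. The function names, the 3 x 3 table of counts, and the population proportions below are all hypothetical and chosen only for illustration; none of them come from the report.

```python
def reweight_rescore_table(counts, pop_props):
    """Rescale each Time A row so its total equals the population proportion.

    counts[i][j]: number of sampled papers scored i at Time A and j at Time B.
    pop_props[i]: assumed proportion of score i among ALL Time A papers.
    """
    weighted = []
    for row, prop in zip(counts, pop_props):
        n_row = sum(row)
        if n_row == 0:
            raise ValueError("each Time A score level needs at least one rescored paper")
        # Conditional distribution of Time B scores given the Time A score,
        # multiplied by the population share of that Time A score.
        weighted.append([prop * cell / n_row for cell in row])
    return weighted


def exact_agreement(table):
    """Proportion of table mass on the diagonal (same score at both times)."""
    total = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / total


def mean_scores(table):
    """Return (Time A mean, Time B mean) implied by the table."""
    total = sum(sum(row) for row in table)
    mean_a = sum(i * sum(row) for i, row in enumerate(table)) / total
    mean_b = sum(j * cell for row in table for j, cell in enumerate(row)) / total
    return mean_a, mean_b


# Hypothetical rescore sample: 50 papers drawn at each Time A score (0, 1, 2),
# that is, a sample stratified on the Time A score.
counts = [
    [45, 5, 0],
    [5, 40, 5],
    [0, 10, 40],
]
# Hypothetical score distribution in the full Time A population.
pop_props = [0.6, 0.3, 0.1]

weighted = reweight_rescore_table(counts, pop_props)

print("unweighted agreement:", exact_agreement(counts))
print("reweighted agreement:", exact_agreement(weighted))
print("unweighted means (A, B):", mean_scores(counts))
print("reweighted means (A, B):", mean_scores(weighted))
```

With these made-up numbers the unweighted exact agreement is about 0.83, while the reweighted value is 0.86, because the stratified sample over-represents the rarer Time A scores. In line with the abstract's caveat, the sketch only corrects the point estimates; it says nothing about whether the usual sampling distributions apply to the reweighted statistics.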


Source journal

ETS Research Report Series Social Sciences-Education

CiteScore

1.20

Self-citation rate

0.00%

Articles published