Investigating Constructed-Response Scoring Over Time: The Effects of Study Design on Trend Rescore Statistics

ETS Research Report Series (JCR Q3, Social Sciences) | Pub Date: 2022-10-22 | DOI: 10.1002/ets2.12360
John R. Donoghue, Catherine A. McClellan, Melinda R. Hess
ETS Research Report Series, volume 2022 1, pages 1-14. Full text: https://onlinelibrary.wiley.com/doi/10.1002/ets2.12360
Citations: 0

Abstract


When constructed-response items are administered for a second time, it is necessary to evaluate whether the current Time B administration's raters have drifted from the scoring of the original administration at Time A. To study this, Time A papers are sampled and rescored by Time B scorers. Commonly the scores are compared using the proportion of exact agreement across times and/or t-statistics comparing Time A means to Time B means. It is common to treat these rescores with procedures that assume a multinomial sampling model, which is incorrect. The correct, product-multinomial model reflects the stratification of Time A scores. Using direct computation, the research report demonstrates that both proportion of exact agreement and the t-statistic can deviate substantially from expected behavior, providing misleading results. Reweighting the rescore table gives each statistic the correct expected value but does not guarantee that the usual sampling distributions hold. It is also noted that the results apply to a wider class of situations in which a set of papers is scored by one group of raters or scoring engine and then a sample is selected to be evaluated by a different group of raters or scoring engine.
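
To make the distinction between the two sampling models concrete, the following is a minimal sketch, not the report's own procedure. It assumes integer score categories 0..K-1, a rescore table whose rows are Time A scores and whose columns are Time B scores, and known Time A population proportions for each score. The function names, the example table, and the proportions are all hypothetical. The naive statistic treats the whole table as one multinomial sample; the reweighted statistic rescales each Time A row to its population share, which is one plausible reading of "reweighting the rescore table" under the product-multinomial model.

```python
def naive_exact_agreement(table):
    """Diagonal count over total count, as if the table were one multinomial sample."""
    total = sum(sum(row) for row in table)
    agree = sum(table[k][k] for k in range(len(table)))
    return agree / total


def reweighted_table(table, time_a_props):
    """Rescale each Time A row so its mass equals the Time A population proportion.

    Assumes every row has at least one rescored paper (hypothetical helper).
    """
    weighted = []
    for row, prop in zip(table, time_a_props):
        n_row = sum(row)
        weighted.append([prop * count / n_row for count in row])
    return weighted


def reweighted_exact_agreement(table, time_a_props):
    """Exact agreement computed on the reweighted table."""
    w = reweighted_table(table, time_a_props)
    return sum(w[k][k] for k in range(len(w)))


def reweighted_means(table, time_a_props):
    """Time A and Time B mean scores computed on the reweighted table."""
    w = reweighted_table(table, time_a_props)
    mean_a = sum(a * sum(row) for a, row in enumerate(w))
    mean_b = sum(b * cell for row in w for b, cell in enumerate(row))
    return mean_a, mean_b


# Hypothetical rescore table: 100 papers sampled from each Time A score level.
table = [
    [90, 10, 0],   # Time A score 0
    [20, 70, 10],  # Time A score 1
    [0, 40, 60],   # Time A score 2
]
# Hypothetical Time A population proportions for scores 0, 1, 2.
time_a_props = [0.6, 0.3, 0.1]

print(round(naive_exact_agreement(table), 4))                     # 0.7333
print(round(reweighted_exact_agreement(table, time_a_props), 4))  # 0.81
mean_a, mean_b = reweighted_means(table, time_a_props)
print(round(mean_a, 4), round(mean_b, 4))                         # 0.5 0.49
```

In this made-up example the equal-allocation sample overrepresents the rarer high scores, so the naive agreement (0.7333) understates the population-weighted value (0.81). As the abstract cautions, reweighting corrects the expected value only; it does not by itself justify the usual sampling distributions for either statistic.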
