
Latest Publications in Applied Measurement in Education

Detecting Local Dependence: A Threshold-Autoregressive Item Response Theory (TAR-IRT) Approach for Polytomous Items
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-20, DOI: 10.1080/08957347.2020.1789136
Xiaodan Tang, G. Karabatsos, Haiqin Chen
ABSTRACT In applications of item response theory (IRT) models, it is known that empirical violations of the local independence (LI) assumption can significantly bias parameter estimates. To address this issue, we propose a threshold-autoregressive item response theory (TAR-IRT) model that additionally accounts for order dependence among the item responses of each examinee. The TAR-IRT approach also defines a new family of IRT models for polytomous item responses under both unidimensional and multidimensional frameworks, with order-dependent effects between item responses and relevant dimensions. The feasibility of the proposed model was demonstrated by an empirical study using polytomous response data. A simulation study of polytomous item responses with order effects of different magnitudes in an education context shows that the TAR modeling framework could provide more accurate ability estimation than the partial credit model when order effects exist.
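The abstract does not spell out the model's functional form, but the basic mechanism can be sketched: a partial credit model (PCM) whose linear predictor is shifted by an autoregressive term that switches on when the previous response crosses a threshold. The Python sketch below is illustrative only; the function names, the gating rule `prev_response > threshold`, and the shift parameter `gamma` are assumptions made for exposition, not the paper's parameterization.

```python
import numpy as np

def pcm_probs(theta, deltas):
    """Partial credit model category probabilities for one item.

    theta  : latent trait value
    deltas : step difficulties (length = n_categories - 1)
    Returns probabilities over categories 0..len(deltas).
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0.
    steps = np.concatenate(([0.0], np.cumsum(theta - deltas)))
    expo = np.exp(steps - steps.max())          # stabilize the exponentials
    return expo / expo.sum()

def tar_pcm_probs(theta, deltas, prev_response, threshold, gamma):
    """Illustrative TAR adjustment: if the previous item's response exceeds
    a threshold, shift the effective trait by gamma before applying the PCM.
    (Hypothetical parameterization, for illustration only.)
    """
    theta_eff = theta + gamma * float(prev_response > threshold)
    return pcm_probs(theta_eff, deltas)

# Example: a 4-category item; a high response on the previous item
# inflates the probabilities of higher categories when gamma > 0.
deltas = np.array([-1.0, 0.0, 1.0])
print(pcm_probs(0.5, deltas))                       # no order effect
print(tar_pcm_probs(0.5, deltas, prev_response=3,   # order effect active
                    threshold=1, gamma=0.4))
```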
Citations: 1
Validating Rubric Scoring Processes: An Application of an Item Response Tree Model
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-20, DOI: 10.1080/08957347.2020.1789143
Aaron J. Myers, Allison J. Ames, B. Leventhal, Madison A. Holzman
ABSTRACT When rating performance assessments, raters may assign different scores to the same performance when rubric application does not align with the intended application of the scoring criteria. Given that performance assessment score interpretation assumes raters apply rubrics as rubric developers intended, misalignment between raters’ scoring processes and the intended scoring processes may lead to invalid inferences from these scores. In an effort to standardize raters’ scoring processes, an alternative scoring method was used. With this method, rubric developers’ intended scoring processes are made explicit by requiring raters to respond to a series of selected-response statements resembling a decision tree. To determine whether raters scored essays as intended under both a traditional rubric and the alternative scoring method, an IRT model with a tree-like structure (IRTree) was specified to depict the intended scoring processes and fit to data from each scoring method. Results suggest raters using the alternative method may be better able to rate as intended, and thus the alternative method may be a viable alternative to traditional rubric scoring. Implications of the IRTree model are discussed.
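The key step in an IRTree analysis is re-expressing each rubric score as a set of binary pseudo-items, one per decision node, with unreached nodes treated as missing. A minimal sketch of that expansion is shown below for a hypothetical 0–3 rubric scored through three sequential yes/no decisions; the actual tree in the study is defined by the rubric developers' selected-response statements, and `NODE_MAP` here is an assumed mapping used only for illustration.

```python
import numpy as np

# Node mapping for a hypothetical 0-3 rubric scored by three sequential
# yes/no decisions ("does the essay earn at least 1 / 2 / 3 points?").
# Keys = observed score, values = decision-node outcomes; np.nan = node not reached.
NODE_MAP = {
    0: [0, np.nan, np.nan],
    1: [1, 0, np.nan],
    2: [1, 1, 0],
    3: [1, 1, 1],
}

def expand_to_pseudo_items(ratings):
    """Expand a vector of polytomous ratings (one per essay) into an
    (n_essays x n_nodes) matrix of binary pseudo-item responses that an
    IRTree model can be fit to with standard binary IRT software."""
    return np.array([NODE_MAP[r] for r in ratings], dtype=float)

ratings = [3, 0, 2, 1, 2]
print(expand_to_pseudo_items(ratings))
# Each column is treated as a separate pseudo-item; nan entries drop out of
# the likelihood because those decisions were never made.
```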
Citations: 6
An IRT Mixture Model for Rating Scale Confusion Associated with Negatively Worded Items in Measures of Social-Emotional Learning
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-16, DOI: 10.1080/08957347.2020.1789140
D. Bolt, Y. Wang, R. Meyer, L. Pier
ABSTRACT We illustrate the application of mixture IRT models to evaluate respondent confusion due to the negative wording of certain items on a social-emotional learning (SEL) assessment. Using actual student self-report ratings on four social-emotional learning scales collected from students in grades 3–12 from CORE Districts in the state of California, we also evaluate the consequences of the potential confusion in biasing student- and school-level scores as well as the estimated correlational relationships between SEL constructs and student-level variables. Models of both full and partial confusion are examined. Our results suggest that (1) rating scale confusion due to negatively worded items does appear to be present; (2) the confusion is most prevalent at lower grade levels (third–fifth); and (3) the occurrence of confusion is positively related to both reading proficiency and ELL status, as anticipated, and consequently biases estimates of SEL correlations with these student-level variables. For these reasons, we suggest future iterations of the SEL measures use only positively oriented items.
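The mixture mechanism can be sketched as two latent classes that share item parameters but differ on negatively worded items: the "confused" class responds as if the scale had not been reversed, so its category probabilities on those items are mirrored. The sketch below uses a rating scale model with hypothetical parameters and a fixed trait value to compute the posterior probability of class membership for one response pattern; a real mixture IRT analysis would integrate over the latent trait and estimate all parameters, and the full versus partial confusion variants would change which items are mirrored.

```python
import numpy as np

def rsm_probs(theta, b, taus):
    """Rating scale model probabilities for one item with location b and
    shared category thresholds taus (length = n_categories - 1)."""
    steps = np.concatenate(([0.0], np.cumsum(theta - b - taus)))
    expo = np.exp(steps - steps.max())
    return expo / expo.sum()

def pattern_likelihood(responses, theta, b_vec, taus, negative, confused):
    """Likelihood of one response pattern. In the 'confused' class the
    category probabilities of negatively worded items are mirrored,
    mimicking a failure to reverse the rating scale."""
    lik = 1.0
    for x, b, neg in zip(responses, b_vec, negative):
        p = rsm_probs(theta, b, taus)
        if confused and neg:
            p = p[::-1]                 # mirror the response categories
        lik *= p[x]
    return lik

# Hypothetical parameters: 4 items (items 3 and 4 negatively worded),
# 4 response categories, mixing proportion pi for the confused class.
b_vec = np.array([-0.5, 0.0, 0.3, 0.8])
taus = np.array([-1.0, 0.0, 1.0])
negative = [False, False, True, True]
pi_confused = 0.2
responses = [3, 2, 3, 3]   # high ratings everywhere, including negative items
theta = 0.5                # held fixed here for simplicity

l_normal = pattern_likelihood(responses, theta, b_vec, taus, negative, False)
l_conf = pattern_likelihood(responses, theta, b_vec, taus, negative, True)
post_conf = pi_confused * l_conf / (pi_confused * l_conf + (1 - pi_confused) * l_normal)
print(round(post_conf, 3))   # posterior probability of the confused class
```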
Citations: 5
Evaluating Random and Systematic Error in Student Growth Percentiles
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-15, DOI: 10.1080/08957347.2020.1789139
C. Wells, S. Sireci
ABSTRACT Student growth percentiles (SGPs) are currently used by several states and school districts to provide information about individual students as well as to evaluate teachers, schools, and school districts. For SGPs to be defensible for these purposes, they should be reliable. In this study, we examine the amount of systematic and random error in SGPs by simulating test scores for four grades and estimating SGPs using one, two, or three conditioning years. The results indicated that, although the amount of systematic error was small to moderate, the amount of random error was substantial, regardless of the number of conditioning years. For example, the standard error of the SGP estimates associated with an SGP value of 56 was 22.2, resulting in a 68% confidence interval that would range from 33.8 to 78.2 when using three conditioning years. The results are consistent with previous research and suggest SGP estimates are too imprecise to be reported for the purpose of understanding students’ progress over time.
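The reported interval follows from the usual normal approximation (estimate ± z·SE). The short sketch below reproduces the 68% interval from the abstract and adds the corresponding 95% interval, clipped to the 1–99 range that SGPs can take, to make the imprecision concrete.

```python
# Normal-approximation confidence intervals for the SGP example in the
# abstract: point estimate 56 with standard error 22.2.
sgp, se = 56, 22.2

for label, z in [("68%", 1.0), ("95%", 1.96)]:
    lo = max(1, sgp - z * se)       # SGPs are percentiles, bounded 1-99
    hi = min(99, sgp + z * se)
    print(f"{label} CI: [{lo:.1f}, {hi:.1f}]")

# 68% CI: [33.8, 78.2]  -- matches the interval reported in the abstract
# 95% CI: [12.5, 99.0]  -- nearly the full score range after clipping
```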
Citations: 2
The Impact of Setting Scoring Expectations on Rater Scoring Rates and Accuracy
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-02, DOI: 10.1080/08957347.2020.1750401
Cathy L. W. Wendler, Nancy Glazer, B. Bridgeman
ABSTRACT Efficient constructed response (CR) scoring requires both accuracy and speed from human raters. This study was designed to determine whether setting scoring rate expectations would encourage raters to score at a faster pace and, if so, whether there would be differential effects on scoring accuracy for raters who score at different rates. Three rater groups (slow, medium, and fast) and two conditions (informed and uninformed) were used. In both conditions, raters were given identical scoring directions, but only the informed groups were given an expected scoring rate. Results indicated no significant differences across the two conditions. However, there were significant increases in scoring rates for medium and slow raters compared to their previous operational rates, regardless of whether they were in the informed or uninformed condition. Results also showed there were no significant effects on rater accuracy for either of the two conditions or for any of the rater groups.
Citations: 0
Understanding and Interpreting Human Scoring
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-02, DOI: 10.1080/08957347.2020.1750402
Nancy Glazer, E. Wolfe
ABSTRACT This introductory article describes how constructed response scoring is carried out, particularly the rater monitoring processes, and illustrates three potential designs for conducting rater monitoring in an operational scoring project. The introduction also presents a framework for interpreting research conducted by those who study the constructed response scoring process. That framework identifies three classifications of inputs (rater characteristics, response content, and rating context), which typically serve as independent variables in constructed response scoring research, as well as three primary outcomes (rating quality, rating speed, and rater attitude), which serve as the dependent variables in those studies. Finally, we explain how each of the articles in this issue can be classified according to that framework.
Citations: 2
Why Should We Care about Human Raters?
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-02, DOI: 10.1080/08957347.2020.1750407
E. Wolfe, Cathy L. W. Wendler
For more than a decade, measurement practitioners and researchers have emphasized evaluating, improving, and implementing automated scoring of constructed response (CR) items and tasks. There is go...
Citations: 2
Commentary on “Using Human Raters in Constructed Response Scoring: Understanding, Predicting, and Modifying Performance”
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-02, DOI: 10.1080/08957347.2020.1750408
Walter D. Way
This special issue of AME provides a rich set of articles related to monitoring human scoring of constructed response items. As a starting point for this commentary, it is worth mentioning that the...
Citations: 0
Evaluating Human Scoring Using Generalizability Theory
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-02, DOI: 10.1080/08957347.2020.1750403
Y. Bimpeh, W. Pointer, Ben A. Smith, Liz Harrison
ABSTRACT Many high-stakes examinations in the United Kingdom (UK) use both constructed-response items and selected-response items. We need to evaluate the inter-rater reliability of constructed-response items that are scored by humans. While the psychometric literature offers a variety of methods for evaluating rater consistency across ratings, we apply generalizability theory (G theory) to data from routine monitoring of ratings to derive an estimate of inter-rater reliability. UK examinations use a combination of double and multiple rating for routine monitoring, creating a more complex design that consists of cross-pairing of raters and overlapping of raters across different groups of candidates or items. This sampling design is neither fully crossed nor nested. Each double- or multiple-scored item involves a different set of candidates, and the number of sampled candidates per item varies. Therefore, the standard G theory method, and its various forms for estimating inter-rater reliability, cannot be applied directly to the operational data. We propose a method that takes double or multiple rating data as given and analyzes the datasets at the item level in order to obtain more accurate and stable variance component estimates. We adapt the variance components of observed scores for an unbalanced one-facet crossed design with some missing observations. These estimates can be used to make inferences about the reliability of the entire scoring process. We illustrate the proposed method by applying it to real scoring data.
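For a single complete candidates-by-raters table, the one-facet variance components and the resulting inter-rater (G) coefficient can be obtained from ANOVA mean squares, as sketched below. This is only the textbook balanced-design case; the article's contribution is adapting the variance component estimation to unbalanced operational data with missing ratings, analyzed item by item, which this sketch does not attempt.

```python
import numpy as np

def one_facet_g_study(scores):
    """Variance components for a complete persons-x-raters score matrix
    (one-facet crossed p x r design), via expected mean squares."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)

    ms_p = n_r * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rater_means - grand) ** 2) / (n_r - 1)
    resid = scores - person_means[:, None] - rater_means[None, :] + grand
    ms_pr = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    var_pr = ms_pr                              # person-by-rater interaction + error
    var_r = max(0.0, (ms_r - ms_pr) / n_p)      # rater main effect (severity)
    var_p = max(0.0, (ms_p - ms_pr) / n_r)      # true person (candidate) variance
    return var_p, var_r, var_pr

def g_coefficient(var_p, var_pr, k_raters):
    """Relative G coefficient (inter-rater reliability) for the mean of k raters."""
    return var_p / (var_p + var_pr / k_raters)

# Toy double-marking example: 5 candidates each scored by the same 2 raters.
scores = np.array([[4, 5], [2, 2], [3, 4], [5, 5], [1, 2]], dtype=float)
var_p, var_r, var_pr = one_facet_g_study(scores)
print(var_p, var_r, var_pr, g_coefficient(var_p, var_pr, k_raters=2))
```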
Citations: 2
The Impact of Operational Scoring Experience and Additional Mentored Training on Raters’ Essay Scoring Accuracy
IF 1.5, CAS Tier 4 (Education), Q3 EDUCATION & EDUCATIONAL RESEARCH, Pub Date: 2020-07-02, DOI: 10.1080/08957347.2020.1750404
Ikkyu Choi, E. Wolfe
ABSTRACT Rater training is essential in ensuring the quality of constructed response scoring. Most of the current knowledge about rater training comes from experimental contexts with an emphasis on short-term effects. Little empirical evidence is available on whether and how raters become more accurate as they gain scoring experience, or on what long-term effects training can have. In this study, we addressed this research gap by tracking how the accuracy of new raters changes with experience and by examining the impact of an additional training session on their accuracy in scoring calibration and monitoring essays. We found that, on average, raters’ accuracy improved with scoring experience and that individual raters differed in their accuracy trajectories. The estimated average effect of the training was an approximately six percent increase in calibration essay accuracy. On the other hand, we observed a smaller impact on monitoring essay accuracy. Our follow-up analysis showed that this differential impact of the additional training on calibration and monitoring essay accuracy could be accounted for by successful gatekeeping through calibration.
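A minimal sketch of the kind of tabulation this design implies: per-rater accuracy summarized before and after the additional training, separately for calibration and monitoring essays. The data frame, column names, and accuracy values below are hypothetical (toy numbers chosen to echo the reported pattern of a larger calibration effect), and the study's longitudinal analysis is richer than a simple pre/post mean difference.

```python
import pandas as pd

# Hypothetical per-rater accuracy rates before and after the extra training,
# separately for calibration and monitoring essays.
df = pd.DataFrame({
    "rater":      ["r1"] * 4 + ["r2"] * 4,
    "phase":      ["pre", "pre", "post", "post"] * 2,
    "essay_type": ["calibration", "monitoring"] * 4,
    "accuracy":   [0.70, 0.74, 0.77, 0.75,    # r1
                   0.66, 0.72, 0.71, 0.73],   # r2
})

means = df.groupby(["essay_type", "phase"])["accuracy"].mean().unstack("phase")
effect = means["post"] - means["pre"]
print(means)
print(effect)   # ~0.06 gain for calibration essays vs. ~0.01 for monitoring
```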
Citations: 2