
Applied Measurement in Education: Latest Publications

Predictive Modeling of Rater Behavior: Implications for Quality Assurance in Essay Scoring
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-07-02 | DOI: 10.1080/08957347.2020.1750406
I. Bejar, Chen Li, D. McCaffrey
ABSTRACT We evaluate the feasibility of developing predictive models of rater behavior, that is, rater-specific models for predicting the scores produced by a rater under operational conditions. In the present study, the dependent variable is the score assigned to essays by a rater, and the predictors are linguistic attributes of the essays used by the e-rater® engine. Specifically, for each rater, the linear regression of rater scores on the linguistic attributes is obtained based on data from two consecutive time periods. The regression from each period was cross validated against data from the other period. Raters were characterized in terms of their level of predictability and the importance of the predictors. Results suggest that rater models capture stable individual differences among raters. To evaluate the feasibility of using rater models as a quality control mechanism, we evaluated the relationship between rater predictability and inter-rater agreement and performance on pre-scored essays. Finally, we conducted a simulation whereby raters are simulated to score exclusively as a function of essay length at different points during the scoring day. We concluded that predictive rater models merit further investigation as a means of quality controlling human scoring.
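The modeling step described in this abstract is easy to prototype. Below is a minimal Python sketch that fits a rater-specific linear regression of assigned scores on essay features for one scoring period and checks how well it predicts the same rater's scores in a second period; the feature names, the simulated data, and the correlation used as a predictability index are assumptions of the sketch, not the e-rater feature set or the authors' exact procedure.

```python
# Sketch: rater-specific predictive model with cross-period validation.
# Assumes a pandas DataFrame per period with illustrative linguistic
# feature columns and the score the rater assigned to each essay.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["grammar", "usage", "mechanics", "development", "word_length"]  # placeholder names

def fit_rater_model(period_df: pd.DataFrame) -> LinearRegression:
    """Regress one rater's scores on the essay features for a single period."""
    model = LinearRegression()
    model.fit(period_df[FEATURES], period_df["rater_score"])
    return model

def cross_period_predictability(train_df: pd.DataFrame, test_df: pd.DataFrame) -> float:
    """Fit on one period and correlate predictions with the rater's actual
    scores in the other period (one simple index of predictability)."""
    model = fit_rater_model(train_df)
    preds = model.predict(test_df[FEATURES])
    return float(np.corrcoef(preds, test_df["rater_score"])[0, 1])

# Usage with simulated data standing in for two scoring periods:
rng = np.random.default_rng(0)
def make_period(n=200):
    X = pd.DataFrame(rng.normal(size=(n, len(FEATURES))), columns=FEATURES)
    X["rater_score"] = (X.sum(axis=1) * 0.4 + rng.normal(scale=0.5, size=n)).round()
    return X

period1, period2 = make_period(), make_period()
print("period1 -> period2 r:", cross_period_predictability(period1, period2))
print("period2 -> period1 r:", cross_period_predictability(period2, period1))
```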
Applied Measurement in Education, 33(1), pp. 234–247 · Citations: 1
Applying Cognitive Theory to the Human Essay Rating Process
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-07-02 | DOI: 10.1080/08957347.2020.1750405
B. Finn, Burcu Arslan, M. Walsh
ABSTRACT To score an essay response, raters draw on previously trained skills and knowledge about the underlying rubric and score criterion. Cognitive processes such as remembering, forgetting, and skill decay likely influence rater performance. To investigate how forgetting influences scoring, we evaluated raters’ scoring accuracy on TOEFL and GRE essays. We used binomial linear mixed effect models to evaluate how the effect of various predictors such as time spent scoring each response and days between scoring sessions relate to scoring accuracy. Results suggest that for both nonoperational (i.e., calibration samples completed prior to a scoring session) and operational scoring (i.e., validity samples interspersed among actual student responses), the number of days in a scoring gap negatively affects performance. The findings, as well as other results from the models are discussed in the context of cognitive influences on knowledge and skill retention.
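A stripped-down way to explore the relationship described here is a logistic regression of scoring accuracy on time spent per response and the number of days since the previous scoring session. The sketch below uses statsmodels on simulated data; it deliberately omits the random rater and prompt effects a full binomial mixed model would include, and all variable names are placeholders.

```python
# Sketch: accuracy (correct/incorrect score on a pre-scored response) as a
# function of time spent and the gap in days since the last scoring session.
# Simplification: a plain logistic regression, without the random effects a
# binomial mixed model would add for raters and prompts.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "seconds_spent": rng.gamma(shape=4.0, scale=20.0, size=n),
    "gap_days": rng.integers(0, 15, size=n),
})
# Simulate a negative effect of the scoring gap on accuracy.
logit = 1.0 + 0.004 * df["seconds_spent"] - 0.08 * df["gap_days"]
df["accurate"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = smf.logit("accurate ~ seconds_spent + gap_days", data=df).fit(disp=False)
print(model.summary())
```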
Applied Measurement in Education, 33(1), pp. 223–233 · Citations: 2
Gauging Uncertainty in Test-to-Curriculum Alignment Indices
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-03-03 | DOI: 10.1080/08957347.2020.1732387
A. Traynor, Tingxuan Li, Shuqi Zhou
ABSTRACT During the development of large-scale school achievement tests, panels of independent subject-matter experts use systematic judgmental methods to rate the correspondence between a given test’s items and performance objective statements. The individual experts’ ratings may then be used to compute summary indices to quantify the match between a given test and its target item domain. The magnitude of alignment index variability across experts within a panel, and randomly-sampled panels, is largely unknown, however. Using rater-by-item data from alignment reviews of 14 US states’ achievement tests, we examine observed distributions and estimate standard errors for three alignment indices developed by Webb. Our results suggest that alignment decisions based on the recommended criterion for the balance-of-representation index may often be uncertain, and that the criterion for the depth-of-knowledge consistency index should perhaps be reconsidered. We also examine current recommendations about the number of expert panelists required to compute these alignment indices.
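To make the idea of index variability concrete, the sketch below computes a balance-of-representation-style index from pooled panelist ratings and a bootstrap standard error obtained by resampling panelists. The index follows one common statement of Webb's balance formula, and both the formula's use here and the toy data are assumptions of the sketch rather than the authors' procedure.

```python
# Sketch: a balance-of-representation-style index and a bootstrap standard
# error over panelists. The index formula follows one common statement of
# Webb's balance index; treat both it and the data layout as assumptions.
import numpy as np

def balance_index(assignments):
    """assignments: list of objective ids, one per item hit (pooled over panelists).
    Returns 1 - (sum_k |1/O - I_k/H|) / 2, where O is the number of objectives
    hit, I_k the hits on objective k, and H the total number of hits."""
    objectives, counts = np.unique(assignments, return_counts=True)
    O, H = len(objectives), counts.sum()
    return 1.0 - np.abs(1.0 / O - counts / H).sum() / 2.0

def bootstrap_se(panelist_assignments, n_boot=2000, seed=0):
    """panelist_assignments: one list of objective ids per panelist.
    Resamples panelists with replacement and recomputes the pooled index."""
    rng = np.random.default_rng(seed)
    k = len(panelist_assignments)
    stats = []
    for _ in range(n_boot):
        sample = rng.integers(0, k, size=k)
        pooled = [obj for i in sample for obj in panelist_assignments[i]]
        stats.append(balance_index(pooled))
    return float(np.std(stats, ddof=1))

# Usage: three hypothetical panelists assigning 10 items to objectives 1-4.
panelists = [
    [1, 1, 2, 3, 3, 3, 4, 4, 1, 2],
    [1, 2, 2, 3, 3, 4, 4, 4, 1, 1],
    [1, 1, 1, 3, 3, 3, 4, 2, 2, 2],
]
pooled = [obj for p in panelists for obj in p]
print("balance index:", round(balance_index(pooled), 3))
print("bootstrap SE over panelists:", round(bootstrap_se(panelists), 3))
```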
Applied Measurement in Education, 33(1), pp. 141–158 · Citations: 3
The Impact of Test-Taking Disengagement on Item Content Representation
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-03-03 | DOI: 10.1080/08957347.2020.1732386
S. Wise
ABSTRACT In achievement testing there is typically a practical requirement that the set of items administered should be representative of some target content domain. This is accomplished by establishing test blueprints specifying the content constraints to be followed when selecting the items for a test. Sometimes, however, students give disengaged responses to some of their test items, which raises the issue of the degree to which the set of engaged responses maintain the intended content representation. The current investigation reports the results of two studies focused on rapid-guessing behavior. The first study showed evidence that differential rapid guessing often resulted in test events with meaningfully distorted content representation. The second study found that the differences in test taking engagement across content categories were primarily due to differences in the reading load of items. Implications for test-score validity are discussed along with suggestions for addressing the problem.
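Rapid guesses are usually flagged from response times, and their effect on content representation can be checked by comparing the content mix of all administered items with the mix among engaged responses only. The following sketch is a generic illustration with an arbitrary fixed threshold and made-up data, not the flagging rule or data used in these studies.

```python
# Sketch: flag rapid guesses by a response-time threshold and compare the
# content-category mix of all administered items vs. engaged responses only.
# The 5-second threshold and the data are illustrative assumptions.
import pandas as pd

responses = pd.DataFrame({
    "content_area": ["algebra", "algebra", "geometry", "geometry", "data", "data", "data", "algebra"],
    "seconds": [42, 3, 55, 2, 61, 4, 38, 47],
})
THRESHOLD = 5  # seconds; real applications use item-specific thresholds

responses["engaged"] = responses["seconds"] >= THRESHOLD
administered_mix = responses["content_area"].value_counts(normalize=True)
engaged_mix = responses.loc[responses["engaged"], "content_area"].value_counts(normalize=True)

comparison = pd.DataFrame({"administered": administered_mix, "engaged_only": engaged_mix}).fillna(0.0)
print(comparison.round(2))  # distortion shows up as a shift between the two columns
```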
Applied Measurement in Education, 33(1), pp. 83–94 · Citations: 12
The Trade-Off between Model Fit, Invariance, and Validity: The Case of PISA Science Assessments
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-03-03 | DOI: 10.1080/08957347.2020.1732384
Yasmine H. El Masri, D. Andrich
ABSTRACT In large-scale educational assessments, it is generally required that tests are composed of items that function invariantly across the groups to be compared. Despite efforts to ensure invariance in the item construction phase, for a range of reasons (including the security of items) it is often necessary to account for differential item functioning (DIF) of items post hoc. This typically requires a choice among retaining an item as it is despite its DIF, deleting the item, or resolving (splitting) an item by creating a distinct item for each group. These options involve a trade-off between model fit and the invariance of item parameters, and each option could be valid depending on whether or not the source of DIF is relevant or irrelevant to the variable being assessed. We argue that making a choice requires a careful analysis of statistical DIF and its substantive source. We illustrate our argument by analyzing PISA 2006 science data of three countries (UK, France and Jordan) using the Rasch model, which was the model used for the analyses of all PISA 2006 data. We identify items with real DIF across countries and examine the implications for model fit, invariance, and the validity of cross-country comparisons when these items are either eliminated, resolved or retained.
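For reference, a compact way to write the dichotomous Rasch model and the "resolve" option is given below; resolving (splitting) an item for DIF amounts to estimating a separate difficulty for each group on that item. The notation is generic rather than taken from the article.

```latex
% Dichotomous Rasch model for person n and item i
\Pr(X_{ni}=1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
% Resolving item i for DIF across groups g: one difficulty per group g(n)
\Pr(X_{ni}=1 \mid \theta_n, \delta_{i\,g(n)}) = \frac{\exp(\theta_n - \delta_{i\,g(n)})}{1 + \exp(\theta_n - \delta_{i\,g(n)})}
```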
Applied Measurement in Education, 33(1), pp. 174–188 · Citations: 15
Comparing Cut Scores from the Angoff Method and Two Variations of the Hofstee and Beuk Methods
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-03-03 | DOI: 10.1080/08957347.2020.1732385
Adam E. Wyse
ABSTRACT This article compares cut scores from two variations of the Hofstee and Beuk methods, which determine cut scores by resolving inconsistencies in panelists’ judgments about cut scores and pass rates, with the Angoff method. The first variation uses responses to the Hofstee and Beuk percentage correct and pass rate questions to calculate cut scores. The second variation uses Angoff ratings to determine percentage correct data in combination with responses to pass rate questions. Analysis of data from 15 standard settings suggested that the Hofstee and Beuk methods yielded similar cut scores, and that cut scores were about 2% lower when using Angoff ratings. The two approaches also differed in the weight assigned to cut score judgments in the Beuk method and in the occurrence of undefined cut scores in the Hofstee method. Findings also indicated that the Hofstee and Beuk methods often produced higher cut scores and lower pass rates than the Angoff method. It is suggested that attention needs to be paid to the strategy used to estimate Hofstee and Beuk cut scores.
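As a concrete illustration of the compromise idea behind the Hofstee method, the sketch below finds the cut score where the line through the panel's (minimum cut, maximum fail rate) and (maximum cut, minimum fail rate) points meets the observed cumulative fail-rate curve. The grid search, the simulated score distribution, and the panel inputs are assumptions of this sketch; the article's two variations feed the panel judgments in differently.

```python
# Sketch: a Hofstee-style compromise cut score. Panelists supply acceptable
# bounds on the cut score (k_min, k_max) and on the fail rate (f_min, f_max);
# the cut is taken where the line joining (k_min, f_max) and (k_max, f_min)
# meets the observed cumulative fail-rate curve. Inputs are illustrative.
import numpy as np

def hofstee_cut(scores, k_min, k_max, f_min, f_max):
    candidates = np.arange(k_min, k_max + 1)
    observed_fail = np.array([(scores < c).mean() for c in candidates])
    line_fail = f_max + (f_min - f_max) * (candidates - k_min) / (k_max - k_min)
    best = int(np.argmin(np.abs(observed_fail - line_fail)))
    return candidates[best], observed_fail[best]

rng = np.random.default_rng(2)
scores = np.clip(rng.normal(70, 10, size=5000).round(), 0, 100)  # simulated exam scores
cut, fail_rate = hofstee_cut(scores, k_min=60, k_max=75, f_min=0.05, f_max=0.30)
print(f"cut score: {cut}, implied fail rate: {fail_rate:.2%}")
```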
Applied Measurement in Education, 33(1), pp. 159–173 · Citations: 4
Rasch Model Extensions for Enhanced Formative Assessments in MOOCs
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-03-03 | DOI: 10.1080/08957347.2020.1732382
D. Abbakumov, P. Desmet, W. Van den Noortgate
ABSTRACT Formative assessments are an important component of massive open online courses (MOOCs), online courses with open access and unlimited student participation. Accurate conclusions on students' proficiency via formative assessments, however, face several challenges: (a) students are typically allowed to make several attempts; and (b) student performance might be affected by other variables, such as interest. Thus, neglecting the effects of attempts and interest in proficiency evaluation might result in biased conclusions. In this study, we address this limitation and propose two extensions of a common psychometric model, the Rasch model, by including the effects of attempts and interest. We illustrate these extensions using real MOOC data and evaluate them using cross-validation. We found that (a) the effects of attempts and interest on performance are positive on average but both vary among students; (b) part of the variance in proficiency parameters is due to variation between students in the effect of interest; and (c) the overall accuracy of prediction of students' item responses using the extensions is 4.3% higher than using the Rasch model.
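Schematically, extensions of this kind add attempt and interest terms to the Rasch logit. One generic way to write such a model, which is not the authors' exact parameterization, is:

```latex
% Generic Rasch extension: person p, item i, attempt number a_{pi}, interest z_p
\mathrm{logit}\,\Pr(X_{pi}=1) = \theta_p - \beta_i + \delta\,(a_{pi}-1) + (\gamma + \gamma_p)\, z_p
% \delta: average gain per additional attempt; \gamma and \gamma_p: average and
% person-specific effects of interest on performance
```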
Applied Measurement in Education, 33(1), pp. 113–123 · Citations: 4
Subscore Equating and Profile Reporting
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-03-03 | DOI: 10.1080/08957347.2020.1732381
Euijin Lim, Won‐Chan Lee
ABSTRACT The purpose of this study is to address the necessity of subscore equating and to evaluate the performance of various equating methods for subtests. Assuming the random groups design and number-correct scoring, this paper analyzed real data and simulated data with four study factors including test dimensionality, subtest length, form difference in difficulty, and sample size. The results indicated that reporting subscores without equating provides misleading information in terms of score profiles and that reporting subscores without a pre-specified test specification brings practical issues such as constructing alternate subtest forms with comparable difficulty, conducting equating between forms with different lengths, and deciding an appropriate score scale to be reported.
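For context, under the random groups design assumed in this study, the simplest way to place a subscore from form X on the scale of form Y is mean equating, which adjusts only for the difference in form means; a standard statement of it, not specific to this article, is:

```latex
% Mean equating of form X subscores to the scale of form Y (random groups design)
m_Y(x) = x - \mu_X + \mu_Y
% where \mu_X and \mu_Y are the subscore means of the two randomly
% equivalent groups on their respective forms
```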
Applied Measurement in Education, 33(1), pp. 95–112 · Citations: 4
The Effectiveness and Features of Formative Assessment in US K-12 Education: A Systematic Review
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-03-02 | DOI: 10.1080/08957347.2020.1732383
Hansol Lee, Huy Q. Chung, Yu Zhang, J. Abedi, M. Warschauer
ABSTRACT In the present article, we present a systematic review of previous empirical studies that conducted formative assessment interventions to improve student learning. Previous meta-analysis research on the overall effects of formative assessment on student learning has been conclusive, but little has been studied on important features of formative assessment interventions and their differential impacts on student learning in the United States' K-12 education system. Analysis of the 126 effect sizes identified from the 33 selected studies, representing 25 research projects that met the inclusion criteria (e.g., included a control condition), revealed an overall small positive effect of formative assessment on student learning (d = .29), with benefits for mathematics (d = .34), literacy (d = .33), and arts (d = .29). Further investigation with meta-regression analyses indicated that supporting student-initiated self-assessment (d = .61) and providing formal formative assessment evidence (e.g., written feedback on quizzes; d = .40) via a medium-cycle length (within or between instructional units; d = .52) enhanced the effectiveness of formative assessments.
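The pooling step behind summary effects such as d = .29 can be illustrated with a standard DerSimonian-Laird random-effects calculation. The sketch below uses made-up effect sizes and sampling variances and is not the authors' meta-regression model.

```python
# Sketch: DerSimonian-Laird random-effects pooling of standardized mean
# differences. Effect sizes and variances below are made up for illustration.
import numpy as np

def random_effects_pool(d, v):
    d, v = np.asarray(d, float), np.asarray(v, float)
    w = 1.0 / v                                   # fixed-effect weights
    d_fixed = np.sum(w * d) / np.sum(w)
    Q = np.sum(w * (d - d_fixed) ** 2)            # heterogeneity statistic
    C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - (len(d) - 1)) / C)       # between-study variance
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    pooled = np.sum(w_star * d) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return pooled, se, tau2

d = [0.10, 0.25, 0.40, 0.35, 0.20]   # hypothetical study effect sizes
v = [0.02, 0.03, 0.05, 0.04, 0.03]   # hypothetical sampling variances
pooled, se, tau2 = random_effects_pool(d, v)
print(f"pooled d = {pooled:.2f}, SE = {se:.2f}, tau^2 = {tau2:.3f}")
```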
Applied Measurement in Education, 33(1), pp. 124–140 · Citations: 44
Some Methods and Evaluation for Linking and Equating with Small Samples
IF 1.5 | Zone 4 (Education) | Q3 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2020-01-02 | DOI: 10.1080/08957347.2019.1674304
Michael R. Peabody
ABSTRACT The purpose of the current article is to introduce the equating and evaluation methods used in this special issue. Although a comprehensive review of all existing models and methodologies would be impractical given the format, a brief introduction to some of the more popular models is provided. A brief discussion of the conditions required for equating precedes the discussion of the equating methods themselves. The procedures in this review include the Tucker method, mean equating, nominal weights mean equating, simplified circle-arc equating, identity equating, and IRT/Rasch model equating. Measures that help to evaluate the success of the equating process include the standard error of equating, bias, and root-mean-square error. This should provide readers with a basic framework and enough background information to follow the studies in this issue.
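Two of the simplest pieces mentioned here, mean equating and the bias/RMSE evaluation criteria, can be sketched as follows; the identity function serves as the baseline, the "criterion" equating is simulated, and nothing in the sketch reproduces the special issue's actual analyses.

```python
# Sketch: identity and mean equating of form X scores to the form Y scale,
# evaluated against a (simulated) criterion equating with bias and RMSE.
import numpy as np

def identity_equate(x_scores, *_):
    return np.asarray(x_scores, float)

def mean_equate(x_scores, x_sample, y_sample):
    """Shift form X scores by the difference in form means (random groups)."""
    return np.asarray(x_scores, float) - np.mean(x_sample) + np.mean(y_sample)

def bias_and_rmse(estimated, criterion):
    diff = np.asarray(estimated, float) - np.asarray(criterion, float)
    return float(np.mean(diff)), float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(3)
x_sample = rng.binomial(50, 0.62, size=150)   # small-sample form X scores
y_sample = rng.binomial(50, 0.65, size=150)   # small-sample form Y scores
score_points = np.arange(0, 51)
criterion = score_points + 1.5                # stand-in for a large-sample criterion

for name, fn in [("identity", identity_equate), ("mean", mean_equate)]:
    est = fn(score_points, x_sample, y_sample)
    bias, rmse = bias_and_rmse(est, criterion)
    print(f"{name:>8} equating: bias = {bias:+.2f}, RMSE = {rmse:.2f}")
```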
Applied Measurement in Education, 33(1), pp. 3–9 · Citations: 7