
Applied Measurement in Education: Latest Publications

Not-reached Items: An Issue of Time and of Test-taking Disengagement? The Case of PISA 2015 Reading Data
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-07-03 | DOI: 10.1080/08957347.2022.2103136 | Vol. 35(1), pp. 197–221
Elodie Pools
Abstract: Many low-stakes assessments, such as international large-scale surveys, are administered during time-limited testing sessions and some test-takers are not able to endorse the last items of the test, resulting in not-reached (NR) items. However, because the test has no consequence for the respondents, these NR items can also stem from quitting the test. This article, by means of mixture modeling, investigates heterogeneity in the onset of NR items in reading in PISA 2015. Test-taking behavior, assessed by the response times on the first items of the test, and the risk of NR item onset are modeled simultaneously in a 3-class model that distinguishes rapid, slow and typical respondents. Results suggest that NR items can come from a lack of time or from disengaged behaviors and that the relationship between the number of NR items and ability estimate can be affected by these non-effortful NR responses.
Citations: 0
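The classification step in the abstract above can be illustrated in miniature. The sketch below is not the author's joint model: it fits a three-component Gaussian mixture to simulated mean log response times on a test's opening items and labels the components rapid, typical, and slow. The data, class means, and variable names are all hypothetical, and the actual study models response-time class and the risk of not-reached-item onset simultaneously.

```python
# Illustrative sketch only: a 3-class mixture over (log) response times on the
# opening items, loosely in the spirit of the rapid/slow/typical classes.
# All data below are simulated; variable names are hypothetical.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 900
# Simulate mean log response time (log seconds) on the first items for three groups.
rapid   = rng.normal(1.0, 0.3, n // 3)   # disengaged, very fast responding
typical = rng.normal(3.2, 0.4, n // 3)
slow    = rng.normal(4.0, 0.4, n // 3)
mean_log_rt = np.concatenate([rapid, typical, slow]).reshape(-1, 1)

gm = GaussianMixture(n_components=3, random_state=0).fit(mean_log_rt)
labels = gm.predict(mean_log_rt)

# Order the latent classes by their mean speed so they can be named.
order = np.argsort(gm.means_.ravel())
names = {order[0]: "rapid", order[1]: "typical", order[2]: "slow"}
for k in order:
    print(f"{names[k]:7s} class: mean log RT = {gm.means_[k, 0]:.2f}, "
          f"share = {np.mean(labels == k):.2f}")
```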
Response Demands of Reading Comprehension Test Items: A Review of Item Difficulty Modeling Studies
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-07-03 | DOI: 10.1080/08957347.2022.2103135 | Vol. 35(1), pp. 237–253
Steve Ferrara, J. Steedle, R. Frantz
Abstract: Item difficulty modeling studies involve (a) hypothesizing item features, or item response demands, that are likely to predict item difficulty with some degree of accuracy; and (b) entering the features as independent variables into a regression equation or other statistical model to predict difficulty. In this review, we report findings from 13 empirical item difficulty modeling studies of reading comprehension tests. We define reading comprehension item response demands as reading passage variables (e.g., length, complexity), passage-by-item variables (e.g., degree of correspondence between item and text, type of information requested), and item stem and response option variables. We report on response demand variables that are related to item difficulty and illustrate how they can be used to manage item difficulty in construct-relevant ways so that empirical item difficulties are within a targeted range (e.g., located within the Proficient or other proficiency level range on a test’s IRT scale, where intended).
Citations: 1
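As a rough illustration of the item difficulty modeling design the review describes, the sketch below regresses simulated item difficulties on three hypothetical response-demand features (passage length, item-text correspondence, distractor plausibility) with ordinary least squares. The features and coefficients are invented for the example, not taken from any of the 13 reviewed studies.

```python
# A minimal sketch of the item-difficulty-modeling setup: regress empirical item
# difficulty on hypothesized response-demand features. Data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_items = 200
passage_length = rng.normal(0, 1, n_items)   # standardized passage length
correspondence = rng.normal(0, 1, n_items)   # item-text correspondence (higher = easier)
option_plausib = rng.normal(0, 1, n_items)   # distractor plausibility

# "True" generating model for the simulation (unknown in practice).
difficulty = (0.4 * passage_length - 0.5 * correspondence + 0.3 * option_plausib
              + rng.normal(0, 0.5, n_items))

X = np.column_stack([passage_length, correspondence, option_plausib])
model = LinearRegression().fit(X, difficulty)

print("R^2 =", round(model.score(X, difficulty), 3))
for name, b in zip(["passage_length", "correspondence", "option_plausibility"],
                   model.coef_):
    print(f"{name:20s} coef = {b:+.2f}")
```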
Using Bayesian Networks to Characterize Student Performance across Multiple Assessments of Individual Standards
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-07-03 | DOI: 10.1080/08957347.2022.2103134 | Vol. 35(1), pp. 179–196
Jiajun Xu, Nathan Dadey
Abstract: This paper explores how student performance across the full set of multiple modular assessments of individual standards, which we refer to as mini-assessments, from a large-scale, operational interim assessment program can be summarized using Bayesian networks. We follow a completely data-driven approach, in which no constraints are imposed, to best reflect the empirical relationships between these assessments, and a learning trajectory approach, in which constraints are imposed to mirror the stages of a mathematics learning trajectory, to provide insight into student learning. Under both approaches, we aim to draw a holistic picture of performance across all of the mini-assessments that provides additional information for students, educators, and administrators. In particular, the graphical structure of the network and the conditional probabilities of mastery provide information above and beyond an overall score on a single mini-assessment. Uses and implications of our work are discussed.
Citations: 0
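To make the idea of conditional probabilities of mastery concrete, here is a toy, hand-rolled Bayesian network with two standards on a hypothesized learning trajectory, each observed through a pass/fail mini-assessment. The structure and all probability values are made up for illustration; the paper works with a full operational assessment program and compares data-driven and trajectory-constrained structures learned from data.

```python
# Toy Bayesian network: mastery of standard 1 (M1) feeds mastery of standard 2 (M2),
# and each standard is observed through a pass/fail mini-assessment.
# All probabilities are hypothetical.
import itertools

p_m1 = 0.6                                   # P(mastery of standard 1)
p_m2_given_m1 = {True: 0.7, False: 0.15}     # learning-trajectory constraint
p_pass_given_m = {True: 0.85, False: 0.25}   # P(pass mini-assessment | mastery)

def joint(m1, m2, pass1, pass2):
    """P(M1, M2, Pass1, Pass2) under the chain M1 -> M2, Mi -> Passi."""
    p = p_m1 if m1 else 1 - p_m1
    p *= p_m2_given_m1[m1] if m2 else 1 - p_m2_given_m1[m1]
    p *= p_pass_given_m[m1] if pass1 else 1 - p_pass_given_m[m1]
    p *= p_pass_given_m[m2] if pass2 else 1 - p_pass_given_m[m2]
    return p

# Observed: student failed the standard-1 mini-assessment but passed standard 2.
evidence = dict(pass1=False, pass2=True)
post, norm = {}, 0.0
for m1, m2 in itertools.product([True, False], repeat=2):
    post[(m1, m2)] = joint(m1, m2, **evidence)
    norm += post[(m1, m2)]

for (m1, m2), p in post.items():
    print(f"P(M1={m1}, M2={m2} | evidence) = {p / norm:.3f}")
print("P(M2 mastered | evidence) =",
      round(sum(p for (m1, m2), p in post.items() if m2) / norm, 3))
```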
Guiding Educators’ Evaluation of the Measurement Quality of Social and Emotional Learning (SEL) Assessments
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-04-03 | DOI: 10.1080/08957347.2022.2067541 | Vol. 35(1), pp. 153–177
Jessica L. Jonson
Abstract: This article describes a grant project that generated a technical guide for PK-12 educators who are utilizing social and emotional learning (SEL) assessments for educational improvement purposes. The guide was developed over a two-year period with funding from the Spencer Foundation. The result was the collective contribution of a widely representative group of scholars and practitioners whose background and expertise provided a multifaceted view of important considerations when evaluating the measurement quality of an SEL assessment. The intent of the guide is to enable PK-12 educators to make more informed decisions when identifying, evaluating, and using valid, reliable, and fair SEL assessments for the purposes of curricular and program improvements. The efforts can also serve as an example of how to contextualize professional standards for testing practice that support the selection and use of tests by non-measurement audiences.
Citations: 0
Comparing the Robustness of Three Nonparametric DIF Procedures to Differential Rapid Guessing
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-04-03 | DOI: 10.1080/08957347.2022.2067542 | Vol. 35(1), pp. 81–94
Mohammed A. A. Abulela, Joseph A. Rios
Abstract: When there are no personal consequences associated with test performance for examinees, rapid guessing (RG) is a concern and can differ between subgroups. To date, the impact of differential RG on item-level measurement invariance has received minimal attention. To that end, a simulation study was conducted to examine the robustness of the Mantel-Haenszel (MH), standardization index (STD), and logistic regression (LR) differential item functioning (DIF) procedures to type I error in the presence of differential RG. Sample size, test difficulty, group impact, and differential RG rates were manipulated. Findings revealed that the LR procedure was completely robust to type I errors, while slightly elevated false positive rates (< 1%) were observed for the MH and STD procedures. An applied analysis examining data from the Programme for International Student Assessment showed minimal differences in DIF classifications when comparing data in which RG responses were unfiltered and filtered. These results suggest that large rates of differences in RG rates between subgroups are unassociated with false positive classifications of DIF.
Citations: 0
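The Mantel-Haenszel procedure evaluated in the study can be sketched directly: stratify examinees by rest score, pool the 2x2 group-by-correctness tables for a studied item, and convert the common odds ratio to the ETS delta metric. The simulation below generates DIF-free Rasch data with no rapid-guessing mechanism, so the statistic should land near zero; all values are simulated and the sketch is only an illustration of the MH statistic, not of the full study design.

```python
# Minimal Mantel-Haenszel DIF sketch on simulated, DIF-free Rasch data.
import numpy as np

rng = np.random.default_rng(2)
n, n_items = 2000, 20
group = rng.integers(0, 2, n)                      # 0 = reference, 1 = focal
theta = rng.normal(0, 1, n)                        # same ability distribution
b = rng.normal(0, 1, n_items)                      # item difficulties
p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
resp = (rng.random((n, n_items)) < p).astype(int)  # Rasch-generated responses

item = 0                                           # studied item
rest = resp.sum(axis=1) - resp[:, item]            # stratifying (rest) score

num, den = 0.0, 0.0
for s in np.unique(rest):
    m = rest == s
    a = np.sum((group[m] == 0) & (resp[m, item] == 1))   # reference, correct
    b_ = np.sum((group[m] == 0) & (resp[m, item] == 0))  # reference, incorrect
    c = np.sum((group[m] == 1) & (resp[m, item] == 1))   # focal, correct
    d = np.sum((group[m] == 1) & (resp[m, item] == 0))   # focal, incorrect
    t = a + b_ + c + d
    if t == 0:
        continue
    num += a * d / t
    den += b_ * c / t

alpha_mh = num / den                                # MH common odds ratio
delta_mh = -2.35 * np.log(alpha_mh)                 # ETS delta metric
print(f"MH odds ratio = {alpha_mh:.2f}, MH D-DIF = {delta_mh:.2f} (near 0 means no DIF)")
```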
Does the Response Options Placement Provide Clues to the Correct Answers in Multiple-choice Tests? A Systematic Review
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-04-03 | DOI: 10.1080/08957347.2022.2067539 | Vol. 35(1), pp. 133–152
Séverin Lions, Carlos Monsalve, P. Dartnell, María Paz Blanco, Gabriel Ortega, Julie Lemarié
Abstract: Multiple-choice tests are widely used in education, often for high-stakes assessment purposes. Consequently, these tests should be constructed following the highest standards. Many efforts have been undertaken to advance item-writing guidelines intended to improve tests. One important issue is the unwanted effect of the options’ position on test outcomes. Any possible effects should be controlled through an adequate response options placement strategy. However, the literature is not clear about how test developers actually arrange options. Therefore, this research synthesis systematically reviewed studies examining adherence to options placement guidelines. Relevant item features, such as the item source (standardized or teacher-made tests) and the number of options, were considered. Results show that the answer keys’ distribution across tests is often biased, which might provide examinees with clues to select correct options. Findings also show that options are not always arranged in a “logical” fashion (numerically, alphabetically…) despite being suited to such an arrangement. The reasons underlying non-adherence to options placement guidelines are discussed, as is the appropriateness of observed response options placement strategies. Suggestions are provided to help developers better arrange item options.
Citations: 1
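One analysis implied by the review's finding that answer-key distributions are often biased is a simple goodness-of-fit check of key positions. The sketch below, with hypothetical key counts for a 40-item, four-option test, runs a chi-square test against a uniform distribution; it illustrates the idea only and is not a procedure taken from the reviewed studies.

```python
# Check whether the answer key is spread evenly across positions A-D.
# Key counts are hypothetical.
from scipy.stats import chisquare

key_counts = [16, 10, 8, 6]             # observed keys in positions A, B, C, D
stat, p_value = chisquare(key_counts)   # null hypothesis: keys uniform across positions
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
# A small p-value flags a key distribution unlikely under balanced placement.
```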
Effects of Using Double Ratings as Item Scores on IRT Proficiency Estimation
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-04-03 | DOI: 10.1080/08957347.2022.2067543 | Vol. 35(1), pp. 95–115
Yoon Ah Song, Won‐Chan Lee
Abstract: This article presents the performance of item response theory (IRT) models when double ratings, rather than single ratings, are used as item scores in the presence of rater effects. Study 1 examined the influence of the number of ratings on the accuracy of proficiency estimation in the generalized partial credit model (GPCM). Study 2 compared the accuracy of proficiency estimation of two IRT models (the GPCM versus the hierarchical rater model, HRM) for double ratings. The main findings were as follows: (a) rater effects substantially reduced the accuracy of IRT proficiency estimation; (b) double ratings relieved the negative impact of rater effects on proficiency estimation and improved accuracy relative to single ratings; (c) IRT estimators showed different patterns in conditional accuracy; (d) as more items and a larger number of score categories were used, the accuracy of proficiency estimation improved; and (e) the HRM consistently showed better performance than the GPCM.
Citations: 0
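For readers unfamiliar with the GPCM named in the abstract, the sketch below computes category response probabilities for a single polytomous item under made-up parameter values. How double ratings enter the likelihood, and the hierarchical rater model used for comparison, are not shown here.

```python
# Category probabilities for one polytomous item under the generalized partial
# credit model (GPCM). Parameter values are hypothetical.
import numpy as np

def gpcm_probs(theta, a, deltas):
    """P(X = k | theta) for k = 0..K under the GPCM.

    a      : item discrimination
    deltas : step (threshold) parameters delta_1..delta_K
    """
    # Cumulative sums of a*(theta - delta_k), with 0 for category 0.
    steps = np.concatenate([[0.0], np.cumsum(a * (theta - np.asarray(deltas)))])
    expnum = np.exp(steps - steps.max())      # stabilize the exponentials
    return expnum / expnum.sum()

theta = 0.5                      # examinee proficiency
a, deltas = 1.2, [-1.0, 0.0, 1.5]
probs = gpcm_probs(theta, a, deltas)
for k, p in enumerate(probs):
    print(f"P(score = {k} | theta = {theta}) = {p:.3f}")
# With double ratings, each rating would enter the likelihood as a separate item
# score; the paper compares this with a hierarchical rater model.
```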
Performance of Infit and Outfit Confidence Intervals Calculated via Parametric Bootstrapping
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-04-03 | DOI: 10.1080/08957347.2022.2067540 | Vol. 35(1), pp. 116–132
John Alexander Silva Diaz, Carmen Köhler, J. Hartig
Abstract: Testing item fit is central in item response theory (IRT) modeling, since a good fit is necessary to draw valid inferences from estimated model parameters. Infit and outfit statistics, widely used indices for detecting deviations from the Rasch model, are affected by data factors such as sample size. Consequently, the traditional use of fixed infit and outfit cutoff points is an ineffective practice. This article evaluates whether confidence intervals estimated via parametric bootstrapping provide more suitable cutoff points than the conventionally applied range of 0.8–1.2 and outfit critical ranges adjusted by sample size. Performance is evaluated under different sizes of misfit, sample sizes, and numbers of items. Results show that the confidence intervals performed better in terms of power but had inflated type-I error rates, which resulted from mean-square values being pushed below unity under large-misfit conditions. However, when performing a one-sided test using the upper bound of the confidence intervals, this inflation was eliminated.
Citations: 2
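The parametric-bootstrap logic the article evaluates can be sketched as follows: simulate response matrices from (here, assumed known) Rasch parameters, recompute infit and outfit for an item in each replicate, and take percentile-based critical values. A full implementation would re-estimate person and item parameters in every replicate; this simplified version is only meant to show the mechanics.

```python
# Parametric bootstrap of infit/outfit critical values for one item under a
# Rasch model with parameters treated as known. All values are simulated.
import numpy as np

rng = np.random.default_rng(3)
n_persons, n_items, n_reps = 500, 20, 1000
theta = rng.normal(0, 1, n_persons)
b = np.linspace(-2, 2, n_items)
P = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))   # Rasch probabilities

def infit_outfit(x, p, item):
    """Infit and outfit mean-square statistics for one item."""
    r = x[:, item] - p[:, item]          # residuals
    w = p[:, item] * (1 - p[:, item])    # binomial variances
    outfit = np.mean(r**2 / w)           # unweighted mean-square
    infit = np.sum(r**2) / np.sum(w)     # information-weighted mean-square
    return infit, outfit

item = 0
boot = np.array([infit_outfit((rng.random(P.shape) < P).astype(int), P, item)
                 for _ in range(n_reps)])
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)       # bounds for (infit, outfit)
print(f"Infit  95% interval under the model: [{lo[0]:.2f}, {hi[0]:.2f}]")
print(f"Outfit 95% interval under the model: [{lo[1]:.2f}, {hi[1]:.2f}]")
# Observed infit/outfit outside these bounds would flag misfit, instead of the
# conventional fixed 0.8-1.2 rule of thumb.
```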
Teacher Assessment Literacy: Implications for Diagnostic Assessment Systems
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-01-02 | DOI: 10.1080/08957347.2022.2034823 | Vol. 35(1), pp. 17–32
Amy K. Clark, Brooke L. Nash, Meagan Karvonen
Abstract: Assessments scored with diagnostic models are increasingly popular because they provide fine-grained information about student achievement. Because of differences in how diagnostic assessments are scored and how results are used, the information teachers must know to interpret and use results may differ from concepts traditionally included in assessment literacy trainings for assessments that produce a raw or scale score. In this study, we connect assessment literacy and score reporting literature to understand teachers’ assessment literacy in a diagnostic assessment context as demonstrated by responses to focus groups and surveys. Results summarize teachers’ descriptions of fundamental diagnostic assessment concepts, understanding of the diagnostic assessment and results produced, and how diagnostic assessment results influence their instructional decision-making. Teachers understood how to use results and were comfortable using the term mastery when interpreting score report contents and planning next instruction. However, teachers were unsure how mastery was calculated and some misinterpreted mastery as representing a percent correct rather than a probability value. We share implications for others implementing large-scale diagnostic assessments or designing score reports for these systems.
Citations: 4
Analyzing Student Response Processes to Evaluate Success on a Technology-Based Problem-Solving Task
IF 1.5 | CAS Quartile 4 (Education) | JCR Q3 (Education & Educational Research) | Pub Date: 2022-01-02 | DOI: 10.1080/08957347.2022.2034821 | Vol. 35(1), pp. 33–45
Yuting Han, M. Wilson
Abstract: A technology-based problem-solving test can automatically capture all the actions of students when they complete tasks and save them as process data. Response sequences are the external manifestations of students’ latent intellectual activities, and they contain rich information about students’ abilities and different problem-solving strategies. This study adopted mixture Rasch measurement models (MRMs) to analyze success on technology-based tasks while automatically classifying different response patterns based on characteristics of the response process. The Olive Oil task from the Assessment and Teaching of 21st Century Skills project (ATC21S) is taken as an example to illustrate the use of MRMs and the interpretation of the process data.
Citations: 2
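The classification idea behind a mixture Rasch model can be shown in a few lines: given class-specific item difficulties standing in for two problem-solving strategies, a student's scored response pattern yields a posterior class membership via Bayes' rule. In the sketch below the class proportions, difficulties, and proficiency are fixed, hypothetical values; the article estimates all of these jointly from the ATC21S process data, so this is only the classification step.

```python
# Posterior class membership under a two-class mixture Rasch model with fixed,
# hypothetical parameters; only the Bayes classification step is illustrated.
import numpy as np

def rasch_loglik(x, theta, b):
    """Log-likelihood of a dichotomous response vector x under a Rasch model."""
    p = 1 / (1 + np.exp(-(theta - b)))
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Hypothetical class-specific difficulties for 6 scored steps of the task:
# class 0 finds the later steps hard, class 1 finds the earlier steps hard.
b_class = np.array([[-1.0, -0.5, 0.0, 0.5, 1.0, 1.5],
                    [ 1.5,  1.0, 0.5, 0.0, -0.5, -1.0]])
prior = np.array([0.6, 0.4])       # latent class proportions (assumed known here)
theta = 0.2                        # student's proficiency (assumed known here)

x = np.array([1, 1, 1, 0, 0, 0])   # observed scored response pattern
loglik = np.array([rasch_loglik(x, theta, b_class[k]) for k in range(2)])
post = prior * np.exp(loglik - loglik.max())
post /= post.sum()
print("Posterior class membership:", np.round(post, 3))
```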