
Journal of Educational Measurement: Latest Publications

Gender Bias in Test Item Formats: Evidence from PISA 2009, 2012, and 2015 Math and Reading Tests
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-06-09 | DOI: 10.1111/jedm.12372
Benjamin R. Shear

Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of 15-year-olds in the United States participating in the 2009, 2012, and 2015 PISA math and reading tests, there are consistent item-format-by-gender differences. On average, male students answer multiple-choice items correctly relatively more often and female students answer constructed-response items correctly relatively more often. These patterns were consistent across 34 additional participating PISA jurisdictions, although the size of the format differences varied and was larger on average in reading than in math. The average magnitude of the format differences is not large enough to be flagged in routine differential item functioning analyses intended to detect test bias, but it is large enough to raise questions about the validity of inferences based on comparisons of scores across gender groups. Researchers and other test users should account for test item format, particularly when comparing scores across gender groups.
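As a hedged illustration of the contrast the abstract reports, the base-R sketch below computes a format-by-gender interaction from item-level proportions correct; all data and ranges are invented for illustration, not taken from PISA.

```r
# Hypothetical sketch (not the paper's analysis): quantify an
# item-format-by-gender interaction from item-level proportions correct.
set.seed(1)
items <- data.frame(
  format   = rep(c("MC", "CR"), each = 20),  # multiple-choice vs. constructed-response
  p_male   = c(runif(20, .55, .75), runif(20, .45, .65)),
  p_female = c(runif(20, .50, .70), runif(20, .50, .70))
)
items$gap <- items$p_male - items$p_female   # per-item male-female difference

# Format-by-gender interaction: mean gender gap on MC items minus mean gap
# on CR items. A positive value mirrors the reported pattern.
mean(items$gap[items$format == "MC"]) - mean(items$gap[items$format == "CR"])
```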

Citations: 0
Detecting Differential Item Functioning in CAT Using IRT Residual DIF Approach
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-04-28 | DOI: 10.1111/jedm.12366
Hwanggyu Lim, Edison M. Choe

The residual differential item functioning (RDIF) detection framework was developed recently under a linear testing context. To explore the potential application of this framework to computerized adaptive testing (CAT), the present study investigated the utility of the RDIF_R statistic both as an index for detecting uniform DIF of pretest items in CAT and as a direct measure of the effect size of uniform DIF. Extensive CAT simulations revealed RDIF_R to have well-controlled Type I error and slightly higher power to detect uniform DIF compared with CATSIB, especially when pretest items were calibrated using fixed-item parameter calibration. Moreover, RDIF_R accurately estimated the amount of uniform DIF irrespective of the presence of impact. Therefore, RDIF_R demonstrates its potential as a useful tool for evaluating both the statistical and practical significance of uniform DIF in CAT.
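As a hedged sketch of the general residual-DIF idea (observed response minus model-implied probability, contrasted across groups), the base-R code below computes a raw-residual contrast for one studied item under a 2PL. It is not the authors' exact RDIF_R formula, and the item parameters and DIF size are invented.

```r
# Hedged sketch of a raw-residual DIF contrast in the spirit of RDIF_R
# (not the authors' exact statistic). One studied item under a 2PL.
p2pl <- function(theta, a, b) 1 / (1 + exp(-a * (theta - b)))

set.seed(2)
n <- 1000
group <- rep(c("ref", "foc"), each = n / 2)
theta <- rnorm(n)
a <- 1.2; b <- 0.1                      # reference-group item parameters
p <- p2pl(theta, a, b)
p[group == "foc"] <- p2pl(theta[group == "foc"], a, b + 0.4)  # uniform DIF
u <- rbinom(n, 1, p)

# Residuals under the reference calibration (true thetas used for simplicity;
# in practice these would be estimates, e.g., MLE scores).
res <- u - p2pl(theta, a, b)
mean(res[group == "foc"]) - mean(res[group == "ref"])  # negative: harder for focal
```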

Citations: 0
Controlling the Speededness of Assembled Test Forms: A Generalization to the Three-Parameter Lognormal Response Time Model
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-04-27 | DOI: 10.1111/jedm.12364
Benjamin Becker, Sebastian Weirich, Frank Goldhammer, Dries Debeer

When designing or modifying a test, an important challenge is controlling its speededness. To achieve this, van der Linden (2011a, 2011b) proposed using a lognormal response time model, more specifically the two-parameter lognormal model, and automated test assembly (ATA) via mixed integer linear programming. However, this approach has a severe limitation: the two-parameter lognormal model lacks a slope parameter, which means it assumes that all items are equally speed sensitive. From a conceptual perspective, this assumption seems very restrictive. Furthermore, various other empirical studies and new data analyses performed by us show that this assumption almost never holds in practice. To overcome this shortcoming, we bring together the already frequently used three-parameter lognormal model for response times, which contains a slope parameter, and van der Linden's ATA approach for controlling speededness. The proposed extension is demonstrated with multiple empirically based illustrations, including complete and documented R code. Both the original van der Linden approach and our newly proposed approach are available to practitioners in the freely available R package eatATA.
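A minimal base-R sketch of the model the abstract describes, assuming the three-parameter lognormal form ln T_ij ~ N(beta_i - phi_i * tau_j, 1/alpha_i^2), where the slope phi_i is what the two-parameter model fixes at 1. The parameter values and time limit are hypothetical, and this speededness check is far simpler than the eatATA workflow.

```r
# Minimal sketch, assuming ln T_ij ~ N(beta_i - phi_i * tau_j, 1 / alpha_i^2);
# the two-parameter model is the special case phi_i = 1. Values hypothetical.
sim_total_time <- function(tau, alpha, beta, phi) {
  log_t <- rnorm(length(beta), mean = beta - phi * tau, sd = 1 / alpha)
  sum(exp(log_t))                        # total seconds on the assembled form
}

set.seed(3)
alpha <- runif(30, 1.5, 2.5)             # time-discrimination per item
beta  <- runif(30, 3.5, 4.5)             # time-intensity (log-seconds)
phi   <- runif(30, 0.6, 1.4)             # speed sensitivity (the added slope)

# Speededness check: how often would a slow examinee (speed tau at the 5th
# percentile) exceed a 45-minute limit on this 30-item form?
tau_slow <- qnorm(.05, mean = 0, sd = .3)
times <- replicate(2000, sim_total_time(tau_slow, alpha, beta, phi))
mean(times > 45 * 60)
```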

Citations: 0
A Note on Latent Traits Estimates under IRT Models with Missingness
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-04-26 | DOI: 10.1111/jedm.12365
Jinxin Guo, Xin Xu, Tao Xin

Missingness due to not-reached items and omitted items has received much attention in the recent psychometric literature. Such missingness, if not handled properly, leads to biased parameter estimation and inaccurate inferences about examinees, and further erodes the validity of the test. This paper reviews some commonly used IRT-based models allowing missingness, followed by three popular examinee scoring methods: maximum likelihood estimation, maximum a posteriori, and expected a posteriori. Simulation studies were conducted to compare these examinee scoring methods across the commonly used models in the presence of missingness. Results showed that all the methods could infer examinees' ability accurately when the missingness is ignorable. If the missingness is nonignorable, incorporating the missing responses improves the precision of ability estimates for examinees with missingness, especially when the test length is short. In terms of examinee scoring methods, the expected a posteriori method performed better for evaluating latent traits under models allowing missingness. An empirical study based on the PISA 2015 Science Test was also performed.
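To make the scoring comparison concrete, here is a minimal base-R expected a posteriori (EAP) sketch under a 2PL with normal-prior quadrature, contrasting two common treatments of missing responses; the item parameters and response pattern are hypothetical.

```r
# Minimal EAP sketch under a 2PL, comparing two treatments of missingness.
eap <- function(u, a, b, nodes = seq(-4, 4, length.out = 61)) {
  keep <- !is.na(u)                       # ignorable missingness: skip NAs
  lik <- sapply(nodes, function(q) {
    p <- 1 / (1 + exp(-a[keep] * (q - b[keep])))
    prod(p^u[keep] * (1 - p)^(1 - u[keep]))
  })
  post <- lik * dnorm(nodes)              # standard-normal prior
  sum(nodes * post) / sum(post)           # posterior mean of theta
}

a <- rep(1.2, 20); b <- seq(-2, 2, length.out = 20)
u <- c(rep(1, 8), rep(0, 4), rep(NA, 8))  # last eight items not reached

eap(u, a, b)                              # missingness treated as ignorable
u0 <- replace(u, is.na(u), 0)
eap(u0, a, b)                             # missing responses scored incorrect
```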

Citations: 0
Online Monitoring of Test-Taking Behavior Based on Item Responses and Response Times
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-04-17 | DOI: 10.1111/jedm.12367
Suhwa Han, Hyeon-Ah Kang

The study presents multivariate sequential monitoring procedures for examining test-taking behaviors online. The procedures monitor examinees' responses and response times and signal aberrancy as soon as significant change is detected in the test-taking behavior. In particular, the study proposes three schemes to track different indicators of a test-taking mode—the observable manifest variables, latent trait variables, and measurement likelihood. For each procedure, sequential sampling strategies are presented to implement online monitoring. Numerical experimentation based on simulated data suggests that the proposed procedures demonstrate adequate performance. The procedures identified examinees with aberrant behaviors with high detection power and timeliness, while keeping error rates reasonably small. Experimental application to real data also suggested that the procedures have practical relevance to real assessments. Based on observations from the empirical analysis, the study discusses implications and guidelines for practical use.
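As one hedged ingredient of such monitoring (not the authors' multivariate procedure), the base-R sketch below runs a one-sided CUSUM on standardized log response times that signals once cumulative evidence of speeding crosses a control limit; the reference parameters and change point are invented.

```r
# One-sided CUSUM on standardized log response times: accumulate evidence of
# faster-than-expected responding and signal when it exceeds limit h.
cusum_speed <- function(log_rt, mu, sigma, k = 0.5, h = 4) {
  s <- 0
  for (j in seq_along(log_rt)) {
    z <- (mu[j] - log_rt[j]) / sigma[j]   # large z = faster than expected
    s <- max(0, s + z - k)                # accumulate beyond allowance k
    if (s > h) return(j)                  # signal at item j
  }
  NA                                      # no signal during the test
}

set.seed(4)
mu <- rep(4, 40); sigma <- rep(0.5, 40)   # expected log RTs for one examinee
rt <- rnorm(40, mu, sigma)
rt[21:40] <- rnorm(20, mu[21:40] - 1, sigma[21:40])  # speeding from item 21 on
cusum_speed(rt, mu, sigma)                # flags soon after the behavior change
```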

Citations: 0
Detecting Group Collaboration Using Multiple Correspondence Analysis
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-03-23 | DOI: 10.1111/jedm.12363
Joseph H. Grochowalski, Amy Hendrickson

Test takers wishing to gain an unfair advantage often share answers with other test takers, either sharing all answers (a full key) or some (a partial key). Detecting key sharing during a tight testing window requires an efficient, easily interpretable, and rich form of analysis that is descriptive and inferential. We introduce a detection method based on multiple correspondence analysis (MCA) that identifies test takers with unusual response similarities. The method simultaneously detects multiple shared keys (partial or full), plots results, and is computationally efficient as it requires only matrix operations. We describe the method, evaluate its detection accuracy under various simulation conditions, and demonstrate the procedure on a real data set with known test-taking misbehavior. The simulation results showed that the MCA method had reasonably high power under realistic conditions and maintained the nominal false-positive level, except when the group size was very large or partial shared keys had more than 50% of the items. The real data analysis illustrated visual detection procedures and inference about the item responses possibly shared in the key, which was likely shared among 91 test takers, many of whom were confirmed by nonstatistical investigation to have engaged in test-taking misconduct.
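The base-R sketch below works through the core MCA algebra (an SVD of the standardized indicator matrix) and flags the pair of examinees with the most unusual response similarity. The data are simulated with one planted copying pair; this is a skeleton of the idea, not the authors' full detection procedure.

```r
# Minimal MCA sketch: correspondence-analysis SVD of the indicator matrix,
# then pairwise distances between examinee row coordinates.
set.seed(5)
n <- 200; m <- 30
resp <- matrix(sample(LETTERS[1:4], n * m, replace = TRUE), n, m)
resp[2, ] <- resp[1, ]                        # examinee 2 copies examinee 1

Z <- do.call(cbind, lapply(1:m, function(j)   # dummy-code each item's options
  outer(resp[, j], sort(unique(resp[, j])), "==") * 1))
P <- Z / sum(Z)
r <- rowSums(P); cc <- colSums(P)
S <- diag(1 / sqrt(r)) %*% (P - outer(r, cc)) %*% diag(1 / sqrt(cc))
sv <- svd(S)
coords <- diag(1 / sqrt(r)) %*% sv$u[, 1:5] %*% diag(sv$d[1:5])  # 5 dimensions

d <- as.matrix(dist(coords))                  # examinee-by-examinee distances
diag(d) <- Inf
which(d == min(d), arr.ind = TRUE)[1, ]       # closest pair: examinees 1 and 2
```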

Citations: 0
Pretest Item Calibration in Computerized Multistage Adaptive Testing
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-03-10 | DOI: 10.1111/jedm.12361
Rabia Karatoprak Ersen, Won-Chan Lee

The purpose of this study was to compare calibration and linking methods, in terms of item parameter recovery, for placing pretest item parameter estimates on the item pool scale in a 1-3 computerized multistage adaptive testing design. Two models were used: embedded-section, in which pretest items were administered within a separate module, and embedded-items, in which pretest items were distributed across operational modules. The calibration methods were separate calibration with linking (SC) and fixed calibration (FC), with three parallel approaches under each (FC-1 and SC-1; FC-2 and SC-2; FC-3 and SC-3). The FC-1 and SC-1 approaches used only operational items in the routing module to link pretest items. The FC-2 and SC-2 approaches also used only operational items in the routing module for linking, but in addition, the operational items in second-stage modules were freely estimated. The FC-3 and SC-3 approaches used operational items in all modules to link pretest items. The third calibration approach (i.e., FC-3 and SC-3) yielded the best results. For all three approaches, SC outperformed FC in all study conditions, which varied module length, sample size, and examinee distribution.
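As a hedged sketch of the linking step that SC-style methods require, the base-R code below applies mean/sigma linking on the b-parameters of common items and then transports pretest items to the pool scale. Mean/sigma is one standard choice, not necessarily the method used in the study, and all values are hypothetical.

```r
# Mean/sigma linking: find the linear transformation that puts a new
# calibration onto the pool scale, using items estimated on both scales.
link_mean_sigma <- function(b_new, b_old) {
  A <- sd(b_old) / sd(b_new)               # slope of the scale transformation
  B <- mean(b_old) - A * mean(b_new)       # intercept
  c(A = A, B = B)
}

b_pool <- c(-1.2, -0.4, 0.1, 0.8, 1.5)     # common items, pool-scale estimates
b_cal  <- c(-1.0, -0.2, 0.3, 1.1, 1.8)     # same items, new-calibration scale

k <- link_mean_sigma(b_cal, b_pool)
b_pretest <- c(-0.5, 0.6)                  # pretest items on the new scale
unname(k["A"] * b_pretest + k["B"])        # pretest items on the pool scale
```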

Citations: 1
Classical Item Analysis from a Signal Detection Perspective
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-02-27 | DOI: 10.1111/jedm.12358
Lawrence T. DeCarlo

A conceptualization of multiple-choice exams in terms of signal detection theory (SDT) leads to simple measures of item difficulty and item discrimination that are closely related to, but also distinct from, those used in classical item analysis (CIA). The theory defines a "true split," depending on whether or not examinees know an item, and so it provides a basis for using total scores to split item tables, as done in CIA, while also clarifying benefits and limitations of the approach. The SDT item difficulty and discrimination measures differ from those used in CIA in that they explicitly consider the role of distractors and avoid limitations due to range restrictions. A new screening measure is also introduced. The measures are theoretically well-grounded and simple to compute, whether by hand or with standard software for choice models; simulations show that they offer advantages over traditional measures.
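In the spirit of (though not identical to) the SDT measures described, the base-R sketch below splits examinees on total score and expresses item difficulty and discrimination on a normal-deviate scale rather than a raw-proportion scale; the data are simulated and the clamping constant is an implementation convenience.

```r
# Normal-deviate item statistics from a total-score split, d'-style.
sdt_item_stats <- function(u, total) {
  hi <- total >= median(total)                 # upper vs. lower score group
  cl <- function(p) pmin(pmax(p, .005), .995)  # keep qnorm() finite
  z_hi <- qnorm(cl(mean(u[hi])))
  z_lo <- qnorm(cl(mean(u[!hi])))
  c(discrimination = z_hi - z_lo,              # d'-like group separation
    difficulty     = -(z_hi + z_lo) / 2)       # larger values = harder item
}

set.seed(6)
theta <- rnorm(500)
U <- sapply(seq(-1.5, 1.5, length.out = 15),   # 15 items of varying difficulty
            function(b) rbinom(500, 1, plogis(1.3 * (theta - b))))
t(apply(U, 2, sdt_item_stats, total = rowSums(U)))
```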

Citations: 0
Corrigendum: A Residual-Based Differential Item Functioning Detection Framework in Item Response Theory
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-02-26 | DOI: 10.1111/jedm.12362
Hwanggyu Lim, Edison M. Choe, Kyung T. Han

In the original article, it was written that "Then the MLE scoring and DIF analysis with RDIF statistics were performed using the est_score and rdif functions, respectively, in the R (R Core Team, 2019) package irtplay (p. 90)." However, the irtplay package has been removed from the CRAN repository due to intellectual property (IP) violation issues. Instead, a new R package called irtQ (Lim & Wells, 2023) has been released as a successor to irtplay. All IP issues have been resolved in irtQ, ensuring that the package is compliant with industry standards. The original article is available at https://doi.org/10.1111/jedm.12313.

The same est_score and rdif functions used in the original study are also included in irtQ, so it can be used as a drop-in replacement for irtplay. We apologize for any confusion caused by the previous version of the article.

Citations: 0
Using Linkage Sets to Improve Connectedness in Rater Response Model Estimation
IF 1.3 | Psychology (JCR Q1, CAS Tier 4) | Pub Date: 2023-02-19 | DOI: 10.1111/jedm.12360
Jodi M. Casabianca, John R. Donoghue, Hyo Jeong Shin, Szu-Fu Chao, Ikkyu Choi

Using item-response theory to model rater effects provides an alternative to standard performance metrics for rater monitoring and diagnosis. To fit such models, the ratings data must be sufficiently connected to allow estimation of rater effects. Because of the rating designs popular in large-scale testing scenarios, there tends to be a large proportion of missing data, yielding sparse matrices and estimation issues. In this article, we explore the impact of different types of connectedness, or linkage, brought about by using a linkage set—a collection of responses scored by most or all raters. We also explore the impact of the properties and composition of the linkage set, the different connectedness yielded by different rating designs, and the role of scores from automated scoring engines. In designing monitoring systems using the rater response version of the generalized partial credit model, the study results suggest use of a linkage set, especially a large one composed of responses representing the full score scale. Results also show that a double-human-scoring design provides more connectedness than a design with one human and an automated scoring engine. Furthermore, scores from automated scoring engines do not provide adequate connectedness. We discuss considerations for operational implementation and further study.
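To make the connectedness requirement concrete, the base-R sketch below treats raters and responses as nodes of a bipartite graph and checks, via breadth-first search, whether the scored pairs form a single connected component; the toy design shows two disconnected rater groups being joined by a linkage-set response. This illustrates the concept only, not the authors' estimation machinery.

```r
# Connectedness check: raters and responses as nodes of a bipartite graph;
# each scored (rater, response) pair is an edge. BFS tests whether all nodes
# sit in one connected component.
connected <- function(pairs) {
  rn <- paste0("r", pairs$rater); sn <- paste0("s", pairs$response)
  nodes <- unique(c(rn, sn))
  adj <- c(split(sn, rn), split(rn, sn))   # neighbor lists for each node
  seen <- frontier <- nodes[1]
  while (length(frontier)) {
    nxt <- setdiff(unlist(adj[frontier]), seen)
    seen <- c(seen, nxt); frontier <- nxt
  }
  length(seen) == length(nodes)
}

# Two rater groups that never score a common response: not connected ...
connected(data.frame(rater = c(1, 1, 2, 3, 3, 4),
                     response = c(101, 102, 102, 201, 202, 202)))  # FALSE
# ... until a linkage-set response (900) scored across groups joins them.
connected(data.frame(rater = c(1, 1, 2, 3, 3, 4, 2, 3),
                     response = c(101, 102, 102, 201, 202, 202, 900, 900)))  # TRUE
```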

Citations: 1