
Journal of Educational Measurement: Latest Publications

Automatic Prompt Engineering for Automatic Scoring
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-08-17 | DOI: 10.1111/jedm.70002
Mingfeng Xue, Yunting Liu, Xingyao Xiao, Mark Wilson

Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials. APE was then applied to optimize prompts for each item. We found that on average, APE increased scoring accuracy by 9%; few-shot learning (i.e., giving multiple labeled examples related to the goal) increased APE performance by 2%; a high temperature (i.e., a parameter for output randomness) was needed in at least part of the APE to improve the scoring accuracy; Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Moreover, compared with the manual scoring instructions, APE tended to restate and reformat the scoring prompts, which could give rise to concerns about validity. Thus, the creative variability introduced by LLMs raises considerations about the balance between innovation and adherence to scoring rubrics.
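As a minimal, hedged sketch of the evaluation step described above (not the authors' APE pipeline), the snippet below scores LLM output against human labels with exact-agreement accuracy and QWK and shows a simple prompt-selection loop; `score_with_prompt` and all score values are illustrative placeholders, not a real API.

```python
# Minimal sketch: evaluate LLM-assigned scores against human ratings with
# exact-agreement accuracy and Quadratic Weighted Kappa (QWK), the two metrics
# discussed above. The score vectors are illustrative.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_scores = [0, 1, 2, 2, 3, 1, 0, 2, 3, 1]   # human ratings (treated as true labels)
llm_scores   = [0, 1, 2, 1, 3, 1, 0, 2, 2, 1]   # LLM ratings under a candidate prompt

print("accuracy =", accuracy_score(human_scores, llm_scores))
print("QWK      =", cohen_kappa_score(human_scores, llm_scores, weights="quadratic"))

# Hypothetical APE-style selection step: keep the candidate prompt whose scores
# agree best with the human labels on a development set. `score_with_prompt`
# stands in for an LLM call and is not a real API.
def select_best_prompt(candidate_prompts, responses, human_scores, score_with_prompt):
    best_prompt, best_qwk = None, -1.0
    for prompt in candidate_prompts:
        preds = [score_with_prompt(prompt, response) for response in responses]
        qwk = cohen_kappa_score(human_scores, preds, weights="quadratic")
        if qwk > best_qwk:
            best_prompt, best_qwk = prompt, qwk
    return best_prompt, best_qwk
```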

Citations: 0
A Topic Testlet Model for Calibrating Testlet Constructed Responses
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-08-07 | DOI: 10.1111/jedm.70001
Jiawei Xiong, Huan (Hailey) Kuang, Cheng Tang, Qidi Liu, Bowen Wang, George Engelhard Jr., Allan S. Cohen, Xinhui (Maggie) Xiong, Rufei Sheng

Constructed responses (CRs) within testlets are widely used to assess complex skills but can pose calibration challenges due to local item dependence. A few current testlet models incorporate testlet-specific effects to address local dependence but struggle with interpreting these effects and may not fully capture the complexities of CR items because they rely only on response or score patterns. We propose a Topic Testlet Model (TTM) that integrates topic modeling within a psychometric framework. It uses latent topics from student written responses to adjust for local dependence, enable simultaneous calibration, and provide insights into evaluating student reasoning and writing in testlet CR items. Using empirical data from both English Language Arts and Science assessments for grades 3-12, we compare the TTM with existing models in terms of ability estimates, item parameter estimates, and overall model fit. Simulation studies further demonstrate parameter recovery under various testing scenarios. Results show that the TTM effectively accounts for local dependence, improves testlet effect interpretability, and demonstrates a better fit than the existing models. The TTM advances CR testlet calibration, leveraging additional information from student written responses to improve the precision of the assessment systems and the validity of the use of test scores.
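The topic-modeling ingredient of the TTM can be illustrated with a short, hedged sketch: latent-topic proportions are extracted from written responses (here with scikit-learn's LDA on toy responses) and could then feed a testlet-effect adjustment; the joint psychometric calibration the article proposes is not reproduced.

```python
# Minimal sketch of the topic-modeling ingredient only: extract latent-topic
# proportions from written responses with LDA. The toy responses and the
# two-topic setting are illustrative; the TTM's joint IRT calibration is not shown.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

responses = [
    "the plant needs sunlight and water to make its food",
    "energy from the sun is converted during photosynthesis",
    "the character changes her mind after the argument",
    "the author uses dialogue to show the conflict",
]

X = CountVectorizer(stop_words="english").fit_transform(responses)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)   # per-response topic proportions (rows sum to 1)

print(theta.round(2))          # these proportions could enter a testlet-effect adjustment
```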

Citations: 0
How Many Plausible Values?
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-07-03 | DOI: 10.1111/jedm.70000
Paul A. Jewsbury, Daniel F. McCaffrey, Yue Jia, Eugenio J. Gonzalez

Large-scale survey assessments (LSAs) such as NAEP, TIMSS, PIRLS, IELS, and NAPLAN produce plausible values of student proficiency for estimating population statistics. Plausible values are imputed values for latent proficiency variables. While prominently used for LSAs, they are applicable to a wide range of latent variable modelling contexts such as surveys about psychological dispositions or beliefs. Following the practice of multiple imputation, LSAs produce multiple sets of plausible values for each survey. The criteria used to determine the number of plausible values remain unresolved and are applied inconsistently in practice. We show analytically and via simulation that the number of plausible values used determines the amount of Monte Carlo error in point estimates and standard errors, as a function of the fraction of missing information. We derive expressions to determine the number of plausible values required to reach a given level of precision. We analyze real data from an LSA to provide guidelines, supported by theory, simulation, and real data, on the number of plausible values. Finally, we illustrate the impact with a power analysis. Our results show there is meaningful benefit to using more plausible values than LSAs currently generate.
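The role of the number of plausible-value sets M can be illustrated with the standard multiple-imputation combining rules; the sketch below uses made-up per-set estimates and is not the paper's derivation.

```python
# Minimal sketch: pool a statistic computed on each of M plausible-value sets
# with the usual multiple-imputation combining rules. The per-set estimates and
# sampling variances are illustrative numbers only.
import numpy as np

est = np.array([251.3, 250.8, 251.9, 251.1, 250.6])   # statistic per plausible-value set
var = np.array([1.10, 1.05, 1.12, 1.08, 1.07])        # sampling variance per set
M = len(est)

q_bar = est.mean()                  # pooled point estimate
u_bar = var.mean()                  # within-imputation variance
b = est.var(ddof=1)                 # between-imputation variance
t = u_bar + (1 + 1 / M) * b         # total variance
fmi = (1 + 1 / M) * b / t           # fraction of missing information

print(f"estimate = {q_bar:.2f}, SE = {t ** 0.5:.3f}, FMI = {fmi:.2f}")
# The Monte Carlo noise from using only M sets enters through terms of order
# b / M, so a larger M (especially when the FMI is high) stabilizes both the
# point estimate and its standard error.
```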

Citations: 0
Parametric Bootstrap Mantel-Haenszel Statistic for Aggregated Testlet Effects
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-06-17 | DOI: 10.1111/jedm.12440
Youn Seon Lim

While testlets have proven useful for assessing complex skills, the stem shared by multiple items often induces correlations between responses, leading to violations of local independence (LI), which can result in biased parameter and ability estimates. Diagnostic procedures for detecting testlet effects typically involve model comparisons testing for the inclusion of extra testlet parameters or, at the item level, testing for pairwise LI. Rosenbaum's adaptation of the Mantel-Haenszel (MH) χ²-statistic belongs to the latter category. The MH χ²-statistic has also been used in cognitive diagnosis for detecting violations of LI and for the identification of testlet effects. However, this approach is not without limitations, as it lacks a rationale for integrating multiple pairwise MH χ²-statistics and any notion of the sampling distribution of such an integrated statistic. In this article, a procedure for integrating multiple pairwise MH χ²-statistics to evaluate testlet effects in cognitive diagnosis is proposed. The unknown sampling distribution issue is addressed by implementing a parametric bootstrap resampling scheme. Results from simulation studies demonstrate the performance of the proposed parametric bootstrap testlet MH χ²-statistic, and its application to the 2015 PISA Collaborative Problem Solving (CPS) data set illustrates the method's practical merits.
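A hedged sketch of the ingredients named in the abstract follows: a pairwise MH χ²-statistic computed across strata of a matching variable, its sum over item pairs within a testlet, and a parametric bootstrap for the reference distribution. `simulate_under_fitted_model` is a placeholder for simulating data from the fitted cognitive diagnosis model and is not a real API; the article's exact procedure may differ.

```python
# Minimal sketch: pairwise Mantel-Haenszel chi-square (with continuity
# correction) for two dichotomous items stratified by a matching variable
# (e.g., rest score or estimated attribute profile), aggregated over item
# pairs within a testlet, with a parametric bootstrap reference distribution.
import numpy as np

def mh_chi2(item_i, item_j, strata):
    """Mantel-Haenszel chi-square across strata for two 0/1-scored items."""
    a_sum = e_sum = v_sum = 0.0
    for s in np.unique(strata):
        x = item_i[strata == s]
        y = item_j[strata == s]
        a = np.sum((x == 1) & (y == 1)); b = np.sum((x == 1) & (y == 0))
        c = np.sum((x == 0) & (y == 1)); d = np.sum((x == 0) & (y == 0))
        n = a + b + c + d
        if n < 2:
            continue
        a_sum += a
        e_sum += (a + b) * (a + c) / n
        v_sum += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum if v_sum > 0 else 0.0

def aggregated_mh(responses, testlet_items, strata):
    """Sum of pairwise MH chi-squares over all item pairs within one testlet."""
    total = 0.0
    for idx, i in enumerate(testlet_items):
        for j in testlet_items[idx + 1:]:
            total += mh_chi2(responses[:, i], responses[:, j], strata)
    return total

def bootstrap_p_value(observed, fitted_model, testlet_items, strata,
                      simulate_under_fitted_model, B=500):
    """Parametric bootstrap reference distribution for the aggregated statistic."""
    boot = np.array([
        aggregated_mh(simulate_under_fitted_model(fitted_model), testlet_items, strata)
        for _ in range(B)
    ])
    return np.mean(boot >= observed)
```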

Citations: 0
Linking Error on Achievement Levels Accounting for Dependencies and Complex Sampling
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-06-15 | DOI: 10.1111/jedm.12439
Paul A. Jewsbury

Alternate assessments of the same construct, or assessments that have undergone a change in the conditions of measurement, are often linked in an attempt to establish score comparability. As the link must be estimated from the data, linking contributes error variance to estimators. We propose a novel method to account for linking variance in standard error estimation for achievement or proficiency levels, a primary outcome for many international, national, and U.S. state assessments. Achievement levels are proportions of a population within some range of ability, such as the proportion of the population classified as proficient or advanced. The method is validated in a simulation and with real data. Our method allows for sampling weights and complex sampling and involves an easily calculated correction term that may be added to conventional estimates of the error variance, correcting those estimates for neglecting the variance due to linking. Furthermore, the method accounts for dependencies between linking and other sources of variance, allowing it to be applied to a much wider range of score comparisons than traditional methods of linking variance estimation.
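As a generic, hedged illustration of the bookkeeping only (the paper's actual correction term, its dependency adjustments, and the complex-sampling machinery are not reproduced), the sketch below adds an assumed linking-variance component to a conventional variance estimate for a weighted achievement-level proportion; all numbers and the simple weighted estimator are illustrative.

```python
# Generic illustration: add a linking-variance term to a conventional error
# variance for an achievement-level proportion. All quantities are made up.
import numpy as np

weights = np.array([1.2, 0.8, 1.0, 1.5, 0.9, 1.1])   # sampling weights
at_or_above = np.array([1, 0, 1, 1, 0, 1])           # 1 = at/above the cut score

p_hat = np.average(at_or_above, weights=weights)      # weighted proportion at/above

var_sampling = 0.0007   # conventional (e.g., replicate-based) variance estimate
var_linking = 0.0002    # assumed variance contributed by the linking step

se_conventional = np.sqrt(var_sampling)
se_corrected = np.sqrt(var_sampling + var_linking)    # conventional estimate plus correction

print(f"p = {p_hat:.3f}, SE (no linking term) = {se_conventional:.4f}, "
      f"SE (with linking term) = {se_corrected:.4f}")
```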

Citations: 0
Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System
IF 1.4 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-06-01 | DOI: 10.1111/jedm.12438
Wallace N. Pinto Jr, Jinnie Shin

In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates the use of attribution scores in ASAG systems, focusing on their consistency in reflecting model decisions. Specifically, we examined how attribution scores generated by different methods—namely Local Interpretable Model-agnostic Explanations (LIME), Integrated Gradients (IG), Hierarchical Explanation via Divisive Generation (HEDGE), and Leave-One-Out (LOO)—compare in their consistency and ability to illustrate the decision-making processes of transformer-based scoring systems trained on a publicly available response dataset. Additionally, we analyzed how attribution scores varied across different scoring categories in a polytomously scored response dataset and across two transformer-based scoring model architectures: Bidirectional Encoder Representations from Transformers (BERT) and Decoding-enhanced BERT with Disentangled Attention (DeBERTa-v2). Our findings highlight the challenges in evaluating explainability metrics, with important implications for both high-stakes and formative assessment contexts. This study contributes to the development of more reliable and transparent ASAG systems.
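One simple consistency check in this spirit is to rank-correlate token-level attribution scores from two methods for the same scored response; the sketch below uses illustrative vectors and does not reproduce the study's LIME/IG/HEDGE/LOO protocol or its BERT and DeBERTa-v2 models.

```python
# Minimal sketch: quantify agreement between two attribution methods for one
# response by rank-correlating their token-level attribution scores.
# All token and score values are illustrative.
from scipy.stats import spearmanr

tokens      = ["the", "cell", "membrane", "controls", "what", "enters"]
lime_scores = [0.01, 0.42, 0.55, 0.38, 0.02, 0.20]   # e.g., from a LIME-style method
ig_scores   = [0.03, 0.35, 0.61, 0.30, 0.05, 0.25]   # e.g., from an IG-style method

rho, p_value = spearmanr(lime_scores, ig_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```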

Citations: 0
Comparing and Combining IRTree Models and Anchoring Vignettes in Addressing Response Styles
IF 1.4 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-05-06 | DOI: 10.1111/jedm.12437
Mingfeng Xue, Ping Chen

Response styles pose serious threats to psychological measurement. This research compares IRTree models and anchoring vignettes in addressing response styles and estimating the target traits. It also explores the potential of combining them at the item level and the total-score level (ratios of extreme and middle responses to vignettes). Four models were evaluated: three multidimensional IRTree models that use vignette data to different extents, and a nominal response model (NRM) addressing extreme and midpoint response styles with item-level vignette responses. Simulation results indicated that the IRTree model using item-level vignette responses outperformed the others in estimating the target trait and response styles to varying extents, with performance improving as the number of vignettes increased. Empirical findings further demonstrated that models using item-level vignette information yielded higher reliability and closely aligned target trait estimates. These results underscore the value of integrating anchoring vignettes with IRTree models to enhance estimation accuracy and control for response styles.
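A common three-node IRTree decomposition of a 5-point Likert rating (midpoint, direction, extremity) can be sketched as below; the exact tree and the vignette-based variants used in the article may differ, so this only illustrates how one observed rating becomes several pseudo-items.

```python
# Minimal sketch of a common IRTree decomposition for a 5-point Likert response:
# a midpoint node, a direction node, and an extremity node, with structurally
# missing pseudo-items coded as None.
def irtree_nodes(response):
    """Map a rating in 1..5 to (midpoint, direction, extreme) pseudo-items."""
    midpoint = 1 if response == 3 else 0
    if midpoint:
        return midpoint, None, None             # other nodes are not reached
    direction = 1 if response > 3 else 0        # agree side vs. disagree side
    extreme = 1 if response in (1, 5) else 0    # extreme-response-style node
    return midpoint, direction, extreme

for r in [1, 2, 3, 4, 5]:
    print(r, irtree_nodes(r))
```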

Citations: 0
Validation for Personalized Assessments: A Threats-to-Validity Approach
IF 1.4 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-04-23 | DOI: 10.1111/jedm.12434
Sandip Sinharay, Randy E. Bennett, Michael Kane, Jesse R. Sparks

Personalized assessments are of increasing interest because of their potential to lead to more equitable decisions about the examinees. However, one obstacle to the widespread use of personalized assessments is the lack of a measurement toolkit that can be used to analyze data from these assessments. This article takes one step toward building such a toolkit by proposing a validation framework for personalized assessments. The framework is built on the threats-to-validity approach. We demonstrate applications of the suggested framework using the AP 3D Art and Design Portfolio examination and a more restrictive culturally relevant assessment as examples.

Citations: 0
Addressing Bias in Spoken Language Systems Used in the Development and Implementation of Automated Child Language-Based Assessment
IF 1.6 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-04-23 | DOI: 10.1111/jedm.12435
Alison L. Bailey, Alexander Johnson, Natarajan Balaji Shankar, Hariram Veeramani, Julie A. Washington, Abeer Alwan

This article addresses bias in Spoken Language Systems (SLS) that involve both Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) and reports experiments to improve the performance of SLS for automated language and literacy-related assessments with students who are underserved in the U.S. educational system. We frame bias in SLS in terms of testing fairness and validity, stemming in part from the exclusion of sufficiently large training datasets in varieties of English other than General American English (GAE). We adopt an Interpretation/Use Argument approach to validity focused on clarity of constructs and scoring accuracy. While SLS use ASR to automatically transcribe students' utterances, and apply NLP algorithms to ASR transcripts to measure students' speech samples, it is well documented in studies with adults that ASR is typically more problematic for African American English (AAE) speakers than for other groups due to differences in prosody, pronunciation, word usage, and grammar. We utilized child speech and text corpora to improve algorithms that score oral task responses for child AAE speakers and, in some experiments, children with oral language and reading difficulties. Favorable results provide impetus and possible solutions for fair and inclusive assessments for diverse student groups in the future.
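One diagnostic this line of work motivates is comparing ASR word error rates across speaker groups; the sketch below implements a standard edit-distance WER and applies it to illustrative transcript pairs, and is not the authors' evaluation pipeline.

```python
# Minimal sketch: word error rate (WER) per speaker group, computed with a
# standard word-level edit distance. Group names and transcripts are illustrative.
def wer(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(r)][len(h)] / max(len(r), 1)

pairs_by_group = {   # (reference transcript, ASR hypothesis) pairs per group
    "group_A": [("she was walking to the store", "she was walking to the store")],
    "group_B": [("he be going to the library", "he been going to library")],
}
for group, pairs in pairs_by_group.items():
    rates = [wer(ref, hyp) for ref, hyp in pairs]
    print(group, round(sum(rates) / len(rates), 2))
```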

Citations: 0
Using Multiple Maximum Exposure Rates in Computerized Adaptive Testing
IF 1.4 | CAS Tier 4, Psychology | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-04-16 | DOI: 10.1111/jedm.12436
Kylie Gorney, Mark D. Reckase

In computerized adaptive testing, item exposure control methods are often used to provide a more balanced usage of the item pool. Many of the most popular methods, including the restricted method (Revuelta and Ponsoda), use a single maximum exposure rate to limit the proportion of times that each item is administered. However, Barrada et al. showed that by using multiple maximum exposure rates, it is possible to obtain an even more balanced usage of the item pool. Therefore, in this paper, we develop four extensions of the restricted method that involve the use of multiple maximum exposure rates. A detailed simulation study reveals that (a) all four of the new methods improve item pool utilization and (b) three of the new methods also improve measurement accuracy. Taken together, these results are highly encouraging, as they reveal that it is possible to improve both types of outcomes simultaneously.
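The general shape of restricted-style exposure control with item-specific caps can be sketched as follows; the item pool, caps, and selection rule are illustrative, and the four extensions studied in the article are not reproduced.

```python
# Minimal sketch: at each selection step, pick the most informative item among
# items whose current exposure rate is below their cap. With multiple maximum
# exposure rates, different items (e.g., different content strata or information
# ranges) can receive different caps. All values below are illustrative.
def select_item(information, exposure_counts, tests_administered, r_max, administered):
    """Pick the eligible item with maximum information at the current theta."""
    best_item, best_info = None, float("-inf")
    for item, info in information.items():
        if item in administered:
            continue
        rate = exposure_counts.get(item, 0) / max(tests_administered, 1)
        if rate >= r_max[item]:          # item-specific maximum exposure rate
            continue
        if info > best_info:
            best_item, best_info = item, info
    return best_item

# Example: two caps, with a stricter cap on the highly informative items.
information = {"i1": 2.1, "i2": 1.7, "i3": 0.9}
r_max = {"i1": 0.20, "i2": 0.20, "i3": 0.35}
print(select_item(information, {"i1": 45, "i2": 20, "i3": 5}, 200, r_max, administered=set()))
```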

Citations: 0