Design Effect in Multilevel Settings: A Commentary on a Latent Variable Modeling Procedure for Its Evaluation.
Pub Date: 2022-10-01. Epub Date: 2021-06-04. DOI: 10.1177/00131644211019447
Tenko Raykov, Christine DiStefano
A latent variable modeling-based procedure is discussed that permits researchers to readily obtain point and interval estimates of the design effect index in multilevel settings using widely circulated software. The method provides useful information about how the standard errors of important parameters change when clustering effects are accounted for, relative to conducting single-level analyses. The approach can also be employed as an addendum to point and interval estimation of the intraclass correlation coefficient in empirical research. The discussed procedure makes it easy to evaluate the design effect in two-level studies using the popular latent variable modeling methodology and is illustrated with an example.
{"title":"Design Effect in Multilevel Settings: A Commentary on a Latent Variable Modeling Procedure for Its Evaluation.","authors":"Tenko Raykov, Christine DiStefano","doi":"10.1177/00131644211019447","DOIUrl":"10.1177/00131644211019447","url":null,"abstract":"<p><p>A latent variable modeling-based procedure is discussed that permits to readily point and interval estimate the design effect index in multilevel settings using widely circulated software. The method provides useful information about the relationship of important parameter standard errors when accounting for clustering effects relative to conducting single-level analyses. The approach can also be employed as an addendum to point and interval estimation of the intraclass correlation coefficient in empirical research. The discussed procedure makes it easily possible to evaluate the design effect in two-level studies by utilizing the popular latent variable modeling methodology and is illustrated with an example.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 5","pages":"1020-1030"},"PeriodicalIF":2.7,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/00131644211019447","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40626319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examining the Robustness of the Graded Response and 2-Parameter Logistic Models to Violations of Construct Normality.
Pub Date: 2022-10-01. Epub Date: 2022-01-07. DOI: 10.1177/00131644211063453
Patrick D Manapat, Michael C Edwards
When fitting unidimensional item response theory (IRT) models, the population distribution of the latent trait (θ) is often assumed to be normally distributed. However, some psychological theories would suggest a nonnormal θ. For example, some clinical traits (e.g., alcoholism, depression) are believed to follow a positively skewed distribution where the construct is low for most people, medium for some, and high for few. Failure to account for nonnormality may compromise the validity of inferences and conclusions. Although corrections have been developed to account for nonnormality, these methods can be computationally intensive and have not yet been widely adopted. Previous research has recommended implementing nonnormality corrections when θ is not "approximately normal." This research focused on examining how far θ can deviate from normal before the normality assumption becomes untenable. Specifically, our goal was to identify the type(s) and degree(s) of nonnormality that result in unacceptable parameter recovery for the graded response model (GRM) and 2-parameter logistic model (2PLM).
{"title":"Examining the Robustness of the Graded Response and 2-Parameter Logistic Models to Violations of Construct Normality.","authors":"Patrick D Manapat, Michael C Edwards","doi":"10.1177/00131644211063453","DOIUrl":"10.1177/00131644211063453","url":null,"abstract":"<p><p>When fitting unidimensional item response theory (IRT) models, the population distribution of the latent trait (θ) is often assumed to be normally distributed. However, some psychological theories would suggest a nonnormal θ. For example, some clinical traits (e.g., alcoholism, depression) are believed to follow a positively skewed distribution where the construct is low for most people, medium for some, and high for few. Failure to account for nonnormality may compromise the validity of inferences and conclusions. Although corrections have been developed to account for nonnormality, these methods can be computationally intensive and have not yet been widely adopted. Previous research has recommended implementing nonnormality corrections when θ is not \"approximately normal.\" This research focused on examining how far θ can deviate from normal before the normality assumption becomes untenable. Specifically, our goal was to identify the type(s) and degree(s) of nonnormality that result in unacceptable parameter recovery for the graded response model (GRM) and 2-parameter logistic model (2PLM).</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 5","pages":"967-988"},"PeriodicalIF":2.7,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9386882/pdf/10.1177_00131644211063453.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"40626322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying Problematic Item Characteristics With Small Samples Using Mokken Scale Analysis.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211045347
Stefanie A Wind
Researchers frequently use Mokken scale analysis (MSA), a nonparametric approach to item response theory, when they have relatively small samples of examinees. Previous research has provided some guidance regarding the minimum sample size for applications of MSA under various conditions. However, these studies have not focused on item-level measurement problems, such as violations of monotonicity or invariant item ordering (IIO). Moreover, these studies have focused on problems that occur for a complete sample of examinees. The current study uses a simulation to examine the sensitivity of MSA item analysis procedures to problematic item characteristics that occur within limited ranges of the latent variable. Results generally support the use of MSA with small samples (N around 100 examinees) as long as multiple indicators of item quality are considered.
{"title":"Identifying Problematic Item Characteristics With Small Samples Using Mokken Scale Analysis.","authors":"Stefanie A Wind","doi":"10.1177/00131644211045347","DOIUrl":"https://doi.org/10.1177/00131644211045347","url":null,"abstract":"<p><p>Researchers frequently use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, when they have relatively small samples of examinees. Researchers have provided some guidance regarding the minimum sample size for applications of MSA under various conditions. However, these studies have not focused on item-level measurement problems, such as violations of monotonicity or invariant item ordering (IIO). Moreover, these studies have focused on problems that occur for a complete sample of examinees. The current study uses a simulation study to consider the sensitivity of MSA item analysis procedures to problematic item characteristics that occur within limited ranges of the latent variable. Results generally support the use of MSA with small samples (<i>N</i> around 100 examinees) as long as multiple indicators of item quality are considered.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"747-756"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228692/pdf/10.1177_00131644211045347.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting Differential Rater Functioning in Severity and Centrality: The Dual DRF Facets Model.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211043207
Kuan-Yu Jin, Thomas Eckes
Performance assessments heavily rely on human ratings. These ratings are typically subject to various forms of error and bias, threatening the validity and fairness of the assessment outcomes. Differential rater functioning (DRF) is a special kind of threat to fairness that manifests itself in unwanted interactions between raters and performance- or construct-irrelevant factors (e.g., examinee gender, rater experience, or time of rating). Most DRF studies have focused on whether raters show differential severity toward known groups of examinees. This study expands the DRF framework and investigates the more complex case of dual DRF effects, where DRF is simultaneously present in rater severity and centrality. Adopting a facets modeling approach, we propose the dual DRF model (DDRFM) for detecting and measuring these effects. In two simulation studies, we found that dual DRF effects (a) negatively affected measurement quality and (b) can reliably be detected and compensated for under the DDRFM. Using sample data from a large-scale writing assessment (N = 1,323), we demonstrate the practical measurement consequences of the dual DRF effects. Findings have implications for researchers and practitioners assessing the psychometric quality of ratings.
{"title":"Detecting Differential Rater Functioning in Severity and Centrality: The Dual DRF Facets Model.","authors":"Kuan-Yu Jin, Thomas Eckes","doi":"10.1177/00131644211043207","DOIUrl":"https://doi.org/10.1177/00131644211043207","url":null,"abstract":"<p><p>Performance assessments heavily rely on human ratings. These ratings are typically subject to various forms of error and bias, threatening the assessment outcomes' validity and fairness. Differential rater functioning (DRF) is a special kind of threat to fairness manifesting itself in unwanted interactions between raters and performance- or construct-irrelevant factors (e.g., examinee gender, rater experience, or time of rating). Most DRF studies have focused on whether raters show differential severity toward known groups of examinees. This study expands the DRF framework and investigates the more complex case of dual DRF effects, where DRF is simultaneously present in rater severity and centrality. Adopting a facets modeling approach, we propose the dual DRF model (DDRFM) for detecting and measuring these effects. In two simulation studies, we found that dual DRF effects (a) negatively affected measurement quality and (b) can reliably be detected and compensated under the DDRFM. Using sample data from a large-scale writing assessment (<i>N</i> = 1,323), we demonstrate the practical measurement consequences of the dual DRF effects. Findings have implications for researchers and practitioners assessing the psychometric quality of ratings.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"757-781"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228693/pdf/10.1177_00131644211043207.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10271624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Response Vector for Mastery Method of Standard Setting.
Pub Date: 2022-08-01. Epub Date: 2021-07-21. DOI: 10.1177/00131644211032388
Dimiter M Dimitrov
Proposed is a new method of standard setting, referred to as the response vector for mastery (RVM) method. Under the RVM method, the task of the panelists who participate in the standard setting process involves neither conceptualizing a borderline examinee nor making probability judgments, as is the case with the Angoff and bookmark methods. Also, the RVM-based computation of a cut-score is not based on a single item (e.g., one marked in an ordered item booklet) but, instead, on a response vector (1/0 scores) on items whose parameters are calibrated in item response theory or under the recently developed D-scoring method. Illustrations with hypothetical and real-data scenarios of standard setting are provided, and methodological aspects of the RVM method are discussed.
{"title":"The Response Vector for Mastery Method of Standard Setting.","authors":"Dimiter M Dimitrov","doi":"10.1177/00131644211032388","DOIUrl":"10.1177/00131644211032388","url":null,"abstract":"<p><p>Proposed is a new method of standard setting referred to as response vector for mastery (RVM) method. Under the RVM method, the task of panelists that participate in the standard setting process does not involve conceptualization of a borderline examinee and probability judgments as it is the case with the Angoff and bookmark methods. Also, the RVM-based computation of a cut-score is not based on a single item (e.g., marked in an ordered item booklet) but, instead, on a response vector (1/0 scores) on items and their parameters calibrated in item response theory or under the recently developed <i>D</i>-scoring method. Illustrations with hypothetical and real-data scenarios of standard setting are provided and methodological aspects of the RVM method are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"719-746"},"PeriodicalIF":2.1,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228699/pdf/10.1177_00131644211032388.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10271623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DIF Detection With Zero-Inflation Under the Factor Mixture Modeling Framework.
Pub Date: 2022-08-01. Epub Date: 2021-07-26. DOI: 10.1177/00131644211028995
Sooyong Lee, Suhwa Han, Seung W Choi
Response data containing an excessive number of zeros are referred to as zero-inflated data. When differential item functioning (DIF) detection is of interest, zero-inflation can attenuate DIF effects in the total sample and lead to underdetection of DIF items. The current study presents a DIF detection procedure for response data with excess zeros due to the existence of unobserved heterogeneous subgroups. The suggested procedure utilizes factor mixture modeling (FMM) with multiple-indicator multiple-cause (MIMIC) models to address the compromised DIF detection power via the estimation of latent classes. A Monte Carlo simulation was conducted to evaluate the suggested procedure in comparison to the well-known likelihood ratio (LR) DIF test. Our simulation study results indicated the superiority of FMM over the LR DIF test in terms of detection power and illustrated the importance of accounting for latent heterogeneity in zero-inflated data. The empirical data analysis results further supported the use of FMM by flagging additional DIF items over and above the LR test.
{"title":"DIF Detection With Zero-Inflation Under the Factor Mixture Modeling Framework.","authors":"Sooyong Lee, Suhwa Han, Seung W Choi","doi":"10.1177/00131644211028995","DOIUrl":"10.1177/00131644211028995","url":null,"abstract":"<p><p>Response data containing an excessive number of zeros are referred to as zero-inflated data. When differential item functioning (DIF) detection is of interest, zero-inflation can attenuate DIF effects in the total sample and lead to underdetection of DIF items. The current study presents a DIF detection procedure for response data with excess zeros due to the existence of unobserved heterogeneous subgroups. The suggested procedure utilizes the factor mixture modeling (FMM) with MIMIC (multiple-indicator multiple-cause) to address the compromised DIF detection power via the estimation of latent classes. A Monte Carlo simulation was conducted to evaluate the suggested procedure in comparison to the well-known likelihood ratio (LR) DIF test. Our simulation study results indicated the superiority of FMM over the LR DIF test in terms of detection power and illustrated the importance of accounting for latent heterogeneity in zero-inflated data. The empirical data analysis results further supported the use of FMM by flagging additional DIF items over and above the LR test.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"678-704"},"PeriodicalIF":2.1,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228697/pdf/10.1177_00131644211028995.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Extended Multivariate Generalizability Theory With Complex Design Structures.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211049746
Robert L Brennan, Stella Y Kim, Won-Chan Lee
This article extends multivariate generalizability theory (MGT) to tests with different random-effects designs for each level of a fixed facet. There are numerous situations in which the design of a test and the resulting data structure are not definable by a single design. One example is mixed-format tests that are composed of multiple-choice and free-response items, with the latter involving variability attributable to both items and raters. In this case, two distinct designs are needed to fully characterize the design and capture potential sources of error associated with each item format. Another example involves tests containing both testlets and one or more stand-alone sets of items. Testlet effects need to be taken into account for the testlet-based items, but not the stand-alone sets of items. This article presents an extension of MGT that faithfully models such complex test designs, along with two real-data examples. Among other things, these examples illustrate that estimates of error variance, error-tolerance ratios, and reliability-like coefficients can be biased if there is a mismatch between the user-specified universe of generalization and the complex nature of the test.
{"title":"Extended Multivariate Generalizability Theory With Complex Design Structures.","authors":"Robert L Brennan, Stella Y Kim, Won-Chan Lee","doi":"10.1177/00131644211049746","DOIUrl":"https://doi.org/10.1177/00131644211049746","url":null,"abstract":"<p><p>This article extends multivariate generalizability theory (MGT) to tests with different random-effects designs for each level of a fixed facet. There are numerous situations in which the design of a test and the resulting data structure are not definable by a single design. One example is mixed-format tests that are composed of multiple-choice and free-response items, with the latter involving variability attributable to both items and raters. In this case, two distinct designs are needed to fully characterize the design and capture potential sources of error associated with each item format. Another example involves tests containing both testlets and one or more stand-alone sets of items. Testlet effects need to be taken into account for the testlet-based items, but not the stand-alone sets of items. This article presents an extension of MGT that faithfully models such complex test designs, along with two real-data examples. Among other things, these examples illustrate that estimates of error variance, error-tolerance ratios, and reliability-like coefficients can be biased if there is a mismatch between the user-specified universe of generalization and the complex nature of the test.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"617-642"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228696/pdf/10.1177_00131644211049746.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid Threshold-Based Sequential Procedures for Detecting Compromised Items in a Computerized Adaptive Testing Licensure Exam.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211023868
Chansoon Lee, Hong Qian
Using classical test theory and item response theory, this study applied sequential procedures to a real operational item pool in a variable-length computerized adaptive testing (CAT) environment to detect items whose security may have been compromised. Moreover, this study proposed a hybrid threshold approach to improve the detection power of the sequential procedure while controlling the Type I error rate. The hybrid threshold approach uses a local threshold for each item in an early stage of the CAT administration and then uses the global threshold in the decision-making stage. Applying various simulation factors, a series of simulation studies examined which factors contribute significantly to the power rate and lag time of the procedure. In addition to the simulation study, a case study investigated whether the procedures are applicable to the real item pool administered in CAT and can identify potentially compromised items in the pool. This research found that the increment in the probability of a correct answer (p-increment) was the simulation factor most important to the sequential procedures' ability to detect compromised items. This study also found that the local threshold approach improved power rates and shortened lag times when the p-increment was small. The findings of this study could help practitioners implement the sequential procedures using the hybrid threshold approach in real-time CAT administration.
{"title":"Hybrid Threshold-Based Sequential Procedures for Detecting Compromised Items in a Computerized Adaptive Testing Licensure Exam.","authors":"Chansoon Lee, Hong Qian","doi":"10.1177/00131644211023868","DOIUrl":"https://doi.org/10.1177/00131644211023868","url":null,"abstract":"<p><p>Using classical test theory and item response theory, this study applied sequential procedures to a real operational item pool in a variable-length computerized adaptive testing (CAT) to detect items whose security may be compromised. Moreover, this study proposed a hybrid threshold approach to improve the detection power of the sequential procedure while controlling the Type I error rate. The hybrid threshold approach uses a local threshold for each item in an early stage of the CAT administration, and then it uses the global threshold in the decision-making stage. Applying various simulation factors, a series of simulation studies examined which factors contribute significantly to the power rate and lag time of the procedure. In addition to the simulation study, a case study investigated whether the procedures are applicable to the real item pool administered in CAT and can identify potentially compromised items in the pool. This research found that the increment of probability of a correct answer (<i>p</i>-increment) was the simulation factor most important to the sequential procedures' ability to detect compromised items. This study also found that the local threshold approach improved power rates and shortened lag times when the <i>p</i>-increment was small. The findings of this study could help practitioners implement the sequential procedures using the hybrid threshold approach in real-time CAT administration.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"782-810"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/00131644211023868","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10272075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Monte Carlo Study of Confidence Interval Methods for Generalizability Coefficient.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211033899
Zhehan Jiang, Mark Raymond, Christine DiStefano, Dexin Shi, Ren Liu, Junhua Sun
Computing confidence intervals around generalizability coefficients has long been a challenging task in generalizability theory. This is a serious practical problem because generalizability coefficients are often computed from designs where some facets have small sample sizes, and researchers have little guidance regarding the trustworthiness of the coefficients. As generalizability theory can be framed as a linear mixed-effects model (LMM), bootstrap and simulation techniques from the LMM paradigm can be used to construct the confidence intervals. The purpose of this research is to examine four LMM-based methods that have been proposed for computing the confidence intervals and to determine their accuracy under six simulated conditions based on the type of test scores (normal, dichotomous, and polytomous data) and the measurement design (p×i×r and p×[i:r]). A bootstrap technique called "parametric methods with spherical random effects" consistently produced more accurate confidence intervals than the three other LMM-based methods. Furthermore, the selected technique was compared with a model-based approach to investigate performance at the level of variance components via a second simulation study, in which the numbers of examinees, raters, and items were varied. We conclude with the recommendation that, when reporting generalizability coefficients, the confidence interval should accompany the point estimate.
{"title":"A Monte Carlo Study of Confidence Interval Methods for Generalizability Coefficient.","authors":"Zhehan Jiang, Mark Raymond, Christine DiStefano, Dexin Shi, Ren Liu, Junhua Sun","doi":"10.1177/00131644211033899","DOIUrl":"https://doi.org/10.1177/00131644211033899","url":null,"abstract":"<p><p>Computing confidence intervals around generalizability coefficients has long been a challenging task in generalizability theory. This is a serious practical problem because generalizability coefficients are often computed from designs where some facets have small sample sizes, and researchers have little guide regarding the trustworthiness of the coefficients. As generalizability theory can be framed to a linear mixed-effect model (LMM), bootstrap and simulation techniques from LMM paradigm can be used to construct the confidence intervals. The purpose of this research is to examine four different LMM-based methods for computing the confidence intervals that have been proposed and to determine their accuracy under six simulated conditions based on the type of test scores (normal, dichotomous, and polytomous data) and data measurement design (<i>p</i>×<i>i</i>×<i>r</i> and <i>p</i>× [<i>i:r</i>]). A bootstrap technique called \"parametric methods with spherical random effects\" consistently produced more accurate confidence intervals than the three other LMM-based methods. Furthermore, the selected technique was compared with model-based approach to investigate the performance at the levels of variance components via the second simulation study, where the numbers of examines, raters, and items were varied. We conclude with the recommendation generalizability coefficients, the confidence interval should accompany the point estimate.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"705-718"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228698/pdf/10.1177_00131644211033899.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robustness of Adaptive Measurement of Change to Item Parameter Estimation Error.
Pub Date: 2022-08-01. DOI: 10.1177/00131644211033902
Allison W Cooperman, David J Weiss, Chun Wang
Adaptive measurement of change (AMC) is a psychometric method for measuring intra-individual change on one or more latent traits across testing occasions. Three hypothesis tests-a Z test, likelihood ratio test, and score ratio index-have demonstrated desirable statistical properties in this context, including low false positive rates and high true positive rates. However, the extant AMC research has assumed that the item parameter values in the simulated item banks were devoid of estimation error. This assumption is unrealistic for applied testing settings, where item parameters are estimated from a calibration sample before test administration. Using Monte Carlo simulation, this study evaluated the robustness of the common AMC hypothesis tests to the presence of item parameter estimation error when measuring omnibus change across four testing occasions. Results indicated that item parameter estimation error had at most a small effect on false positive rates and latent trait change recovery, and these effects were largely explained by the computerized adaptive testing item bank information functions. Differences in AMC performance as a function of item parameter estimation error and choice of hypothesis test were generally limited to simulees with particularly low or high latent trait values, where the item bank provided relatively lower information. These simulations highlight how AMC can accurately measure intra-individual change in the presence of item parameter estimation error when paired with an informative item bank. Limitations and future directions for AMC research are discussed.
{"title":"Robustness of Adaptive Measurement of Change to Item Parameter Estimation Error.","authors":"Allison W Cooperman, David J Weiss, Chun Wang","doi":"10.1177/00131644211033902","DOIUrl":"https://doi.org/10.1177/00131644211033902","url":null,"abstract":"<p><p>Adaptive measurement of change (AMC) is a psychometric method for measuring intra-individual change on one or more latent traits across testing occasions. Three hypothesis tests-a <i>Z</i> test, likelihood ratio test, and score ratio index-have demonstrated desirable statistical properties in this context, including low false positive rates and high true positive rates. However, the extant AMC research has assumed that the item parameter values in the simulated item banks were devoid of estimation error. This assumption is unrealistic for applied testing settings, where item parameters are estimated from a calibration sample before test administration. Using Monte Carlo simulation, this study evaluated the robustness of the common AMC hypothesis tests to the presence of item parameter estimation error when measuring omnibus change across four testing occasions. Results indicated that item parameter estimation error had at most a small effect on false positive rates and latent trait change recovery, and these effects were largely explained by the computerized adaptive testing item bank information functions. Differences in AMC performance as a function of item parameter estimation error and choice of hypothesis test were generally limited to simulees with particularly low or high latent trait values, where the item bank provided relatively lower information. These simulations highlight how AMC can accurately measure intra-individual change in the presence of item parameter estimation error when paired with an informative item bank. Limitations and future directions for AMC research are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"82 4","pages":"643-677"},"PeriodicalIF":2.7,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9228691/pdf/10.1177_00131644211033902.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10290045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}