
Latest publications in Educational and Psychological Measurement

From Linear Geometry to Nonlinear and Information-Geometric Settings in Test Theory: Bregman Projections as a Unifying Framework.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-12-12 DOI: 10.1177/00131644251393483
Bruno D Zumbo

This article develops a unified geometric framework linking expectation, regression, test theory, reliability, and item response theory through the concept of Bregman projection. Building on operator-theoretic and convex-analytic foundations, the framework extends the linear geometry of classical test theory (CTT) into nonlinear and information-geometric settings. Reliability and regression emerge as measures of projection efficiency: linear in Hilbert space and nonlinear under convex potentials. The exposition demonstrates that classical conditional expectation, least-squares regression, and information projections in exponential-family models share a common mathematical structure defined by Bregman divergence. By situating CTT within this broader geometric context, the article clarifies relationships between measurement, expectation, and statistical inference, providing a coherent foundation for nonlinear measurement and estimation in psychometrics.
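For orientation, the central object can be stated compactly (this is the standard definition, not notation quoted from the article): for a differentiable, strictly convex potential φ, the Bregman divergence is

```latex
D_{\varphi}(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle .
```

The quadratic potential φ(x) = ½‖x‖² gives D_φ(x, y) = ½‖x − y‖², recovering the orthogonal least-squares projection of Hilbert-space CTT, while the negative-entropy potential yields the Kullback-Leibler divergence that underlies information projections in exponential families.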

{"title":"From Linear Geometry to Nonlinear and Information-Geometric Settings in Test Theory: Bregman Projections as a Unifying Framework.","authors":"Bruno D Zumbo","doi":"10.1177/00131644251393483","DOIUrl":"10.1177/00131644251393483","url":null,"abstract":"<p><p>This article develops a unified geometric framework linking expectation, regression, test theory, reliability, and item response theory through the concept of Bregman projection. Building on operator-theoretic and convex-analytic foundations, the framework extends the linear geometry of classical test theory (CTT) into nonlinear and information-geometric settings. Reliability and regression emerge as measures of projection efficiency-linear in Hilbert space and nonlinear under convex potentials. The exposition demonstrates that classical conditional expectation, least-squares regression, and information projections in exponential-family models share a common mathematical structure defined by Bregman divergence. By situating CTT within this broader geometric context, the article clarifies relationships between measurement, expectation, and statistical inference, providing a coherent foundation for nonlinear measurement and estimation in psychometrics.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251393483"},"PeriodicalIF":2.3,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701833/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145762466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Reliability as Projection in Operator-Theoretic Test Theory: Conditional Expectation, Hilbert Space Geometry, and Implications for Psychometric Practice.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-12 DOI: 10.1177/00131644251389891
Bruno D Zumbo

This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation E(X | G), representing the orthogonal projection of the observed score onto the σ-algebra of the latent variable. Reliability, expressed as Rel(X) = Var[E(X | G)] / Var(X), quantifies the efficiency of this projection: the squared cosine between X and its true-score projection. This formulation unifies reliability with regression R², factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.

{"title":"Reliability as Projection in Operator-Theoretic Test Theory: Conditional Expectation, Hilbert Space Geometry, and Implications for Psychometric Practice.","authors":"Bruno D Zumbo","doi":"10.1177/00131644251389891","DOIUrl":"10.1177/00131644251389891","url":null,"abstract":"<p><p>This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation <math><mrow><mi>E</mi> <mo>(</mo> <mi>X</mi> <mo>∣</mo> <mi>G</mi> <mo>)</mo></mrow> </math> , representing the orthogonal projection of the observed score onto the σ-algebra of the latent variable. Reliability, expressed as <math><mrow><mi>Rel</mi> <mo>(</mo> <mi>X</mi> <mo>)</mo> <mo>=</mo> <mi>Var</mi> <mo>[</mo> <mi>E</mi> <mo>(</mo> <mi>X</mi> <mo>∣</mo> <mi>G</mi> <mo>)</mo> <mo>]</mo> <mo>/</mo> <mi>Var</mi> <mo>(</mo> <mi>X</mi> <mo>)</mo></mrow> </math> , quantifies the efficiency of this projection-the squared cosine between <math><mrow><mi>X</mi> <mspace></mspace></mrow> </math> and its true-score projection. This formulation unifies reliability with regression <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> , factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. 
The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251389891"},"PeriodicalIF":2.3,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12615236/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145539189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Agreement Lambda for Weighted Disagreement With Ordinal Scales: Correction for Category Prevalence.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-08 DOI: 10.1177/00131644251376553
Rashid Saif Almehrizi

Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.
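The abstract does not spell out the weighted Lambda formula itself, so the sketch below illustrates only its raw ingredient: the weighted observed agreement for two raters on a k-category ordinal scale, using the conventional linear weights w(i, j) = 1 − |i − j| / (k − 1). The rating pairs are invented for illustration.

```python
# Weighted observed agreement for two raters on a k-category ordinal scale,
# with linear weights w(i, j) = 1 - |i - j| / (k - 1): exact matches score 1,
# near-misses earn partial credit. The data below are made up.
def weighted_agreement(pairs, k):
    """Mean linear-weight credit over (rater A, rater B) category pairs 0..k-1."""
    credit = [1 - abs(i - j) / (k - 1) for i, j in pairs]
    return sum(credit) / len(credit)

ratings = [(0, 0), (1, 2), (3, 3), (2, 2), (0, 1)]
print(round(weighted_agreement(ratings, k=4), 3))
```

Prevalence-corrected coefficients such as the proposed Lambda then adjust this observed quantity; the adjustment itself is described in the article.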

{"title":"Agreement Lambda for Weighted Disagreement With Ordinal Scales: Correction for Category Prevalence.","authors":"Rashid Saif Almehrizi","doi":"10.1177/00131644251376553","DOIUrl":"10.1177/00131644251376553","url":null,"abstract":"<p><p>Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. 
Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251376553"},"PeriodicalIF":2.3,"publicationDate":"2025-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12602299/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145502674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
On the Complex Sources of Differential Item Functioning: A Comparison of Three Methods.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-07 DOI: 10.1177/00131644251379802
Haeju Lee, Sijia Huang, Dubravka Svetina Valdivia, Ben Schwartzman

Differential item functioning (DIF) has been a long-standing problem in educational and psychological measurement. In practice, the source from which DIF originates can be complex in the sense that an item can show DIF on multiple background variables of different types simultaneously. Although a variety of non-item response theory (IRT)-based and IRT-based DIF detection methods have been introduced, they do not sufficiently address the issue of DIF evaluation when its source is complex. The recently proposed least absolute shrinkage and selection operator (LASSO) regularization method has shown promising results in detecting DIF on multiple background variables. To provide more insight, in this study, we compared three DIF detection methods, including the non-IRT-based logistic regression (LR), the IRT-based likelihood ratio test (LRT), and LASSO regularization, through a comprehensive simulation and an empirical data analysis. We found that when multiple background variables were considered, the Type I error and power rates of the three methods for identifying DIF items on one of the variables depended not only on the sample size and its DIF magnitude but also on the DIF magnitude of the other background variable and the correlation between them. Additional findings, limitations, and future research directions are also discussed.
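As a minimal illustration of what a uniform DIF effect means in this setting (the Rasch probabilities and the DIF magnitude below are hypothetical, not values from the simulation):

```python
# Uniform DIF under a Rasch model (all numbers hypothetical): the same ability
# yields different success probabilities when the item is effectively harder
# for the focal group by the DIF magnitude.
import math

def rasch_p(theta, b):
    """P(correct answer) under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

theta = 0.5                       # a common ability level
b_ref, dif = -0.2, 0.6            # focal group faces difficulty b_ref + dif
p_ref = rasch_p(theta, b_ref)
p_focal = rasch_p(theta, b_ref + dif)
print(round(p_ref - p_focal, 3))  # probability gap attributable solely to DIF
```

Detection methods such as LR, LRT, and LASSO regularization all aim to flag this kind of group-dependent response probability after conditioning on ability.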

{"title":"On the Complex Sources of Differential Item Functioning: A Comparison of Three Methods.","authors":"Haeju Lee, Sijia Huang, Dubravka Svetina Valdivia, Ben Schwartzman","doi":"10.1177/00131644251379802","DOIUrl":"10.1177/00131644251379802","url":null,"abstract":"<p><p>Differential item functioning (DIF) has been a long-standing problem in educational and psychological measurement. In practice, the source from which DIF originates can be complex in the sense that an item can show DIF on multiple background variables of different types simultaneously. Although a variety of non-item response theory-(IRT)-based and IRT-based DIF detection methods have been introduced, they do not sufficiently address the issue of DIF evaluation when its source is complex. The recently proposed <i>l</i>east <i>a</i>bsolute <i>s</i>hrinkage and <i>s</i>election <i>o</i>perator (LASSO) regularization method has shown promising results of detecting DIF on multiple background variables. To provide more insight, in this study, we compared three DIF detection methods, including the non-IRT-based logistic regression (LR), the IRT-based likelihood ratio test (LRT), and LASSO regularization, through a comprehensive simulation and an empirical data analysis. We found that when multiple background variables were considered, the Type I error and Power rates of the three methods for identifying DIF items on one of the variables depended on not only the sample size and its DIF magnitude but also on the DIF magnitude of the other background variable and the correlation between them. 
We presented other findings and discussed the limitations and future research directions in this paper.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251379802"},"PeriodicalIF":2.3,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12602301/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145502628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An Evaluation of the Replicable Factor Analytic Solutions Algorithm for Variable Selection: A Simulation Study.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-03 DOI: 10.1177/00131644251377381
Daniel A Sass, Michael A Sanchez

Observed variable and factor selection are critical components of factor analysis, particularly when the optimal subset of observed variables and the number of factors are unknown and results cannot be replicated across studies. The Replicable Factor Analytic Solutions (RFAS) algorithm was developed to assess the replicability of factor structures, both in terms of the number of factors and the variables retained, while identifying the "best" or most replicable solutions according to predefined criteria. This study evaluated RFAS performance across 54 experimental conditions that varied in model complexity (six-factor models), interfactor correlations (ρ = 0, .30, and .60), and sample sizes (n = 300, 500, and 1000). Under default settings, RFAS generally performed well and demonstrated its utility in producing replicable factor structures. However, performance declined with highly correlated factors, smaller sample sizes, and more complex models. RFAS was also compared to four alternative variable selection methods: Ant Colony Optimization (ACO), Weighted Group Least Absolute Shrinkage and Selection Operator (LASSO), and stepwise procedures based on target Tucker-Lewis Index (TLI) and ΔTLI criteria. Stepwise and LASSO methods were largely ineffective at eliminating problematic variables under the studied conditions. In contrast, both RFAS and ACO successfully removed variables as intended, although the resulting factor structures often differed substantially between the two approaches. As with other variable selection methods, refining algorithmic criteria may be necessary to further enhance model performance.

{"title":"An Evaluation of the Replicable Factor Analytic Solutions Algorithm for Variable Selection: A Simulation Study.","authors":"Daniel A Sass, Michael A Sanchez","doi":"10.1177/00131644251377381","DOIUrl":"10.1177/00131644251377381","url":null,"abstract":"<p><p>Observed variable and factor selection are critical components of factor analysis, particularly when the optimal subset of observed variables and the number of factors are unknown and results cannot be replicated across studies. The Replicable Factor Analytic Solutions (RFAS) algorithm was developed to assess the replicability of factor structures-both in terms of the number of factors and the variables retained-while identifying the \"best\" or most replicable solutions according to predefined criteria. This study evaluated RFAS performance across 54 experimental conditions that varied in model complexity (six-factor models), interfactor correlations (ρ = 0, .30, and .60), and sample sizes (<i>n</i> = 300, 500, and 1000). Under default settings, RFAS generally performed well and demonstrated its utility in producing replicable factor structures. However, performance declined with highly correlated factors, smaller sample sizes, and more complex models. RFAS was also compared to four alternative variable selection methods: Ant Colony Optimization (ACO), Weighted Group Least Absolute Shrinkage and Selection Operator (LASSO), and stepwise procedures based on target Tucker-Lewis Index (TLI) and ΔTLI criteria. Stepwise and LASSO methods were largely ineffective at eliminating problematic variables under the studied conditions. In contrast, both RFAS and ACO successfully removed variables as intended, although the resulting factor structures often differed substantially between the two approaches. 
As with other variable selection methods, refining algorithmic criteria may be necessary to further enhance model performance.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251377381"},"PeriodicalIF":2.3,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12583011/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145451151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Coefficient Lambda for Interrater Agreement Among Multiple Raters: Correction for Category Prevalence.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-11-03 DOI: 10.1177/00131644251380540
Rashid Saif Almehrizi

Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.
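For context, the baseline being extended can be computed in a few lines. The sketch below implements standard Fleiss's Kappa (the article's prevalence-corrected Lambda is not reproduced here); the rating table is made up for illustration.

```python
# Standard Fleiss's Kappa for m raters classifying N subjects into k
# categories; counts[i][j] = number of raters who put subject i in category j.
# The toy table below (4 subjects, 5 raters, 3 categories) is invented.
def fleiss_kappa(counts):
    n_subjects = len(counts)
    m = sum(counts[0])  # ratings per subject (assumed constant)
    k = len(counts[0])
    # mean within-subject agreement
    p_bar = sum(
        (sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts
    ) / n_subjects
    # chance agreement from observed category prevalences
    p_j = [sum(row[j] for row in counts) / (n_subjects * m) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

table = [[4, 1, 0], [0, 5, 0], [1, 1, 3], [0, 0, 5]]
print(round(fleiss_kappa(table), 3))  # prints 0.58
```

The chance-agreement term p_e is exactly the quantity the article argues is better modeled via category prevalence and category characteristics than via random rater assignment.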

{"title":"Coefficient Lambda for Interrater Agreement Among Multiple Raters: Correction for Category Prevalence.","authors":"Rashid Saif Almehrizi","doi":"10.1177/00131644251380540","DOIUrl":"10.1177/00131644251380540","url":null,"abstract":"<p><p>Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. 
It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251380540"},"PeriodicalIF":2.3,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12583010/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145451160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Common Persons Design in Score Equating: A Monte Carlo Investigation.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-10-29 DOI: 10.1177/00131644251380585
Jiayi Liu, Zhehan Jiang, Tianpeng Zheng, Yuting Han, Shicong Feng

The Common Persons (CP) equating design offers critical advantages for high-security testing contexts, eliminating anchor-item exposure risks while accommodating non-equivalent groups, yet few studies have systematically examined how CP characteristics influence equating accuracy, and the field still lacks clear implementation guidelines. Addressing this gap, this comprehensive Monte Carlo simulation (N = 5,000 examinees per form; 500 replications) evaluates CP equating by manipulating 8 factors: test length, difficulty shift, ability dispersion, correlation between test forms, and CP characteristics. Four equating methods (identity, IRT true-score, linear, equipercentile) were compared using normalized RMSE and %Bias. Key findings reveal: (a) when the CP sample size reaches at least 30, CP sample properties exert negligible influence on accuracy, challenging assumptions about distributional representativeness; (b) test factors dominate outcomes: difficulty shifts (Δδ_XY = 1) degrade IRT precision severely (|%Bias| > 22% vs. linear/equipercentile's |%Bias| < 1.5%), while longer tests reduce NRMSE and wider ability dispersion (σ_θ = 1) enhances precision through improved person-item targeting; (c) equipercentile and linear methods demonstrate superior robustness under form differences. We establish minimum operational thresholds: ≥30 CPs covering the score range suffice for precise equating. These results provide an evidence-based framework for CP implementation by systematically examining multiple manipulated factors, resolving security-vs-accuracy tradeoffs in high-stakes equating (e.g., credentialing exams) and enabling novel solutions like synthetic respondents.
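Of the four methods compared, linear equating is the simplest to state: map a score on form X to the form-Y scale by matching the first two moments of the two forms. A minimal sketch, with hypothetical moments (e.g., as they might be estimated through a common-persons link):

```python
# Linear equating: transform score x from form X onto the form-Y scale by
# matching means and standard deviations. The moments below are hypothetical.
def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """y(x) = (sd_y / sd_x) * (x - mean_x) + mean_y"""
    return sd_y / sd_x * (x - mean_x) + mean_y

print(linear_equate(30, mean_x=28.0, sd_x=6.0, mean_y=26.5, sd_y=5.4))
```

Equipercentile equating replaces this two-moment match with a match of the full score distributions, which is one reason both methods tolerate form-difficulty differences better than the IRT true-score approach in the reported results.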

{"title":"Common Persons Design in Score Equating: A Monte Carlo Investigation.","authors":"Jiayi Liu, Zhehan Jiang, Tianpeng Zheng, Yuting Han, Shicong Feng","doi":"10.1177/00131644251380585","DOIUrl":"10.1177/00131644251380585","url":null,"abstract":"<p><p>The Common Persons (CP) equating design offers critical advantages for high-security testing contexts-eliminating anchor item exposure risks while accommodating non-equivalent groups-yet few studies have systematically examined how CP characteristics influence equating accuracy, and the field still lacks clear implementation guidelines. Addressing this gap, this comprehensive Monte Carlo simulation (<i>N</i> = 5,000 examinees per form; 500 replications) evaluates CP equating by manipulating 8 factors: test length, difficulty shift, ability dispersion, correlation between test forms and CP characteristics. Four equating methods (identity, IRT true-score, linear, equipercentile) were compared using normalized RMSE and %Bias. Key findings reveal: (a) when the CP sample size reaches at least 30, CP sample properties exert negligible influence on accuracy, challenging assumptions about distributional representativeness; (b) Test factors dominate outcomes-difficulty shifts ( <math><mrow><mi>Δ</mi> <msub><mrow><mi>δ</mi></mrow> <mrow><mi>XY</mi></mrow> </msub> </mrow> </math> = 1) degrade IRT precision severely (|%Bias| >22% vs. linear/equipercentile's |%Bias| <1.5%), while longer tests reduce NRMSE and wider ability dispersion ( <math> <mrow> <msub><mrow><mi>σ</mi></mrow> <mrow><mi>θ</mi></mrow> </msub> </mrow> </math> = 1) enhances precision through improved person-item targeting; (c) Equipercentile and linear methods demonstrate superior robustness under form differences. We establish minimum operational thresholds: ≥30 CPs covering the score range suffice for precise equating. 
These results provide an evidence-based framework for CP implementation by systematically examining multiple manipulated factors, resolving security-vs-accuracy tradeoffs in high-stakes equating (e.g., credentialing exams) and enabling novel solutions like synthetic respondents.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251380585"},"PeriodicalIF":2.3,"publicationDate":"2025-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12571793/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145430563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Path Analysis With Mixed-Scale Variables: Categorical ML, Least Squares, and Bayesian Estimations.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-10-27 DOI: 10.1177/00131644251379773
Xinya Liang, Paula Castro, Chunhua Cao, Wen-Juo Lo

In applied research across education, the social and behavioral sciences, and medicine, path models frequently incorporate both continuous and ordinal manifest variables to predict binary outcomes. This study employs Monte Carlo simulations to evaluate six estimators: robust maximum likelihood with probit and logit links (MLR-probit, MLR-logit), mean- and variance-adjusted weighted and unweighted least squares (WLSMV, ULSMV), and Bayesian methods with noninformative and weakly informative priors (Bayes-NI, Bayes-WI). Across various sample sizes, variable scales, and effect sizes, results show that WLSMV and Bayes-WI consistently achieve low bias and RMSE, particularly in small samples or when mediators have few categories. By contrast, categorical MLR approaches tended to yield unstable estimates for modest effects. These findings offer practical guidance for selecting estimators in mixed-scale path analyses and underscore their implications for robust inference.
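A concrete sense of how close the probit and logit links are (a textbook numerical fact, not a result from this study): the logistic cdf rescaled by the classical constant 1.702 stays within about 0.01 of the standard normal cdf everywhere, which is why the two link choices usually differ mainly in coefficient scale rather than in fitted probabilities.

```python
# Compare the probit link (standard normal cdf) with the logit link rescaled
# by the classical approximation constant 1.702, over a grid of z in [-4, 4].
import math

def probit_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit_cdf(z, scale=1.702):
    return 1.0 / (1.0 + math.exp(-scale * z))

worst = max(abs(probit_cdf(i / 10) - logit_cdf(i / 10)) for i in range(-40, 41))
print(round(worst, 4))  # maximum gap over the grid
```

This closeness also explains why estimator choice (MLR-probit vs. MLR-logit) matters less in the reported results than the estimation framework itself.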

Citations: 0
Correcting the Variance of Effect Sizes Based on Binary Outcomes for Clustering.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-10-23 DOI: 10.1177/00131644251380777
Larry V Hedges

Researchers conducting systematic reviews and meta-analyses often encounter studies in which the research design is a well-conducted cluster randomized trial, but the statistical analysis does not take clustering into account. For example, the study might assign treatments by clusters but the analysis may not take into account the clustered treatment assignment. Alternatively, the analysis of the primary outcome of the study might take clustering into account, but the reviewer might be interested in another outcome for which only summary data are available in a form that does not take clustering into account. This article provides expressions for the approximate variance of risk differences, log risk ratios, and log odds ratios computed from clustered binary data, using the intraclass correlations. An example illustrates the calculations. References to empirical estimates of intraclass correlations are provided.
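The article derives its own expressions; as context, a standard first-order correction (not necessarily the article's exact formulas) inflates the independence-based Woolf variance of a log odds ratio by the Kish design effect 1 + (m̄ − 1)ρ, where m̄ is the average cluster size and ρ the intraclass correlation. A minimal sketch, with made-up cell counts:

```python
import math

def var_log_or_independent(a, b, c, d):
    # Woolf variance of the log odds ratio for a 2x2 table
    # (a/b events/non-events in one arm, c/d in the other),
    # assuming independent observations.
    return 1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d

def design_effect(m_bar, icc):
    # Kish design effect for clusters of average size m_bar
    # with intraclass correlation icc.
    return 1.0 + (m_bar - 1.0) * icc

def var_log_or_clustered(a, b, c, d, m_treat, m_ctrl, icc):
    # First-order correction: inflate each arm's variance
    # contribution by that arm's design effect.
    de_t = design_effect(m_treat, icc)
    de_c = design_effect(m_ctrl, icc)
    return de_t * (1.0 / a + 1.0 / b) + de_c * (1.0 / c + 1.0 / d)

# Illustrative numbers: 40/60 vs 25/75, clusters of size 20, ICC = 0.05.
v0 = var_log_or_independent(40, 60, 25, 75)
v1 = var_log_or_clustered(40, 60, 25, 75, 20, 20, 0.05)
print(round(v0, 4), round(v1, 4))  # the clustered variance is larger
```

With an ICC of 0.05 and clusters of 20, the design effect is 1.95, so ignoring clustering here would understate the variance by almost half.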

Citations: 0
Network Approaches to Binary Assessment Data: Network Psychometrics Versus Latent Space Item Response Models.
IF 2.3 CAS Tier 3 (Psychology) Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-10-23 DOI: 10.1177/00131644251371187
Ludovica De Carolis, Minjeong Jeon

This study compares two network-based approaches for analyzing binary psychological assessment data: network psychometrics and latent space item response modeling (LSIRM). Network psychometrics, a well-established method, infers relationships among items or symptoms based on pairwise conditional dependencies. In contrast, LSIRM is a more recent framework that represents item responses as a bipartite network of respondents and items embedded in a latent metric space, where the likelihood of a response decreases with increasing distance between the respondent and item. We evaluate the performance of both methods through simulation studies under varying data-generating conditions. In addition, we demonstrate their applications to real assessment data, showcasing the distinct insights each method offers to researchers and practitioners.
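The LSIRM response rule described above can be sketched directly: the logit of a correct/endorsed response is a respondent intercept plus an item intercept minus the distance between their latent positions, so probability falls as distance grows. The positions and intercepts below are made-up illustrations, not estimates from the study.

```python
import math

def lsirm_prob(theta_p, beta_i, z_p, w_i):
    # LSIRM response rule: logit P(y=1) = theta_p + beta_i - dist(z_p, w_i),
    # where z_p and w_i are 2-D latent positions of respondent and item.
    dist = math.hypot(z_p[0] - w_i[0], z_p[1] - w_i[1])
    eta = theta_p + beta_i - dist
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative latent positions: the nearer item is more likely endorsed.
person = (0.0, 0.0)
near_item, far_item = (0.5, 0.0), (3.0, 0.0)
p_near = lsirm_prob(0.2, 0.1, person, near_item)
p_far = lsirm_prob(0.2, 0.1, person, far_item)
print(p_near > p_far)  # True: probability decreases with distance
```

This also shows the contrast with network psychometrics: here the "network" is bipartite (respondents to items), whereas the Ising-style approach estimates conditional dependencies among the items themselves.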

Citations: 0
Educational and Psychological Measurement