From Linear Geometry to Nonlinear and Information-Geometric Settings in Test Theory: Bregman Projections as a Unifying Framework
Pub Date: 2025-12-12 | DOI: 10.1177/00131644251393483
Bruno D Zumbo
This article develops a unified geometric framework linking expectation, regression, test theory, reliability, and item response theory through the concept of Bregman projection. Building on operator-theoretic and convex-analytic foundations, the framework extends the linear geometry of classical test theory (CTT) into nonlinear and information-geometric settings. Reliability and regression emerge as measures of projection efficiency: linear in Hilbert space and nonlinear under convex potentials. The exposition demonstrates that classical conditional expectation, least-squares regression, and information projections in exponential-family models share a common mathematical structure defined by Bregman divergence. By situating CTT within this broader geometric context, the article clarifies relationships between measurement, expectation, and statistical inference, providing a coherent foundation for nonlinear measurement and estimation in psychometrics.
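As a reader's note (not drawn from the article itself), the Bregman divergence that defines these projections is standard: for a differentiable, strictly convex potential φ,

\[
D_{\varphi}(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y \rangle ,
\]

so that φ(x) = ½‖x‖² recovers squared Euclidean distance (the linear, Hilbert-space case of classical test theory), while a negative-entropy potential yields the Kullback-Leibler divergence underlying information projections in exponential families.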
{"title":"From Linear Geometry to Nonlinear and Information-Geometric Settings in Test Theory: Bregman Projections as a Unifying Framework.","authors":"Bruno D Zumbo","doi":"10.1177/00131644251393483","DOIUrl":"10.1177/00131644251393483","url":null,"abstract":"<p><p>This article develops a unified geometric framework linking expectation, regression, test theory, reliability, and item response theory through the concept of Bregman projection. Building on operator-theoretic and convex-analytic foundations, the framework extends the linear geometry of classical test theory (CTT) into nonlinear and information-geometric settings. Reliability and regression emerge as measures of projection efficiency-linear in Hilbert space and nonlinear under convex potentials. The exposition demonstrates that classical conditional expectation, least-squares regression, and information projections in exponential-family models share a common mathematical structure defined by Bregman divergence. By situating CTT within this broader geometric context, the article clarifies relationships between measurement, expectation, and statistical inference, providing a coherent foundation for nonlinear measurement and estimation in psychometrics.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251393483"},"PeriodicalIF":2.3,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701833/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145762466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliability as Projection in Operator-Theoretic Test Theory: Conditional Expectation, Hilbert Space Geometry, and Implications for Psychometric Practice
Pub Date: 2025-11-12 | DOI: 10.1177/00131644251389891
Bruno D Zumbo
This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation E(X | G), representing the orthogonal projection of the observed score onto the σ-algebra of the latent variable. Reliability, expressed as Rel(X) = Var[E(X | G)] / Var(X), quantifies the efficiency of this projection: the squared cosine between X and its true-score projection. This formulation unifies reliability with regression R², factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.
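A minimal numerical sketch (not from the article; all values are simulated and assumed) of reliability as projection efficiency, Var[E(X | G)] / Var(X), and of its equality with the squared correlation between X and its true-score projection:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
theta = rng.normal(0.0, 1.0, n)      # latent variable generating the sigma-algebra G
true_score = 2.0 + 1.5 * theta       # E(X | G), a function of theta
error = rng.normal(0.0, 1.0, n)      # measurement error, independent of theta
x = true_score + error               # observed score X

# Reliability as projection efficiency: Var[E(X | G)] / Var(X)
rel_projection = true_score.var() / x.var()

# Squared correlation (squared cosine) between X and its true-score projection
rel_cos2 = np.corrcoef(x, true_score)[0, 1] ** 2

print(round(rel_projection, 3), round(rel_cos2, 3))  # both near 1.5**2 / (1.5**2 + 1) ≈ 0.692
```

Both quantities estimate the same population value, which is the point of the projection interpretation.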
{"title":"Reliability as Projection in Operator-Theoretic Test Theory: Conditional Expectation, Hilbert Space Geometry, and Implications for Psychometric Practice.","authors":"Bruno D Zumbo","doi":"10.1177/00131644251389891","DOIUrl":"10.1177/00131644251389891","url":null,"abstract":"<p><p>This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation <math><mrow><mi>E</mi> <mo>(</mo> <mi>X</mi> <mo>∣</mo> <mi>G</mi> <mo>)</mo></mrow> </math> , representing the orthogonal projection of the observed score onto the σ-algebra of the latent variable. Reliability, expressed as <math><mrow><mi>Rel</mi> <mo>(</mo> <mi>X</mi> <mo>)</mo> <mo>=</mo> <mi>Var</mi> <mo>[</mo> <mi>E</mi> <mo>(</mo> <mi>X</mi> <mo>∣</mo> <mi>G</mi> <mo>)</mo> <mo>]</mo> <mo>/</mo> <mi>Var</mi> <mo>(</mo> <mi>X</mi> <mo>)</mo></mrow> </math> , quantifies the efficiency of this projection-the squared cosine between <math><mrow><mi>X</mi> <mspace></mspace></mrow> </math> and its true-score projection. This formulation unifies reliability with regression <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> , factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251389891"},"PeriodicalIF":2.3,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12615236/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145539189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Agreement Lambda for Weighted Disagreement With Ordinal Scales: Correction for Category Prevalence
Pub Date: 2025-11-08 | DOI: 10.1177/00131644251376553
Rashid Saif Almehrizi
Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.
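As background (this is not the article's weighted Lambda formula, which the abstract does not give), the weighted observed agreement that both weighted Kappa and a weighted Lambda-type coefficient start from can be computed from a rater-by-rater cross-classification with, for example, quadratic weights; the table below is hypothetical:

```python
import numpy as np

# Hypothetical 4x4 cross-classification of two raters on an ordinal scale.
table = np.array([
    [20,  5,  1,  0],
    [ 4, 25,  6,  1],
    [ 1,  5, 18,  4],
    [ 0,  2,  3, 15],
], dtype=float)

k = table.shape[0]
p = table / table.sum()

# Quadratic agreement weights: 1 on the diagonal, decreasing with squared category distance.
i, j = np.indices((k, k))
w = 1.0 - (i - j) ** 2 / (k - 1) ** 2

# Weighted observed agreement: the quantity weighted Kappa adjusts for chance
# and the proposed weighted Lambda instead adjusts for prevalence-agreement effects.
p_o_weighted = float((w * p).sum())
print(round(p_o_weighted, 3))
```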
{"title":"Agreement Lambda for Weighted Disagreement With Ordinal Scales: Correction for Category Prevalence.","authors":"Rashid Saif Almehrizi","doi":"10.1177/00131644251376553","DOIUrl":"10.1177/00131644251376553","url":null,"abstract":"<p><p>Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251376553"},"PeriodicalIF":2.3,"publicationDate":"2025-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12602299/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145502674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Complex Sources of Differential Item Functioning: A Comparison of Three Methods
Pub Date: 2025-11-07 | DOI: 10.1177/00131644251379802
Haeju Lee, Sijia Huang, Dubravka Svetina Valdivia, Ben Schwartzman
Differential item functioning (DIF) has been a long-standing problem in educational and psychological measurement. In practice, the source from which DIF originates can be complex, in the sense that an item can show DIF on multiple background variables of different types simultaneously. Although a variety of non-IRT-based and item response theory (IRT)-based DIF detection methods have been introduced, they do not sufficiently address the issue of DIF evaluation when its source is complex. The recently proposed least absolute shrinkage and selection operator (LASSO) regularization method has shown promising results in detecting DIF on multiple background variables. To provide more insight, in this study we compared three DIF detection methods, namely non-IRT-based logistic regression (LR), the IRT-based likelihood ratio test (LRT), and LASSO regularization, through a comprehensive simulation and an empirical data analysis. We found that when multiple background variables were considered, the Type I error and power rates of the three methods for identifying DIF items on one of the variables depended not only on the sample size and its DIF magnitude but also on the DIF magnitude of the other background variable and the correlation between them. We present other findings and discuss limitations and future research directions in this paper.
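A minimal sketch (not the study's code; data and effect sizes are simulated and assumed) of the non-IRT-based logistic regression approach for a single background variable, testing uniform DIF by comparing nested models with and without the grouping variable:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)              # one background variable (reference vs. focal)
theta = rng.normal(0, 1, n)
matching = theta + rng.normal(0, 0.5, n)   # observed matching score (proxy for the total score)

# Item with uniform DIF: the group shifts the intercept of the response function.
logit = -0.2 + 1.2 * theta - 0.6 * group
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

m0 = sm.Logit(y, sm.add_constant(np.column_stack([matching]))).fit(disp=0)
m1 = sm.Logit(y, sm.add_constant(np.column_stack([matching, group]))).fit(disp=0)

lr_stat = 2 * (m1.llf - m0.llf)            # likelihood-ratio test for uniform DIF
p_value = stats.chi2.sf(lr_stat, df=1)
print(round(lr_stat, 2), round(p_value, 4))
```

With multiple background variables of mixed types, the study's finding is that such single-variable tests also depend on the other variables' DIF magnitudes and on the correlation between the variables.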
{"title":"On the Complex Sources of Differential Item Functioning: A Comparison of Three Methods.","authors":"Haeju Lee, Sijia Huang, Dubravka Svetina Valdivia, Ben Schwartzman","doi":"10.1177/00131644251379802","DOIUrl":"10.1177/00131644251379802","url":null,"abstract":"<p><p>Differential item functioning (DIF) has been a long-standing problem in educational and psychological measurement. In practice, the source from which DIF originates can be complex in the sense that an item can show DIF on multiple background variables of different types simultaneously. Although a variety of non-item response theory-(IRT)-based and IRT-based DIF detection methods have been introduced, they do not sufficiently address the issue of DIF evaluation when its source is complex. The recently proposed <i>l</i>east <i>a</i>bsolute <i>s</i>hrinkage and <i>s</i>election <i>o</i>perator (LASSO) regularization method has shown promising results of detecting DIF on multiple background variables. To provide more insight, in this study, we compared three DIF detection methods, including the non-IRT-based logistic regression (LR), the IRT-based likelihood ratio test (LRT), and LASSO regularization, through a comprehensive simulation and an empirical data analysis. We found that when multiple background variables were considered, the Type I error and Power rates of the three methods for identifying DIF items on one of the variables depended on not only the sample size and its DIF magnitude but also on the DIF magnitude of the other background variable and the correlation between them. We presented other findings and discussed the limitations and future research directions in this paper.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251379802"},"PeriodicalIF":2.3,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12602301/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145502628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Evaluation of the Replicable Factor Analytic Solutions Algorithm for Variable Selection: A Simulation Study
Pub Date: 2025-11-03 | DOI: 10.1177/00131644251377381
Daniel A Sass, Michael A Sanchez
Observed variable and factor selection are critical components of factor analysis, particularly when the optimal subset of observed variables and the number of factors are unknown and results cannot be replicated across studies. The Replicable Factor Analytic Solutions (RFAS) algorithm was developed to assess the replicability of factor structures, both in terms of the number of factors and the variables retained, while identifying the "best" or most replicable solutions according to predefined criteria. This study evaluated RFAS performance across 54 experimental conditions that varied in model complexity (six-factor models), interfactor correlations (ρ = 0, .30, and .60), and sample sizes (n = 300, 500, and 1000). Under default settings, RFAS generally performed well and demonstrated its utility in producing replicable factor structures. However, performance declined with highly correlated factors, smaller sample sizes, and more complex models. RFAS was also compared to four alternative variable selection methods: Ant Colony Optimization (ACO), Weighted Group Least Absolute Shrinkage and Selection Operator (LASSO), and stepwise procedures based on target Tucker-Lewis Index (TLI) and ΔTLI criteria. Stepwise and LASSO methods were largely ineffective at eliminating problematic variables under the studied conditions. In contrast, both RFAS and ACO successfully removed variables as intended, although the resulting factor structures often differed substantially between the two approaches. As with other variable selection methods, refining algorithmic criteria may be necessary to further enhance model performance.
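For orientation (this is not the RFAS algorithm itself; the loadings and the number of indicators per factor below are assumptions), data for one of the simulated conditions, a six-factor simple-structure model with interfactor correlation ρ = .30 and n = 500, could be generated as follows:

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_factors, per_factor = 500, 6, 5      # five indicators per factor (assumed)
rho, loading = 0.30, 0.70                 # interfactor correlation; common loading (assumed)
uniqueness = 1.0 - loading ** 2

# Interfactor correlation matrix and factor scores
phi = np.full((n_factors, n_factors), rho)
np.fill_diagonal(phi, 1.0)
eta = rng.multivariate_normal(np.zeros(n_factors), phi, size=n)

# Simple structure: each observed variable loads on exactly one factor
lam = np.zeros((n_factors * per_factor, n_factors))
for f in range(n_factors):
    lam[f * per_factor:(f + 1) * per_factor, f] = loading

x = eta @ lam.T + rng.normal(0, np.sqrt(uniqueness), size=(n, n_factors * per_factor))
print(x.shape)  # (500, 30) simulated observed variables
```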
{"title":"An Evaluation of the Replicable Factor Analytic Solutions Algorithm for Variable Selection: A Simulation Study.","authors":"Daniel A Sass, Michael A Sanchez","doi":"10.1177/00131644251377381","DOIUrl":"10.1177/00131644251377381","url":null,"abstract":"<p><p>Observed variable and factor selection are critical components of factor analysis, particularly when the optimal subset of observed variables and the number of factors are unknown and results cannot be replicated across studies. The Replicable Factor Analytic Solutions (RFAS) algorithm was developed to assess the replicability of factor structures-both in terms of the number of factors and the variables retained-while identifying the \"best\" or most replicable solutions according to predefined criteria. This study evaluated RFAS performance across 54 experimental conditions that varied in model complexity (six-factor models), interfactor correlations (ρ = 0, .30, and .60), and sample sizes (<i>n</i> = 300, 500, and 1000). Under default settings, RFAS generally performed well and demonstrated its utility in producing replicable factor structures. However, performance declined with highly correlated factors, smaller sample sizes, and more complex models. RFAS was also compared to four alternative variable selection methods: Ant Colony Optimization (ACO), Weighted Group Least Absolute Shrinkage and Selection Operator (LASSO), and stepwise procedures based on target Tucker-Lewis Index (TLI) and ΔTLI criteria. Stepwise and LASSO methods were largely ineffective at eliminating problematic variables under the studied conditions. In contrast, both RFAS and ACO successfully removed variables as intended, although the resulting factor structures often differed substantially between the two approaches. As with other variable selection methods, refining algorithmic criteria may be necessary to further enhance model performance.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251377381"},"PeriodicalIF":2.3,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12583011/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145451151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Coefficient Lambda for Interrater Agreement Among Multiple Raters: Correction for Category Prevalence
Pub Date: 2025-11-03 | DOI: 10.1177/00131644251380540
Rashid Saif Almehrizi
Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.
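For reference (this is the coefficient being corrected, not the proposed Lambda, whose formula the abstract does not give), Fleiss's Kappa can be computed from a subjects-by-categories count matrix; the counts below are hypothetical:

```python
import numpy as np

# Hypothetical ratings: 10 subjects, 4 categories, 6 raters per subject (each row sums to 6).
counts = np.array([
    [6, 0, 0, 0],
    [4, 2, 0, 0],
    [3, 3, 0, 0],
    [0, 5, 1, 0],
    [0, 4, 2, 0],
    [1, 1, 4, 0],
    [0, 0, 5, 1],
    [0, 0, 3, 3],
    [0, 1, 1, 4],
    [0, 0, 0, 6],
], dtype=float)

n_raters = counts.sum(axis=1)[0]

# Per-subject observed agreement and its average
p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
p_bar = p_i.mean()

# Chance agreement from the marginal category proportions
# (the chance-correction term whose rationale the article questions)
p_j = counts.sum(axis=0) / counts.sum()
p_e = np.sum(p_j ** 2)

fleiss_kappa = (p_bar - p_e) / (1 - p_e)
print(round(fleiss_kappa, 3))
```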
{"title":"Coefficient Lambda for Interrater Agreement Among Multiple Raters: Correction for Category Prevalence.","authors":"Rashid Saif Almehrizi","doi":"10.1177/00131644251380540","DOIUrl":"10.1177/00131644251380540","url":null,"abstract":"<p><p>Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251380540"},"PeriodicalIF":2.3,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12583010/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145451160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Common Persons Design in Score Equating: A Monte Carlo Investigation
Pub Date: 2025-10-29 | DOI: 10.1177/00131644251380585
Jiayi Liu, Zhehan Jiang, Tianpeng Zheng, Yuting Han, Shicong Feng
The Common Persons (CP) equating design offers critical advantages for high-security testing contexts, eliminating anchor item exposure risks while accommodating non-equivalent groups, yet few studies have systematically examined how CP characteristics influence equating accuracy, and the field still lacks clear implementation guidelines. Addressing this gap, this comprehensive Monte Carlo simulation (N = 5,000 examinees per form; 500 replications) evaluates CP equating by manipulating eight factors: test length, difficulty shift, ability dispersion, correlation between test forms, and CP characteristics. Four equating methods (identity, IRT true-score, linear, equipercentile) were compared using normalized RMSE and %Bias. Key findings reveal: (a) when the CP sample size reaches at least 30, CP sample properties exert negligible influence on accuracy, challenging assumptions about distributional representativeness; (b) test factors dominate outcomes: difficulty shifts (Δδ_XY = 1) degrade IRT precision severely (|%Bias| > 22% vs. linear/equipercentile's |%Bias| < 1.5%), while longer tests reduce NRMSE and wider ability dispersion (σ_θ = 1) enhances precision through improved person-item targeting; (c) equipercentile and linear methods demonstrate superior robustness under form differences. We establish minimum operational thresholds: ≥30 CPs covering the score range suffice for precise equating. These results provide an evidence-based framework for CP implementation by systematically examining multiple manipulated factors, resolving security-versus-accuracy tradeoffs in high-stakes equating (e.g., credentialing exams) and enabling novel solutions like synthetic respondents.
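A minimal sketch (not the study's code; all scores are simulated and the form difference is assumed) of one of the four compared methods, linear equating through the common persons' scores on both forms:

```python
import numpy as np

rng = np.random.default_rng(7)
n_cp = 30                                    # the minimum CP sample size suggested by the findings
theta = rng.normal(0.0, 1.0, n_cp)           # common persons' abilities

# Hypothetical raw scores of the same persons on two forms, with form Y slightly harder.
score_x = 20 + 6 * theta + rng.normal(0, 2, n_cp)
score_y = 19 + 6 * theta + rng.normal(0, 2, n_cp)

# Linear equating of X onto the Y scale through the CP group:
# y*(x) = (sd_Y / sd_X) * (x - mean_X) + mean_Y
slope = score_y.std(ddof=1) / score_x.std(ddof=1)
intercept = score_y.mean() - slope * score_x.mean()

x_new = 25.0                                  # a form-X score to be equated
print(round(slope * x_new + intercept, 2))    # its linear equivalent on form Y
```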
{"title":"Common Persons Design in Score Equating: A Monte Carlo Investigation.","authors":"Jiayi Liu, Zhehan Jiang, Tianpeng Zheng, Yuting Han, Shicong Feng","doi":"10.1177/00131644251380585","DOIUrl":"10.1177/00131644251380585","url":null,"abstract":"<p><p>The Common Persons (CP) equating design offers critical advantages for high-security testing contexts-eliminating anchor item exposure risks while accommodating non-equivalent groups-yet few studies have systematically examined how CP characteristics influence equating accuracy, and the field still lacks clear implementation guidelines. Addressing this gap, this comprehensive Monte Carlo simulation (<i>N</i> = 5,000 examinees per form; 500 replications) evaluates CP equating by manipulating 8 factors: test length, difficulty shift, ability dispersion, correlation between test forms and CP characteristics. Four equating methods (identity, IRT true-score, linear, equipercentile) were compared using normalized RMSE and %Bias. Key findings reveal: (a) when the CP sample size reaches at least 30, CP sample properties exert negligible influence on accuracy, challenging assumptions about distributional representativeness; (b) Test factors dominate outcomes-difficulty shifts ( <math><mrow><mi>Δ</mi> <msub><mrow><mi>δ</mi></mrow> <mrow><mi>XY</mi></mrow> </msub> </mrow> </math> = 1) degrade IRT precision severely (|%Bias| >22% vs. linear/equipercentile's |%Bias| <1.5%), while longer tests reduce NRMSE and wider ability dispersion ( <math> <mrow> <msub><mrow><mi>σ</mi></mrow> <mrow><mi>θ</mi></mrow> </msub> </mrow> </math> = 1) enhances precision through improved person-item targeting; (c) Equipercentile and linear methods demonstrate superior robustness under form differences. We establish minimum operational thresholds: ≥30 CPs covering the score range suffice for precise equating. These results provide an evidence-based framework for CP implementation by systematically examining multiple manipulated factors, resolving security-vs-accuracy tradeoffs in high-stakes equating (e.g., credentialing exams) and enabling novel solutions like synthetic respondents.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251380585"},"PeriodicalIF":2.3,"publicationDate":"2025-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12571793/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145430563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Path Analysis With Mixed-Scale Variables: Categorical ML, Least Squares, and Bayesian Estimations
Pub Date: 2025-10-27 | DOI: 10.1177/00131644251379773
Xinya Liang, Paula Castro, Chunhua Cao, Wen-Juo Lo
In applied research across education, the social and behavioral sciences, and medicine, path models frequently incorporate both continuous and ordinal manifest variables to predict binary outcomes. This study employs Monte Carlo simulations to evaluate six estimators: robust maximum likelihood with probit and logit links (MLR-probit, MLR-logit), mean- and variance-adjusted weighted and unweighted least squares (WLSMV, ULSMV), and Bayesian methods with noninformative and weakly informative priors (Bayes-NI, Bayes-WI). Across various sample sizes, variable scales, and effect sizes, results show that WLSMV and Bayes-WI consistently achieve low bias and RMSE, particularly in small samples or when mediators have few categories. By contrast, categorical MLR approaches tended to yield unstable estimates for modest effects. These findings offer practical guidance for selecting estimators in mixed-scale path analyses and underscore their implications for robust inference.
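A minimal illustration (not the study's path models, which involve mediators and SEM estimation; the data and coefficients here are assumed) of the probit and logit links compared for a binary outcome with one continuous and one ordinal predictor:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 800
x_cont = rng.normal(0, 1, n)                 # continuous predictor
x_ord = rng.integers(1, 5, n).astype(float)  # ordinal predictor with four categories, treated as numeric

# Hypothetical data-generating process with a probit link
y_star = 0.5 * x_cont + 0.3 * x_ord + rng.normal(0, 1, n)
y = (y_star > 1.0).astype(int)

X = sm.add_constant(np.column_stack([x_cont, x_ord]))
probit_fit = sm.Probit(y, X).fit(disp=0)
logit_fit = sm.Logit(y, X).fit(disp=0)

# Logit coefficients are roughly the probit coefficients scaled by about 1.6-1.8,
# reflecting the different link-function variances rather than a substantive difference.
print(np.round(probit_fit.params, 3))
print(np.round(logit_fit.params, 3))
```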
{"title":"Path Analysis With Mixed-Scale Variables: Categorical ML, Least Squares, and Bayesian Estimations.","authors":"Xinya Liang, Paula Castro, Chunhua Cao, Wen-Juo Lo","doi":"10.1177/00131644251379773","DOIUrl":"10.1177/00131644251379773","url":null,"abstract":"<p><p>In applied research across education, the social and behavioral sciences, and medicine, path models frequently incorporate both continuous and ordinal manifest variables to predict binary outcomes. This study employs Monte Carlo simulations to evaluate six estimators: robust maximum likelihood with probit and logit links (MLR-probit, MLR-logit), mean- and variance-adjusted weighted and unweighted least squares (WLSMV, ULSMV), and Bayesian methods with noninformative and weakly informative priors (Bayes-NI, Bayes-WI). Across various sample sizes, variable scales, and effect sizes, results show that WLSMV and Bayes-WI consistently achieve low bias and RMSE, particularly in small samples or when mediators have few categories. By contrast, categorical MLR approaches tended to yield unstable estimates for modest effects. These findings offer practical guidance for selecting estimators in mixed-scale path analyses and underscore their implications for robust inference.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251379773"},"PeriodicalIF":2.3,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12568548/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145408337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Correcting the Variance of Effect Sizes Based on Binary Outcomes for Clustering
Pub Date: 2025-10-23 | DOI: 10.1177/00131644251380777
Larry V Hedges
Researchers conducting systematic reviews and meta-analyses often encounter studies in which the research design is a well-conducted cluster-randomized trial, but the statistical analysis does not take clustering into account. For example, the study might assign treatments by cluster, but the analysis may not account for the clustered treatment assignment. Alternatively, the analysis of the primary outcome of the study might take clustering into account, but the reviewer might be interested in another outcome for which only summary data are available in a form that does not take clustering into account. This article provides expressions for the approximate variance of risk differences, log risk ratios, and log odds ratios computed from clustered binary data, using the intraclass correlations. An example illustrates the calculations. References to empirical estimates of intraclass correlations are provided.
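A commonly used first-order sketch of such a correction (the article's exact expressions, derived from the intraclass correlations, may differ; all counts below are hypothetical) inflates the naive variance of the log odds ratio by a design effect:

```python
import math

# Hypothetical 2x2 summary from a cluster-randomized trial analyzed as if unclustered.
a, b = 45, 55      # treatment group: events, non-events
c, d = 30, 70      # control group:   events, non-events
m, icc = 25, 0.02  # average cluster size and intraclass correlation (assumed values)

log_or = math.log((a * d) / (b * c))
var_naive = 1 / a + 1 / b + 1 / c + 1 / d   # variance ignoring clustering
deff = 1 + (m - 1) * icc                    # design effect for roughly equal cluster sizes
var_clustered = var_naive * deff            # approximate clustering-corrected variance

print(round(log_or, 3), round(var_naive, 4), round(var_clustered, 4))
```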
{"title":"Correcting the Variance of Effect Sizes Based on Binary Outcomes for Clustering.","authors":"Larry V Hedges","doi":"10.1177/00131644251380777","DOIUrl":"10.1177/00131644251380777","url":null,"abstract":"<p><p>Researchers conducting systematic reviews and meta-analyses often encounter studies in which the research design is a well conducted cluster randomized trial, but the statistical analysis does not take clustering into account. For example, the study might assign treatments by clusters but the analysis may not take into account the clustered treatment assignment. Alternatively, the analysis of the primary outcome of the study might take clustering into account, but the reviewer might be interested in another outcome for which only summary data are available in a form that does not take clustering into account. This article provides expressions for the approximate variance of risk differences, log risk ratios, and log odds ratios computed from clustered binary data, using the intraclass correlations. An example illustrates the calculations. References to empirical estimates of intraclass correlations are provided.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251380777"},"PeriodicalIF":2.3,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12549596/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145372242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Network Approaches to Binary Assessment Data: Network Psychometrics Versus Latent Space Item Response Models
Pub Date: 2025-10-23 | DOI: 10.1177/00131644251371187
Ludovica De Carolis, Minjeong Jeon
This study compares two network-based approaches for analyzing binary psychological assessment data: network psychometrics and latent space item response modeling (LSIRM). Network psychometrics, a well-established method, infers relationships among items or symptoms based on pairwise conditional dependencies. In contrast, LSIRM is a more recent framework that represents item responses as a bipartite network of respondents and items embedded in a latent metric space, where the likelihood of a response decreases with increasing distance between the respondent and item. We evaluate the performance of both methods through simulation studies under varying data-generating conditions. In addition, we demonstrate their applications to real assessment data, showcasing the distinct insights each method offers to researchers and practitioners.
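For readers unfamiliar with LSIRM, its response model is often written as follows (this notation is an assumption of this note, not taken from the abstract): the log-odds of endorsing an item decrease with the latent distance between person p and item i,

\[
\operatorname{logit} P(Y_{pi} = 1) = \theta_p + \beta_i - \gamma \, \lVert z_p - w_i \rVert ,
\]

where θ_p and β_i are person and item intercepts, z_p and w_i are their positions in the latent metric space, and γ ≥ 0 weights the distance. Network psychometrics, by contrast, models conditional dependencies among the items or symptoms themselves rather than person-item distances.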
{"title":"Network Approaches to Binary Assessment Data: Network Psychometrics Versus Latent Space Item Response Models.","authors":"Ludovica De Carolis, Minjeong Jeon","doi":"10.1177/00131644251371187","DOIUrl":"10.1177/00131644251371187","url":null,"abstract":"<p><p>This study compares two network-based approaches for analyzing binary psychological assessment data: network psychometrics and latent space item response modeling (LSIRM). Network psychometrics, a well-established method, infers relationships among items or symptoms based on pairwise conditional dependencies. In contrast, LSIRM is a more recent framework that represents item responses as a bipartite network of respondents and items embedded in a latent metric space, where the likelihood of a response decreases with increasing distance between the respondent and item. We evaluate the performance of both methods through simulation studies under varying data-generating conditions. In addition, we demonstrate their applications to real assessment data, showcasing the distinct insights each method offers to researchers and practitioners.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251371187"},"PeriodicalIF":2.3,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12549609/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145376583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}