Pub Date: 2025-11-12. DOI: 10.1177/00131644251389891
Reliability as Projection in Operator-Theoretic Test Theory: Conditional Expectation, Hilbert Space Geometry, and Implications for Psychometric Practice
Bruno D Zumbo
This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation E(X | G), representing the orthogonal projection of the observed score onto the σ-algebra generated by the latent variable. Reliability, expressed as Rel(X) = Var[E(X | G)] / Var(X), quantifies the efficiency of this projection: the squared cosine between X and its true-score projection. This formulation unifies reliability with regression R², factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.
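As a numerical check on this variance-ratio definition, the following minimal sketch (illustrative code, not taken from the article) simulates a latent variable G and an observed score X = G + E with independent error, so that E(X | G) = G; the empirical ratio Var[E(X | G)]/Var(X) then coincides, up to sampling error, with the R² from regressing X on G and with the squared cosine (squared correlation) between X and its projection.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Latent variable G and observed score X = G + E with independent error,
# so the conditional expectation E(X | G) equals G itself.
var_true, var_error = 1.0, 0.5
G = rng.normal(0.0, np.sqrt(var_true), n)
E = rng.normal(0.0, np.sqrt(var_error), n)
X = G + E

proj = G                                              # E(X | G) under this model
rel_projection = proj.var() / X.var()                 # Var[E(X | G)] / Var(X)
rel_squared_cosine = np.corrcoef(X, proj)[0, 1] ** 2  # squared cosine between X and its projection
rel_theoretical = var_true / (var_true + var_error)   # 1 / 1.5 ≈ 0.667

print(rel_projection, rel_squared_cosine, rel_theoretical)
# All three agree up to sampling error, illustrating Rel(X) = Var[E(X | G)] / Var(X) = R^2.
```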
Pub Date: 2025-11-08. DOI: 10.1177/00131644251376553
Agreement Lambda for Weighted Disagreement With Ordinal Scales: Correction for Category Prevalence
Rashid Saif Almehrizi
Weighted inter-rater agreement allows for differentiation between levels of disagreement among rating categories and is especially useful when there is an ordinal relationship between categories. Many existing weighted inter-rater agreement coefficients are either extensions of weighted Kappa or are formulated as Cohen's Kappa-like coefficients. These measures suffer from the same issues as Cohen's Kappa, including sensitivity to the marginal distributions of raters and the effects of category prevalence. They primarily account for the possibility of chance agreement or disagreement. This article introduces a new coefficient, weighted Lambda, which allows for the inclusion of varying weights assigned to disagreements. Unlike traditional methods, this coefficient does not assume random assignment and does not adjust for chance agreement or disagreement. Instead, it modifies the observed percentage of agreement while taking into account the anticipated impact of prevalence-agreement effects. The study also outlines techniques for estimating sampling standard errors, conducting hypothesis tests, and constructing confidence intervals for weighted Lambda. Illustrative numerical examples and Monte Carlo simulations are presented to investigate and compare the performance of the new weighted Lambda with commonly used weighted inter-rater agreement coefficients across various true agreement levels and agreement matrices. Results demonstrate several advantages of the new coefficient in measuring weighted inter-rater agreement.
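For orientation, the kind of coefficient the article builds on and critiques can be computed directly from a two-rater contingency table. The sketch below implements standard quadratic-weighted Cohen's Kappa (not the proposed weighted Lambda, whose formula is developed in the article); note how the chance-expected term is built from the raters' marginal distributions, which is exactly the source of the marginal sensitivity discussed above.

```python
import numpy as np

def weighted_kappa(counts):
    """Quadratic-weighted Cohen's Kappa from a k x k contingency table
    (rows = rater 1 categories, columns = rater 2 categories)."""
    counts = np.asarray(counts, dtype=float)
    k = counts.shape[0]
    p = counts / counts.sum()
    i, j = np.indices((k, k))
    w = 1.0 - ((i - j) / (k - 1)) ** 2            # agreement weights; near-diagonal cells count partially
    p_row, p_col = p.sum(axis=1), p.sum(axis=0)
    p_obs = (w * p).sum()                         # weighted observed agreement
    p_exp = (w * np.outer(p_row, p_col)).sum()    # weighted agreement expected under chance (marginal products)
    return (p_obs - p_exp) / (1.0 - p_exp)

# Two raters scoring 100 subjects on a 4-point ordinal scale (hypothetical counts).
table = [[20,  5,  1, 0],
         [ 4, 25,  6, 1],
         [ 1,  5, 18, 4],
         [ 0,  1,  3, 6]]
print(round(weighted_kappa(table), 3))
```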
Pub Date: 2025-11-07. DOI: 10.1177/00131644251379802
On the Complex Sources of Differential Item Functioning: A Comparison of Three Methods
Haeju Lee, Sijia Huang, Dubravka Svetina Valdivia, Ben Schwartzman
Differential item functioning (DIF) has been a long-standing problem in educational and psychological measurement. In practice, the source from which DIF originates can be complex in the sense that an item can show DIF on multiple background variables of different types simultaneously. Although a variety of non-item response theory (IRT)-based and IRT-based DIF detection methods have been introduced, they do not sufficiently address the issue of DIF evaluation when its source is complex. The recently proposed least absolute shrinkage and selection operator (LASSO) regularization method has shown promising results in detecting DIF on multiple background variables. To provide more insight, in this study we compared three DIF detection methods, including the non-IRT-based logistic regression (LR), the IRT-based likelihood ratio test (LRT), and LASSO regularization, through a comprehensive simulation and an empirical data analysis. We found that when multiple background variables were considered, the Type I error and power rates of the three methods for identifying DIF items on one of the variables depended not only on the sample size and that variable's DIF magnitude but also on the DIF magnitude of the other background variable and the correlation between the variables. We presented other findings and discussed the limitations and future research directions in this paper.
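As a reference point for the non-IRT-based approach, the standard logistic-regression DIF procedure fits nested models containing the matching total score, a grouping variable, and their interaction; the group and interaction terms flag uniform and non-uniform DIF, respectively. The sketch below is a minimal single-item, single-grouping-variable illustration with hypothetical simulated data, not the multi-variable LASSO setting examined in the article.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 2000

# Hypothetical data: one studied item, a matching total score, and a binary group.
group = rng.integers(0, 2, n)
total = rng.normal(0, 1, n)
# Item with uniform DIF: group shifts the intercept after conditioning on the total score.
logit = -0.2 + 1.1 * total + 0.5 * group
item = rng.binomial(1, 1 / (1 + np.exp(-logit)))
df = pd.DataFrame({"item": item, "total": total, "group": group})

m0 = smf.logit("item ~ total", df).fit(disp=0)                         # matching variable only
m1 = smf.logit("item ~ total + group", df).fit(disp=0)                 # + group: uniform DIF
m2 = smf.logit("item ~ total + group + total:group", df).fit(disp=0)   # + interaction: non-uniform DIF

# Likelihood-ratio chi-square statistics for the two nested comparisons.
lr_uniform = 2 * (m1.llf - m0.llf)
lr_nonuniform = 2 * (m2.llf - m1.llf)
print(round(lr_uniform, 2), round(lr_nonuniform, 2))
```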
Pub Date: 2025-11-03. DOI: 10.1177/00131644251377381
An Evaluation of the Replicable Factor Analytic Solutions Algorithm for Variable Selection: A Simulation Study
Daniel A Sass, Michael A Sanchez
Observed variable and factor selection are critical components of factor analysis, particularly when the optimal subset of observed variables and the number of factors are unknown and results cannot be replicated across studies. The Replicable Factor Analytic Solutions (RFAS) algorithm was developed to assess the replicability of factor structures, both in terms of the number of factors and the variables retained, while identifying the "best" or most replicable solutions according to predefined criteria. This study evaluated RFAS performance across 54 experimental conditions that varied in model complexity (six-factor models), interfactor correlations (ρ = 0, .30, and .60), and sample sizes (n = 300, 500, and 1000). Under default settings, RFAS generally performed well and demonstrated its utility in producing replicable factor structures. However, performance declined with highly correlated factors, smaller sample sizes, and more complex models. RFAS was also compared to four alternative variable selection methods: Ant Colony Optimization (ACO), Weighted Group Least Absolute Shrinkage and Selection Operator (LASSO), and stepwise procedures based on target Tucker-Lewis Index (TLI) and ΔTLI criteria. Stepwise and LASSO methods were largely ineffective at eliminating problematic variables under the studied conditions. In contrast, both RFAS and ACO successfully removed variables as intended, although the resulting factor structures often differed substantially between the two approaches. As with other variable selection methods, refining algorithmic criteria may be necessary to further enhance model performance.
Pub Date: 2025-11-03. DOI: 10.1177/00131644251380540
Coefficient Lambda for Interrater Agreement Among Multiple Raters: Correction for Category Prevalence
Rashid Saif Almehrizi
Fleiss's Kappa is an extension of Cohen's Kappa, developed to assess the degree of interrater agreement among multiple raters or methods classifying subjects using categorical scales. Like Cohen's Kappa, it adjusts the observed proportion of agreement to account for agreement expected by chance. However, over time, several paradoxes and interpretative challenges have been identified, largely stemming from the assumption of random chance agreement and the sensitivity of the coefficient to the number of raters. Interpreting Fleiss's Kappa can be particularly difficult due to its dependence on the distribution of categories and prevalence patterns. This paper argues that a portion of the observed agreement may be better explained by the interaction between category prevalence and inherent category characteristics, such as ambiguity, appeal, or social desirability, rather than by chance alone. By shifting away from the assumption of random rater assignment, the paper introduces a novel agreement coefficient that adjusts for the expected agreement by accounting for category prevalence, providing a more accurate measure of interrater reliability in the presence of imbalanced category distributions. It also examines the theoretical justification for this new measure, its interpretability, its standard error, and the robustness of its estimates in simulation and practical applications.
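For reference, Fleiss's Kappa, whose chance-agreement correction the proposed coefficient replaces, can be computed from a subjects-by-categories count matrix as in the minimal sketch below (the new coefficient Lambda and its prevalence correction are defined in the article and are not reproduced here).

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss's Kappa from an N x k matrix of counts: counts[i, j] is the number
    of raters assigning subject i to category j (equal number of raters per subject)."""
    counts = np.asarray(counts, dtype=float)
    n_subjects, n_raters = counts.shape[0], counts[0].sum()
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)                             # category prevalences
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))   # per-subject agreement
    p_bar = p_i.mean()                                                             # observed agreement
    p_e = np.sum(p_j ** 2)                                                         # agreement expected under random assignment
    return (p_bar - p_e) / (1.0 - p_e)

# Six raters classifying ten subjects into three categories (hypothetical counts).
ratings = np.array([[6, 0, 0], [5, 1, 0], [4, 2, 0], [3, 3, 0], [6, 0, 0],
                    [0, 5, 1], [1, 4, 1], [0, 0, 6], [2, 2, 2], [0, 1, 5]])
print(round(fleiss_kappa(ratings), 3))
```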
Pub Date: 2025-10-29. DOI: 10.1177/00131644251380585
Common Persons Design in Score Equating: A Monte Carlo Investigation
Jiayi Liu, Zhehan Jiang, Tianpeng Zheng, Yuting Han, Shicong Feng
The Common Persons (CP) equating design offers critical advantages for high-security testing contexts, eliminating anchor item exposure risks while accommodating non-equivalent groups, yet few studies have systematically examined how CP characteristics influence equating accuracy, and the field still lacks clear implementation guidelines. Addressing this gap, this comprehensive Monte Carlo simulation (N = 5,000 examinees per form; 500 replications) evaluates CP equating by manipulating 8 factors: test length, difficulty shift, ability dispersion, correlation between test forms, and CP characteristics. Four equating methods (identity, IRT true-score, linear, equipercentile) were compared using normalized RMSE and %Bias. Key findings reveal: (a) when the CP sample size reaches at least 30, CP sample properties exert negligible influence on accuracy, challenging assumptions about distributional representativeness; (b) test factors dominate outcomes: difficulty shifts (Δδ_XY = 1) degrade IRT precision severely (|%Bias| > 22% vs. linear/equipercentile's |%Bias| < 1.5%), while longer tests reduce NRMSE and wider ability dispersion (σ_θ = 1) enhances precision through improved person-item targeting; (c) equipercentile and linear methods demonstrate superior robustness under form differences. We establish minimum operational thresholds: ≥30 CPs covering the score range suffice for precise equating. These results provide an evidence-based framework for CP implementation by systematically examining multiple manipulated factors, resolving security-vs.-accuracy tradeoffs in high-stakes equating (e.g., credentialing exams) and enabling novel solutions like synthetic respondents.
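To make the design concrete, the sketch below illustrates linear equating through a common-persons link: the same (hypothetical) group of at least 30 CPs takes both forms, and their observed means and standard deviations define the transformation that places Form X scores on the Form Y scale. This is a minimal illustration of one of the four compared methods, not the article's simulation code.

```python
import numpy as np

def linear_equate(x_cp, y_cp, x_new):
    """Linear common-persons equating: map Form X scores onto the Form Y scale
    using the CP group's observed means and standard deviations."""
    mu_x, mu_y = np.mean(x_cp), np.mean(y_cp)
    sd_x, sd_y = np.std(x_cp, ddof=1), np.std(y_cp, ddof=1)
    return mu_y + (sd_y / sd_x) * (np.asarray(x_new) - mu_x)

rng = np.random.default_rng(3)
theta = rng.normal(0, 1, 40)                    # 40 common persons, above the reported >= 30 threshold
x_cp = 25 + 5 * theta + rng.normal(0, 2, 40)    # CP scores on Form X
y_cp = 27 + 5 * theta + rng.normal(0, 2, 40)    # CP scores on the slightly easier Form Y

print(linear_equate(x_cp, y_cp, x_new=[20, 25, 30]).round(2))
# A Form X score of 25 maps to roughly 27 on the Form Y scale under this difficulty shift.
```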
Pub Date: 2025-10-27. DOI: 10.1177/00131644251379773
Path Analysis With Mixed-Scale Variables: Categorical ML, Least Squares, and Bayesian Estimations
Xinya Liang, Paula Castro, Chunhua Cao, Wen-Juo Lo
In applied research across education, the social and behavioral sciences, and medicine, path models frequently incorporate both continuous and ordinal manifest variables to predict binary outcomes. This study employs Monte Carlo simulations to evaluate six estimators: robust maximum likelihood with probit and logit links (MLR-probit, MLR-logit), mean- and variance-adjusted weighted and unweighted least squares (WLSMV, ULSMV), and Bayesian methods with noninformative and weakly informative priors (Bayes-NI, Bayes-WI). Across various sample sizes, variable scales, and effect sizes, results show that WLSMV and Bayes-WI consistently achieve low bias and RMSE, particularly in small samples or when mediators have few categories. By contrast, categorical MLR approaches tended to yield unstable estimates for modest effects. These findings offer practical guidance for selecting estimators in mixed-scale path analyses and underscore their implications for robust inference.
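To illustrate the type of model under study, the sketch below estimates a small recursive path model equation by equation with hypothetical data: a continuous predictor affects a five-category ordinal mediator (treated as numeric here purely for simplicity), which in turn predicts a binary outcome through a probit link. This is only a rough analogue of the MLR-probit setup; the WLSMV, ULSMV, and Bayesian estimators compared in the article are not reproduced.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 1000

# Hypothetical data: continuous predictor x, 5-category ordinal mediator m, binary outcome y.
x = rng.normal(0, 1, n)
m_latent = 0.6 * x + rng.normal(0, 1, n)
m = np.digitize(m_latent, np.quantile(m_latent, [0.2, 0.4, 0.6, 0.8])) + 1   # categories 1..5
y_latent = 0.4 * m + 0.3 * x + rng.normal(0, 1, n)
y = (y_latent > np.median(y_latent)).astype(int)
df = pd.DataFrame({"x": x, "m": m, "y": y})

# Equation-by-equation estimation of the recursive path model:
med_eq = smf.ols("m ~ x", df).fit()              # mediator equation (ordinal mediator treated as numeric)
out_eq = smf.probit("y ~ m + x", df).fit(disp=0) # binary outcome equation with a probit link

print(med_eq.params.round(3))
print(out_eq.params.round(3))
```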
Pub Date: 2025-10-23. DOI: 10.1177/00131644251380777
Correcting the Variance of Effect Sizes Based on Binary Outcomes for Clustering
Larry V Hedges
Researchers conducting systematic reviews and meta-analyses often encounter studies in which the research design is a well conducted cluster randomized trial, but the statistical analysis does not take clustering into account. For example, the study might assign treatments by clusters but the analysis may not take into account the clustered treatment assignment. Alternatively, the analysis of the primary outcome of the study might take clustering into account, but the reviewer might be interested in another outcome for which only summary data are available in a form that does not take clustering into account. This article provides expressions for the approximate variance of risk differences, log risk ratios, and log odds ratios computed from clustered binary data, using the intraclass correlations. An example illustrates the calculations. References to empirical estimates of intraclass correlations are provided.
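The general logic of such corrections is to inflate a variance computed under independence by a design effect that depends on the intraclass correlation and the cluster size. The sketch below applies the familiar design effect 1 + (m - 1)ρ for equal cluster sizes of m to the standard large-sample variance of a log odds ratio; it illustrates the idea under that simplifying assumption and is not necessarily identical to the article's expressions, which also cover risk differences and log risk ratios.

```python
import math

def log_odds_ratio_variance(a, b, c, d, icc=0.0, cluster_size=1):
    """Approximate variance of the log odds ratio from a 2x2 table
    (a, b = events/non-events in treatment; c, d = events/non-events in control),
    inflated by the design effect 1 + (m - 1) * icc for equal cluster sizes m."""
    var_independent = 1 / a + 1 / b + 1 / c + 1 / d   # standard large-sample variance
    design_effect = 1 + (cluster_size - 1) * icc      # inflation due to clustering
    return var_independent * design_effect

# 20 clusters of 25 students per arm, ICC = 0.05 (hypothetical counts).
a, b, c, d = 180, 320, 140, 360
log_or = math.log((a * d) / (b * c))
v_naive = log_odds_ratio_variance(a, b, c, d)
v_clustered = log_odds_ratio_variance(a, b, c, d, icc=0.05, cluster_size=25)
print(round(log_or, 3), round(v_naive, 4), round(v_clustered, 4))
# The clustering-corrected variance is 1 + 24 * 0.05 = 2.2 times the naive one,
# so the naive standard error understates uncertainty by about a third.
```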
Pub Date: 2025-10-23. DOI: 10.1177/00131644251371187
Network Approaches to Binary Assessment Data: Network Psychometrics Versus Latent Space Item Response Models
Ludovica De Carolis, Minjeong Jeon
This study compares two network-based approaches for analyzing binary psychological assessment data: network psychometrics and latent space item response modeling (LSIRM). Network psychometrics, a well-established method, infers relationships among items or symptoms based on pairwise conditional dependencies. In contrast, LSIRM is a more recent framework that represents item responses as a bipartite network of respondents and items embedded in a latent metric space, where the likelihood of a response decreases with increasing distance between the respondent and item. We evaluate the performance of both methods through simulation studies under varying data-generating conditions. In addition, we demonstrate their applications to real assessment data, showcasing the distinct insights each method offers to researchers and practitioners.
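In its commonly used form, the LSIRM response probability is P(Y_pi = 1) = logistic(θ_p + β_i - d(z_p, w_i)), where θ_p is a respondent intercept, β_i an item intercept, and d the Euclidean distance between the respondent's and item's positions in the latent space. The sketch below simulates binary responses from this form; the positions, intercepts, and unit distance weight are hypothetical, and the article's estimation procedure is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(5)
n_persons, n_items, dim = 300, 20, 2

# Latent positions of respondents (z) and items (w) in a 2-D metric space.
z = rng.normal(0, 1, (n_persons, dim))
w = rng.normal(0, 1, (n_items, dim))
theta = rng.normal(0, 1, n_persons)   # respondent intercepts (overall endorsement tendency)
beta = rng.normal(0, 1, n_items)      # item intercepts (overall easiness)

# Pairwise Euclidean distances between every respondent and every item.
dist = np.linalg.norm(z[:, None, :] - w[None, :, :], axis=2)

# LSIRM: the response probability decreases as the respondent-item distance grows.
logits = theta[:, None] + beta[None, :] - dist
prob = 1 / (1 + np.exp(-logits))
responses = rng.binomial(1, prob)

print(responses.shape, responses.mean().round(3))
```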
Pub Date: 2025-10-07. DOI: 10.1177/00131644251374302
Guessing During Testing is a Person Attribute Not an Instrument Parameter
Georgios D Sideridis, Mohammed Alghamdi
The three-parameter logistic (3PL) model in item-response theory (IRT) has long been used to account for guessing in multiple-choice assessments through a fixed item-level parameter. However, this approach treats guessing as a property of the test item rather than the individual, potentially misrepresenting the cognitive processes underlying the examinee's behavior. This study evaluates a novel alternative, the Two-Parameter Logistic Extension (2PLE) model, which re-conceptualizes guessing as a function of a person's ability rather than as an item-specific constant. Using Monte Carlo simulation and empirical data from the PIRLS 2021 reading comprehension assessment, we compared the 3PL and 2PLE models on the recovery of latent ability, predictive fit (Leave-One-Out Information Criterion [LOOIC]), and theoretical alignment with test-taking behavior. The simulation results demonstrated that although both models performed similarly in terms of root-mean-squared error (RMSE) for ability estimates, the 2PLE model consistently achieved superior LOOIC values across conditions, particularly with longer tests and larger sample sizes. In an empirical analysis involving the reading achievement of 131 fourth-grade students from Saudi Arabia, model comparison again favored 2PLE, with a statistically significant LOOIC difference (ΔLOOIC = 0.482, z = 2.54). Importantly, person-level guessing estimates derived from the 2PLE model were significantly associated with established person-fit statistics (C*, U3), supporting their criterion validity. These findings suggest that the 2PLE model provides a more cognitively plausible and statistically robust representation of examinee behavior by embedding an ability-dependent guessing function.
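For context, the standard 3PL item response function treats the guessing parameter c as a fixed property of the item. The sketch below contrasts it with a purely hypothetical ability-dependent guessing function g(θ), a decreasing logistic curve chosen only to illustrate the idea of person-level guessing; the 2PLE model's actual specification is given in the article.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Standard 3PL: fixed item-level guessing parameter c."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def p_person_guessing(theta, a, b, g0=0.35, k=1.5):
    """Illustrative alternative: guessing declines with ability via g(theta).
    The functional form here is hypothetical, not the 2PLE specification."""
    g = g0 / (1 + np.exp(k * theta))   # low-ability examinees guess more
    return g + (1 - g) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(p_3pl(theta, a=1.2, b=0.0, c=0.2), 3))
print(np.round(p_person_guessing(theta, a=1.2, b=0.0), 3))
# With a fixed c, the lower asymptote is 0.20 for everyone; with g(theta) it is
# higher for low-ability examinees and essentially vanishes for high-ability examinees.
```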