Consistent Factor Score Regression: A Better Alternative for Uncorrected Factor Score Regression?
Pub Date : 2026-01-04 | DOI: 10.1177/00131644251399588
Jasper Bogaert, Wen Wei Loh, Yves Rosseel
Researchers in the behavioral, educational, and social sciences often aim to analyze relationships among latent variables. Structural equation modeling (SEM) is widely regarded as the gold standard for this purpose. A straightforward alternative for estimating the structural model parameters is uncorrected factor score regression (UFSR), where factor scores are first computed and then employed in regression or path analysis. Unfortunately, the most commonly used factor scores (i.e., Regression and Bartlett factor scores) may yield biased estimates and invalid inferences when using this approach. In recent years, factor score regression (FSR) has enjoyed several methodological advancements to address this inconsistency. Despite these advancements, the use of FSR with correlation-preserving factor scores, here termed consistent factor score regression (cFSR), has received limited attention. In this paper, we revisit cFSR and compare its advantages and disadvantages relative to other recent FSR and SEM methods. We conducted an extensive simulation study comparing cFSR with other estimation approaches, assessing their performance in terms of convergence rate, bias, efficiency, and type I error rate. The findings indicate that cFSR outperforms UFSR while maintaining the conceptual simplicity of UFSR. We encourage behavioral, educational, and social science researchers to avoid UFSR and adopt cFSR as an alternative to SEM.
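As a rough illustration of the two-step logic behind factor score regression, the sketch below (a minimal illustration under assumed loadings, not the authors' implementation) simulates two single-factor blocks, computes regression-method factor scores, and then runs the uncorrected second-step regression; the resulting slope is attenuated relative to the true structural coefficient, which is the bias that corrected approaches such as cFSR aim to remove. The correlation-preserving scores themselves are not shown.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical one-factor measurement models for a predictor (xi) and an outcome (eta)
lam_x = np.array([0.8, 0.7, 0.6])        # assumed loadings for xi's indicators
lam_y = np.array([0.7, 0.7, 0.8])        # assumed loadings for eta's indicators
theta_x = 1 - lam_x**2                   # unique variances (standardized indicators)
theta_y = 1 - lam_y**2

# Simulate latent variables with a true structural slope of 0.4, then their indicators
xi = rng.normal(size=n)
eta = 0.4 * xi + rng.normal(scale=np.sqrt(1 - 0.4**2), size=n)
X = xi[:, None] * lam_x + rng.normal(size=(n, 3)) * np.sqrt(theta_x)
Y = eta[:, None] * lam_y + rng.normal(size=(n, 3)) * np.sqrt(theta_y)

def regression_scores(data, lam, theta):
    """Regression-method (Thurstone) factor scores for a one-factor model."""
    sigma = np.outer(lam, lam) + np.diag(theta)   # model-implied covariance
    weights = np.linalg.solve(sigma, lam)          # Sigma^{-1} * lambda (factor variance 1)
    return data @ weights

fs_x = regression_scores(X, lam_x, theta_x)
fs_y = regression_scores(Y, lam_y, theta_y)

# Uncorrected FSR: ordinary regression of outcome scores on predictor scores.
# The slope is attenuated relative to the true value of 0.4, which is the bias
# that corrected approaches such as cFSR are designed to address.
beta_ufsr = np.cov(fs_x, fs_y)[0, 1] / np.var(fs_x, ddof=1)
print(round(beta_ufsr, 3))
```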
{"title":"Consistent Factor Score Regression: A Better Alternative for Uncorrected Factor Score Regression?","authors":"Jasper Bogaert, Wen Wei Loh, Yves Rosseel","doi":"10.1177/00131644251399588","DOIUrl":"10.1177/00131644251399588","url":null,"abstract":"<p><p>Researchers in the behavioral, educational, and social sciences often aim to analyze relationships among latent variables. Structural equation modeling (SEM) is widely regarded as the gold standard for this purpose. A straightforward alternative for estimating the structural model parameters is uncorrected factor score regression (UFSR), where factor scores are first computed and then employed in regression or path analysis. Unfortunately, the most commonly used factor scores (i.e., Regression and Bartlett factor scores) may yield biased estimates and invalid inferences when using this approach. In recent years, factor score regression (FSR) has enjoyed several methodological advancements to address this inconsistency. Despite these advancements, the use of FSR with correlation-preserving factor scores, here termed consistent factor score regression (cFSR), has received limited attention. In this paper, we revisit cFSR and compare its advantages and disadvantages relative to other recent FSR and SEM methods. We conducted an extensive simulation study comparing cFSR with other estimation approaches, assessing their performance in terms of convergence rate, bias, efficiency, and type I error rate. The findings indicate that cFSR outperforms UFSR while maintaining the conceptual simplicity of UFSR. We encourage behavioral, educational, and social science researchers to avoid UFSR and adopt cFSR as an alternative to SEM.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251399588"},"PeriodicalIF":2.3,"publicationDate":"2026-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12774818/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145932674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empowering Expert Judgment: A Data-Driven Decision Framework for Standard Setting in High-Dimensional and Data-Scarce Assessments.
Pub Date : 2026-01-02 | DOI: 10.1177/00131644251405406
Tianpeng Zheng, Zhehan Jiang, Zhichen Guo, Yuanfang Liu
A critical methodological challenge in standard setting arises in small-sample, high-dimensional contexts where the number of items substantially exceeds the number of examinees. Under such conditions, conventional data-driven methods that rely on parametric models (e.g., item response theory) often become unstable or fail due to unreliable parameter estimation. This study investigates two families of data-driven methods: information-theoretic and unsupervised clustering, offering a potential solution to this challenge. Using a Monte Carlo simulation, we systematically evaluate 15 such methods to establish an evidence-based framework for practice. The simulation manipulated five factors, including sample size, the item-to-examinee ratio, mixture proportions, item quality, and ability separation. Method performance was evaluated using multiple criteria, including Relative Error, Classification Accuracy, Sensitivity, Specificity, and Youden's Index. Results indicated that no single method is universally superior; the optimal choice depends on the examinee mixture proportion. Specifically, the information-theoretic method QIR (quantile information ratio) excelled in scenarios with a dominant non-competent group, where high specificity was critical. Conversely, in highly selective contexts with balanced proficiency groups, the clustering methods CHI (Calinski-Harabasz index) and sum of squared error (SSE) demonstrated the highest classification effectiveness. Bayesian kernel density estimation (BKDE), however, consistently performed as a robust, balanced method across conditions. These findings provide practitioners with a clear decision framework for selecting a defensible, data-driven standard-setting method when traditional approaches are infeasible.
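For readers unfamiliar with the clustering side of this toolbox, the sketch below shows one hypothetical way an unsupervised method could be applied to sparse, high-dimensional response data: k-means with two clusters, evaluated with the Calinski-Harabasz index, and a provisional cut score placed between the cluster means. It only illustrates the general idea, not the specific procedures compared in the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(7)

# Hypothetical small-sample, high-dimensional data: 40 examinees, 120 binary items,
# drawn from a two-group mixture (non-competent vs. competent).
n_low, n_high, n_items = 25, 15, 120
p_low, p_high = 0.45, 0.75
responses = np.vstack([
    rng.binomial(1, p_low,  size=(n_low,  n_items)),
    rng.binomial(1, p_high, size=(n_high, n_items)),
])

# Cluster examinees into two proficiency groups based on their response profiles.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(responses)
labels = km.labels_
print("Calinski-Harabasz index:", round(calinski_harabasz_score(responses, labels), 1))

# One simple way to turn the partition into a cut score: place it midway between
# the mean total scores of the two clusters (illustrative only).
totals = responses.sum(axis=1)
cluster_means = [totals[labels == k].mean() for k in (0, 1)]
cut_score = float(np.mean(cluster_means))
print("Provisional cut score:", round(cut_score, 1))
```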
{"title":"Empowering Expert Judgment: A Data-Driven Decision Framework for Standard Setting in High-Dimensional and Data-Scarce Assessments.","authors":"Tianpeng Zheng, Zhehan Jiang, Zhichen Guo, Yuanfang Liu","doi":"10.1177/00131644251405406","DOIUrl":"10.1177/00131644251405406","url":null,"abstract":"<p><p>A critical methodological challenge in standard setting arises in small-sample, high-dimensional contexts where the number of items substantially exceeds the number of examinees. Under such conditions, conventional data-driven methods that rely on parametric models (e.g., item response theory) often become unstable or fail due to unreliable parameter estimation. This study investigates two families of data-driven methods: information-theoretic and unsupervised clustering, offering a potential solution to this challenge. Using a Monte Carlo simulation, we systematically evaluate 15 such methods to establish an evidence-based framework for practice. The simulation manipulated five factors, including sample size, the item-to-examinee ratio, mixture proportions, item quality, and ability separation. Method performance was evaluated using multiple criteria, including Relative Error, Classification Accuracy, Sensitivity, Specificity, and Youden's Index. Results indicated that no single method is universally superior; the optimal choice depends on the examinee mixture proportion. Specifically, the information-theoretic method QIR (quantile information ratio) excelled in scenarios with a dominant non-competent group, where high specificity was critical. Conversely, in highly selective contexts with balanced proficiency groups, the clustering methods CHI (Calinski-Harabasz index) and sum of squared error (SSE) demonstrated the highest classification effectiveness. Bayesian kernel density estimation (BKDE), however, consistently performed as a robust, balanced method across conditions. These findings provide practitioners with a clear decision framework for selecting a defensible, data-driven standard-setting method when traditional approaches are infeasible.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251405406"},"PeriodicalIF":2.3,"publicationDate":"2026-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12764426/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145905891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of Residual-Based Fit Statistics for Item Response Theory Models in the Presence of Non-Responses.
Pub Date : 2025-12-24 | DOI: 10.1177/00131644251393444
Minho Lee, Juyoung Jung
Residual-based fit statistics, which compare observed item statistics (e.g., proportions) with model-implied probabilities, are widely used to evaluate model fit, item fit, and local dependence in item response theory (IRT) models. Despite the prevalence of item non-responses in empirical studies, their impact on these statistics has not been systematically examined. Existing software packages often apply heuristic treatments (e.g., listwise or pairwise deletion), which can distort fit statistics because missing data further inflate discrepancies between observed and expected proportions. This study evaluates the appropriateness of such treatments through extensive simulation. Results show that deletion methods degrade the accuracy of fit testing: fit indices are inflated under both null and power conditions, with the bias worsening as missingness increases. In addition, the impact of missing data exceeds that of model misspecification. Practical recommendations and alternative methods are discussed to guide applied researchers.
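The kind of discrepancy a residual-based statistic summarizes, and how deletion of non-responses can distort it, can be sketched as follows. This is a minimal illustration with simulated data and an ability-dependent missingness mechanism; it is not one of the specific fit statistics or software treatments evaluated in the study.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Simulate two binary items from a simple one-factor (2PL-like) model.
theta = rng.normal(size=n)
p1 = 1 / (1 + np.exp(-(1.2 * theta - 0.2)))
p2 = 1 / (1 + np.exp(-(0.9 * theta + 0.3)))
x1 = rng.binomial(1, p1).astype(float)
x2 = rng.binomial(1, p2).astype(float)

# Model-implied joint proportion P(X1 = 1, X2 = 1), approximated by averaging over theta.
pi_11 = np.mean(p1 * p2)

# Impose missingness on item 2 that depends on theta (more likely for low-ability examinees).
miss = rng.random(n) < (0.30 + 0.25 * (theta < 0))
x2_obs = x2.copy()
x2_obs[miss] = np.nan

# Residual with complete data vs. after pairwise deletion of the missing cases;
# the deleted-data residual is typically inflated because missingness depends on ability.
resid_full = np.mean((x1 == 1) & (x2 == 1)) - pi_11
keep = ~np.isnan(x2_obs)
resid_pairwise = np.mean((x1[keep] == 1) & (x2_obs[keep] == 1)) - pi_11
print(round(resid_full, 3), round(resid_pairwise, 3))
```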
{"title":"Evaluation of Residual-Based Fit Statistics for Item Response Theory Models in the Presence of Non-Responses.","authors":"Minho Lee, Juyoung Jung","doi":"10.1177/00131644251393444","DOIUrl":"10.1177/00131644251393444","url":null,"abstract":"<p><p>Residual-based fit statistics, which compare observed item statistics (e.g., proportions) with model-implied probabilities, are widely used to evaluate model fit, item fit, and local dependence in item response theory (IRT) models. Despite the prevalence of item non-responses in empirical studies, their impact on these statistics has not been systematically examined. Existing software (package) often applies heuristic treatments (e.g., listwise or pairwise deletion), which can distort fit statistics because missing data further inflate discrepancies between observed and expected proportions. This study evaluates the appropriateness of such treatments through extensive simulation. Results show that deletion methods degrade the accuracy of fit testing: fit indices are inflated under both null and power conditions, with the bias worsening as missingness increases. In addition, the impact of missing data exceeds that of model misspecification. Practical recommendations and alternative methods are discussed to guide applied researchers.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251393444"},"PeriodicalIF":2.3,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12738280/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145849277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conditional Reliability of Weighted Test Scores on a Bounded D-Scale.
Pub Date : 2025-12-20 | DOI: 10.1177/00131644251396543
Dimiter M Dimitrov, Dimitar V Atanasov
Based on previous research on conditional reliability for number-correct test scores, conditioned on levels of the logit scale in item response theory, this article deals with conditional reliability of classical-type weighted scores conditioned on latent levels of a bounded scale. This is done in the framework of the D-scoring method of measurement (D-scale, bounded from 0 to 1). Along with the conditional reliability of weighted D-scores, conditioned on latent levels of the D-scale, presented are some additional measures of precision: conditional standard error, conditional signal-to-noise ratio, and marginal reliability. R syntax code for all computations is also provided.
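For orientation, the display below gives the generic classical-test-theory relations among conditional true-score variance, conditional standard error, conditional signal-to-noise ratio, and conditional reliability at a latent level d; the exact expressions for weighted D-scores on the bounded D-scale are derived in the article itself.

```latex
% Generic conditional relations (assumed standard CTT form, not the article's D-scale formulas):
\[
  \mathrm{SNR}(d) \;=\; \frac{\sigma_T^{2}(d)}{\mathrm{SE}^{2}(d)},
  \qquad
  \rho(d) \;=\; \frac{\sigma_T^{2}(d)}{\sigma_T^{2}(d) + \mathrm{SE}^{2}(d)}
  \;=\; \frac{\mathrm{SNR}(d)}{1 + \mathrm{SNR}(d)}.
\]
```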
{"title":"Conditional Reliability of Weighted Test Scores on a Bounded <i>D</i>-Scale.","authors":"Dimiter M Dimitrov, Dimitar V Atanasov","doi":"10.1177/00131644251396543","DOIUrl":"10.1177/00131644251396543","url":null,"abstract":"<p><p>Based on previous research on conditional reliability for number-correct test scores, conditioned on levels of the logit scale in item response theory, this article deals with conditional reliability of classical-type weighted scores conditioned on latent levels of a bounded scale. This is done in the framework of the <i>D</i>-scoring method of measurement (<i>D</i>-scale, bounded from 0 to 1). Along with the conditional reliability of weighted <i>D</i>-scores, conditioned on latent levels of the <i>D</i>-scale, presented are some additional measures of precision-conditional standard error, conditional signal-to-noise ratio, and marginal reliability. A syntax code (in R) for all computations is also provided.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251396543"},"PeriodicalIF":2.3,"publicationDate":"2025-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12718170/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145809788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collapsing Sparse Responses in Likert-Type Scale Data: Advantages and Disadvantages for Model Fit in CFA.
Pub Date : 2025-12-19 | DOI: 10.1177/00131644251401097
Jin Liu, Yu Bao, Christine DiStefano, Wei Jiang
Applied researchers often encounter situations where certain item response categories receive very few endorsements, resulting in sparse data. Collapsing categories may mitigate sparsity by increasing cell counts, yet the methodological consequences of this practice remain insufficiently explored. The current study examined the effects of response collapsing in Likert-type scale data through a simulation study under the confirmatory factor analysis model. Sparse response categories were collapsed to determine the impact on fit indices (i.e., chi-square, comparative fit index [CFI], Tucker-Lewis index [TLI], root mean square error of approximation [RMSEA], and standardized root mean square residual [SRMR]). Findings indicate that category collapsing has a significant impact when sparsity is severe, leading to reduced model rejections in both correctly specified and misspecified models. In addition, different fit indices exhibited varying sensitivities to data collapsing. Specifically, RMSEA was recommended for the correctly specified model, and TLI with a cut-off value of .95 was recommended for the misspecified models. The empirical analysis was aligned with the simulation results. These results provide valuable insights for researchers confronted with sparse data in applied measurement contexts.
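One simple collapsing rule of the kind studied here can be sketched as follows; the 5% threshold and the merge-downward rule are illustrative assumptions, not the scheme used in the simulation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)

# Hypothetical 5-point Likert item with very few endorsements of categories 4 and 5.
item = pd.Series(rng.choice([1, 2, 3, 4, 5], size=1000,
                            p=[0.40, 0.35, 0.20, 0.04, 0.01]))
print(item.value_counts(normalize=True).sort_index())

def collapse_sparse(x, min_prop=0.05):
    """Merge any category endorsed by fewer than min_prop of respondents
    into the next lower category (one simple collapsing rule)."""
    x = x.copy()
    props = x.value_counts(normalize=True).sort_index()
    for cat in props.index[::-1]:            # work from the highest category down
        if props.get(cat, 0) < min_prop and cat > props.index.min():
            x = x.replace(cat, cat - 1)
            props = x.value_counts(normalize=True).sort_index()
    return x

collapsed = collapse_sparse(item)
print(collapsed.value_counts(normalize=True).sort_index())
```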
{"title":"Collapsing Sparse Responses in Likert-Type Scale Data: Advantages and Disadvantages for Model Fit in CFA.","authors":"Jin Liu, Yu Bao, Christine DiStefano, Wei Jiang","doi":"10.1177/00131644251401097","DOIUrl":"10.1177/00131644251401097","url":null,"abstract":"<p><p>Applied researchers often encounter situations where certain item response categories receive very few endorsements, resulting in sparse data. Collapsing categories may mitigate sparsity by increasing cell counts, yet the methodological consequences of this practice remain insufficiently explored. The current study examined the effects of response collapsing in Likert-type scale data through a simulation study under the confirmatory factor analysis model. Sparse response categories were collapsed to determine the impact on fit indices (i.e., chi-square, comparative fit index [CFI], Tucker-Lewis index [TLI], root mean square error of approximation [RMSEA], and standardized root mean square residual [SRMR]). Findings indicate that category collapsing has a significant impact when sparsity is severe, leading to reduced model rejections in both correctly specified and misspecified models. In addition, different fit indices exhibited varying sensitivities to data collapsing. Specifically, RMSEA was recommended for the correctly identified model, and TLI with a cut-off value of .95 was recommended for the misspecified models. The empirical analysis was aligned with the simulation results. These results provide valuable insights for researchers confronted with sparse data in applied measurement contexts.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251401097"},"PeriodicalIF":2.3,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716976/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145803112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Experimental Study on the Impact of Survey Stakes on Response Inconsistency in Mixed-Worded Scales.
Pub Date : 2025-12-19 | DOI: 10.1177/00131644251395323
Michalis P Michaelides, Evi Konstantinidou
Respondent behavior in questionnaires may vary in terms of attention, effort, and consistency depending on the survey administration context and motivational conditions. This pre-registered experimental study examined whether motivational context influences response inconsistency, response times, and the role of conscientiousness in survey responding. A sample of 66 university students in Cyprus completed five psychological scales under both low-stakes and high-stakes instructions in a counterbalanced within-subjects design. To identify inconsistent respondents, two index-based methods were used: the mean absolute difference (MAD) index and Mahalanobis distance. Results showed that inconsistent responding was somewhat more frequent under low-stakes conditions, although differences were generally small and significant only for selected scales when using a lenient MAD threshold. By contrast, internal consistency reliability was slightly higher, and response times were significantly longer under high-stakes instructions, indicating greater deliberation. Conscientiousness predicted lower inconsistency only in the low-stakes condition. Overall, high-stakes instructions did not substantially reduce inconsistent responding but fostered longer response times and modest gains in reliability, suggesting enhanced behavioral engagement. Implications for survey design and data quality in psychological and educational research are discussed.
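The two indices can be computed in a few lines. The sketch below assumes hypothetical item pairs and illustrative cut-offs (e.g., a MAD threshold of 1.0 and a 95th-percentile Mahalanobis cut); the study's actual scales and lenient/strict thresholds differ.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 10   # respondents and item pairs

# Hypothetical scale with k positively worded items and k reverse-worded counterparts,
# scored 1-5; reverse-worded items are assumed already recoded, so consistent
# respondents give similar answers to both members of a pair.
trait = rng.normal(3, 0.7, size=n)
pos = np.clip(np.round(trait[:, None] + rng.normal(0, 0.6, (n, k))), 1, 5)
neg = np.clip(np.round(trait[:, None] + rng.normal(0, 0.6, (n, k))), 1, 5)

# Mean absolute difference (MAD) index: average |positive - recoded negative| per person.
mad = np.abs(pos - neg).mean(axis=1)

# Mahalanobis distance of each respondent's item-score vector from the sample centroid.
scores = np.hstack([pos, neg])
centered = scores - scores.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(scores, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Flag respondents who exceed a lenient cut-off on either index (illustrative thresholds).
flagged = (mad > 1.0) | (d2 > np.quantile(d2, 0.95))
print(int(flagged.sum()), "of", n, "respondents flagged")
```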
{"title":"An Experimental Study on the Impact of Survey Stakes on Response Inconsistency in Mixed-Worded Scales.","authors":"Michalis P Michaelides, Evi Konstantinidou","doi":"10.1177/00131644251395323","DOIUrl":"10.1177/00131644251395323","url":null,"abstract":"<p><p>Respondent behavior in questionnaires may vary in terms of attention, effort, and consistency depending on the survey administration context and motivational conditions. This pre-registered experimental study examined whether motivational context influences response inconsistency, response times, and the role of conscientiousness in survey responding. A sample of 66 university students in Cyprus completed five psychological scales under both low-stakes and high-stakes instructions in a counterbalanced within-subjects design. To identify inconsistent respondents, two index-based methods were used: the mean absolute difference (MAD) index and Mahalanobis distance. Results showed that inconsistent responding was somewhat more frequent under low-stakes conditions, although differences were generally small and significant only for selected scales when using a lenient MAD threshold. By contrast, internal consistency reliability was slightly higher, and response times were significantly longer under high-stakes instructions, indicating greater deliberation. Conscientiousness predicted lower inconsistency only in the low-stakes condition. Overall, high-stakes instructions did not substantially reduce inconsistent responding but fostered longer response times and modest gains in reliability, suggesting enhanced behavioral engagement. Implications for survey design and data quality in psychological and educational research are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251395323"},"PeriodicalIF":2.3,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716977/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145803101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconceptualizing Scoring Reliability Through Linguistic Similarity.
Pub Date : 2025-12-19 | DOI: 10.1177/00131644251397428
Ji Yoon Jung, Ummugul Bezirhan, Matthias von Davier
Conventional cross-country scoring reliability in international large-scale assessments often depends on double scoring, which typically involves relatively small samples of multilingual responses. To extend the reach of reliability estimation, this study introduces the Linguistic-integrated Reliability Audit (LiRA), a novel method that measures scoring reliability using an entire dataset in a large-scale, multilingual context. LiRA automatically generates a second score for each response by analyzing its semantic alignment within a neighborhood of similar responses, then applies a weighted majority voting to determine a consensus score. Results demonstrate that LiRA provides a more comprehensive and systematic estimation of scoring reliability at the item, country, and language levels, while preserving the fundamental concepts of traditional reliability.
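The following toy sketch conveys the neighborhood-plus-weighted-vote idea in miniature. TF-IDF vectors and cosine similarity here are stand-ins for the multilingual semantic embeddings LiRA relies on, and the consensus rule is a simple similarity-weighted majority vote; it should not be read as LiRA's actual implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy human-scored constructed responses (scores 0/1).
responses = [
    "the plant needs sunlight and water to grow",
    "plants grow because of water and light",
    "it grows when you give it water",
    "the moon makes plants grow",
    "sunlight and water make the plant grow bigger",
    "plants need soil light and water",
]
human_scores = np.array([1, 1, 0, 0, 1, 1])

emb = TfidfVectorizer().fit_transform(responses)
nn = NearestNeighbors(n_neighbors=4, metric="cosine").fit(emb)
dist, idx = nn.kneighbors(emb)

# Second, machine-generated score: similarity-weighted majority vote over the
# neighborhood of each response (excluding the response itself).
second_scores = []
for i in range(len(responses)):
    neigh, d = idx[i][1:], dist[i][1:]
    weights = 1 - d                              # cosine similarity as the vote weight
    vote = np.bincount(human_scores[neigh], weights=weights, minlength=2)
    second_scores.append(int(vote.argmax()))

agreement = np.mean(np.array(second_scores) == human_scores)
print("exact agreement with human scores:", round(agreement, 2))
```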
{"title":"Reconceptualizing Scoring Reliability Through Linguistic Similarity.","authors":"Ji Yoon Jung, Ummugul Bezirhan, Matthias von Davier","doi":"10.1177/00131644251397428","DOIUrl":"10.1177/00131644251397428","url":null,"abstract":"<p><p>Conventional cross-country scoring reliability in international large-scale assessments often depends on double scoring, which typically involves relatively small samples of multilingual responses. To extend the reach of reliability estimation, this study introduces the Linguistic-integrated Reliability Audit (LiRA), a novel method that measures scoring reliability using an entire dataset in a large-scale, multilingual context. LiRA automatically generates a second score for each response by analyzing its semantic alignment within a neighborhood of similar responses, then applies a weighted majority voting to determine a consensus score. Results demonstrate that LiRA provides a more comprehensive and systematic estimation of scoring reliability at the item, country, and language levels, while preserving the fundamental concepts of traditional reliability.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251397428"},"PeriodicalIF":2.3,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716981/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145803134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When Cluster-Robust Inferences Fail.
Pub Date : 2025-12-19 | DOI: 10.1177/00131644251393203
Francis Huang
Although cluster-robust standard errors (CRSEs) are commonly used to account for violations of the independence of observations found in nested data, an underappreciated issue is that there are several instances when CRSEs can fail to properly maintain the nominal Type I error rate. These situations (e.g., analyzing data with imbalanced cluster sizes) can readily be found in various types of education-related datasets and are important to consider when computing statistical inference tests with cluster-level predictors. Using a Monte Carlo simulation, we investigated these conditions and tested alternative estimators and degrees of freedom (df) adjustments to assess how well they could ameliorate the issues related to the use of the traditional CRSE (CR1) estimator using both continuous and dichotomous predictors. Findings showed that the bias-reduced linearization estimator (CR2) and the jackknife estimator (CR3) together with df adjustments were generally effective at maintaining Type I error rates for most of the conditions tested. Results also indicated that CR1, when paired with df based on the effective cluster size, was acceptable. We emphasize the importance of clearly describing the nested data structure, as the characteristics of the dataset can influence Type I error rates when using CRSEs.
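For reference, the CR1 estimator discussed above is the standard "sandwich" covariance with a small-sample correction; the sketch below computes it for a cluster-level predictor with deliberately imbalanced cluster sizes. The CR2, CR3, and degrees-of-freedom adjustments examined in the study are not implemented here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical nested data with imbalanced cluster sizes and a cluster-level predictor.
sizes = np.array([200, 150, 60, 20, 10, 10, 5, 5, 5, 5])
G = len(sizes)
cluster = np.repeat(np.arange(G), sizes)
z = rng.normal(size=G)                       # cluster-level predictor (true slope is 0)
u = rng.normal(scale=0.5, size=G)            # random cluster effects
y = u[cluster] + rng.normal(size=sizes.sum())

X = np.column_stack([np.ones_like(y), z[cluster]])
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)

# CR1 cluster-robust covariance: sum of cluster score outer products ("meat")
# sandwiched between (X'X)^{-1}, times a small-sample correction factor.
meat = np.zeros((2, 2))
for g in range(G):
    Xg, eg = X[cluster == g], resid[cluster == g]
    sg = Xg.T @ eg
    meat += np.outer(sg, sg)
n, k = X.shape
c = (G / (G - 1)) * ((n - 1) / (n - k))
vcov_cr1 = c * bread @ meat @ bread
print("CR1 SE for the cluster-level slope:", round(float(np.sqrt(vcov_cr1[1, 1])), 3))
```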
{"title":"When Cluster-Robust Inferences Fail.","authors":"Francis Huang","doi":"10.1177/00131644251393203","DOIUrl":"10.1177/00131644251393203","url":null,"abstract":"<p><p>Although cluster-robust standard errors (CRSEs) are commonly used to account for violations of observations independence found in nested data, an underappreciated issue is that there are several instances when CRSEs can fail to properly maintain the nominally accepted Type I error rate. These situations (e.g., analyzing data with imbalanced cluster sizes) can readily be found in various types of education-related datasets and are important to consider when computing statistical inference tests when using cluster-level predictors. Using a Monte Carlo simulation, we investigated these conditions and tested alternative estimators and degrees of freedom (df) adjustments to assess how well they could ameliorate the issues related to the use of the traditional CRSE (CR1) estimator using both continuous and dichotomous predictors. Findings showed that the bias-reduced linearization estimator (CR2) and the jackknife estimator (CR3) together with df adjustments were generally effective at maintaining Type I error rates for most of the conditions tested. Results also indicated that the CR1 when paired with df based on the effective cluster size was also acceptable. We emphasize the importance of clearly describing the nested data structure as the characteristics of the dataset can influence Type I error rates when using CRSEs.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251393203"},"PeriodicalIF":2.3,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12716980/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145803212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From Linear Geometry to Nonlinear and Information-Geometric Settings in Test Theory: Bregman Projections as a Unifying Framework.
Pub Date : 2025-12-12 | DOI: 10.1177/00131644251393483
Bruno D Zumbo
This article develops a unified geometric framework linking expectation, regression, test theory, reliability, and item response theory through the concept of Bregman projection. Building on operator-theoretic and convex-analytic foundations, the framework extends the linear geometry of classical test theory (CTT) into nonlinear and information-geometric settings. Reliability and regression emerge as measures of projection efficiency: linear in Hilbert space and nonlinear under convex potentials. The exposition demonstrates that classical conditional expectation, least-squares regression, and information projections in exponential-family models share a common mathematical structure defined by Bregman divergence. By situating CTT within this broader geometric context, the article clarifies relationships between measurement, expectation, and statistical inference, providing a coherent foundation for nonlinear measurement and estimation in psychometrics.
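The central object can be stated compactly. The display below gives the standard definition of the Bregman divergence generated by a differentiable, strictly convex potential φ; the squared-Euclidean special case recovers the ordinary least-squares projection underlying the linear geometry of classical test theory.

```latex
\[
  D_{\phi}(x, y) \;=\; \phi(x) - \phi(y) - \langle \nabla \phi(y),\, x - y \rangle,
  \qquad
  \phi(x) = \tfrac{1}{2}\lVert x \rVert^{2}
  \;\Longrightarrow\;
  D_{\phi}(x, y) = \tfrac{1}{2}\lVert x - y \rVert^{2}.
\]
```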
{"title":"From Linear Geometry to Nonlinear and Information-Geometric Settings in Test Theory: Bregman Projections as a Unifying Framework.","authors":"Bruno D Zumbo","doi":"10.1177/00131644251393483","DOIUrl":"10.1177/00131644251393483","url":null,"abstract":"<p><p>This article develops a unified geometric framework linking expectation, regression, test theory, reliability, and item response theory through the concept of Bregman projection. Building on operator-theoretic and convex-analytic foundations, the framework extends the linear geometry of classical test theory (CTT) into nonlinear and information-geometric settings. Reliability and regression emerge as measures of projection efficiency-linear in Hilbert space and nonlinear under convex potentials. The exposition demonstrates that classical conditional expectation, least-squares regression, and information projections in exponential-family models share a common mathematical structure defined by Bregman divergence. By situating CTT within this broader geometric context, the article clarifies relationships between measurement, expectation, and statistical inference, providing a coherent foundation for nonlinear measurement and estimation in psychometrics.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251393483"},"PeriodicalIF":2.3,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12701833/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145762466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliability as Projection in Operator-Theoretic Test Theory: Conditional Expectation, Hilbert Space Geometry, and Implications for Psychometric Practice.
Pub Date : 2025-11-12 | DOI: 10.1177/00131644251389891
Bruno D Zumbo
This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation E(X ∣ G), representing the orthogonal projection of the observed score onto the σ-algebra of the latent variable. Reliability, expressed as Rel(X) = Var[E(X ∣ G)] / Var(X), quantifies the efficiency of this projection: the squared cosine between X and its true-score projection. This formulation unifies reliability with regression R², factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.
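A minimal numerical illustration of this identity (not the article's own examples or proofs) is easy to run: in a linear true-score model the projection of X onto the σ-algebra generated by T is T itself, so the variance ratio, the squared correlation with the projection, and the usual variance-ratio formula for reliability all coincide.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Linear true-score model X = T + E with E(X | T) = T, so the projection of X onto
# the sigma-algebra generated by T is T itself.
T = rng.normal(0, 1.0, size=n)          # true score
E = rng.normal(0, 0.7, size=n)          # error, independent of T
X = T + E

rel_projection = np.var(T) / np.var(X)          # Var[E(X | G)] / Var(X)
rel_cos2 = np.corrcoef(X, T)[0, 1] ** 2         # squared correlation (cosine) with the projection
rel_theoretical = 1.0 / (1.0 + 0.7**2)          # sigma_T^2 / (sigma_T^2 + sigma_E^2)
print(round(rel_projection, 3), round(rel_cos2, 3), round(rel_theoretical, 3))
```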
{"title":"Reliability as Projection in Operator-Theoretic Test Theory: Conditional Expectation, Hilbert Space Geometry, and Implications for Psychometric Practice.","authors":"Bruno D Zumbo","doi":"10.1177/00131644251389891","DOIUrl":"10.1177/00131644251389891","url":null,"abstract":"<p><p>This article reconceptualizes reliability as a theorem derived from the projection geometry of Hilbert space rather than an assumption of classical test theory. Within this framework, the true score is defined as the conditional expectation <math><mrow><mi>E</mi> <mo>(</mo> <mi>X</mi> <mo>∣</mo> <mi>G</mi> <mo>)</mo></mrow> </math> , representing the orthogonal projection of the observed score onto the σ-algebra of the latent variable. Reliability, expressed as <math><mrow><mi>Rel</mi> <mo>(</mo> <mi>X</mi> <mo>)</mo> <mo>=</mo> <mi>Var</mi> <mo>[</mo> <mi>E</mi> <mo>(</mo> <mi>X</mi> <mo>∣</mo> <mi>G</mi> <mo>)</mo> <mo>]</mo> <mo>/</mo> <mi>Var</mi> <mo>(</mo> <mi>X</mi> <mo>)</mo></mrow> </math> , quantifies the efficiency of this projection-the squared cosine between <math><mrow><mi>X</mi> <mspace></mspace></mrow> </math> and its true-score projection. This formulation unifies reliability with regression <math> <mrow> <msup><mrow><mi>R</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> , factor-analytic communality, and predictive accuracy in stochastic models. The operator-theoretic perspective clarifies that measurement error corresponds to the orthogonal complement of the projection, and reliability reflects the alignment between observed and latent scores. Numerical examples and measure-theoretic proofs illustrate the framework's generality. The approach provides a rigorous mathematical foundation for reliability, connecting psychometric theory with modern statistical and geometric analysis.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251389891"},"PeriodicalIF":2.3,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12615236/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145539189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}