Faking in High-Stakes Personality Assessments: A Response-Time-Based Latent Response Mixture Modeling Approach
Pub Date: 2026-03-18 | DOI: 10.1177/00131644261422169
Timo Seitz, Esther Ulitzsch
When personality assessments are employed in high-stakes contexts, there is a risk that test-takers will provide overly positive descriptions of themselves. This response bias is known as faking and has often been addressed in latent variable models through an additional dimension capturing each test-taker's faking degree. Such models typically assume a homogeneous response strategy for all test-takers, with substantive traits and faking jointly influencing responses to all items. In this article, we present a latent response mixture item response theory (IRT) model of faking that accounts for changes in test-takers' response strategies over the course of the assessment. The model translates theoretical considerations about test-taker behavior into different model components for item responses and corresponding item-level response times (RT), thereby making it possible to account for, identify, and investigate different faking-related response strategies at the person-by-item level. In a parameter recovery study, we found that the model parameters can be estimated well under realistic conditions. We also applied the model to an empirical dataset (N = 1,824) from a job application context, showcasing its utility in real high-stakes assessment data. We conclude the article by discussing the role of the model for psychological measurement as well as substantive research.
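The person-by-item classification idea can be illustrated with a toy two-component calculation: given assumed component parameters for responses and log response times, Bayes' rule yields the posterior probability that a particular response was produced by a faking strategy. The function and all parameter values below are hypothetical and greatly simplified relative to the article's latent response mixture IRT model.

```python
import numpy as np
from scipy.stats import norm

def posterior_faking(y, log_rt, p_trait, p_fake, mu_rt_trait, mu_rt_fake,
                     sd_rt, pi_fake):
    """Posterior probability that one item response was generated by a faking
    strategy rather than trait-based responding, under a toy two-component model.

    y           : dichotomized item score (1 = desirable response)
    log_rt      : observed log response time for that item
    p_trait     : P(y = 1 | trait-based responding)
    p_fake      : P(y = 1 | faking), typically near 1 for desirable items
    mu_rt_*     : component means of log RT (faking assumed faster here)
    sd_rt       : common log-RT standard deviation (simplifying assumption)
    pi_fake     : prior probability of faking for this person-item pair
    """
    lik_fake = (p_fake ** y) * ((1 - p_fake) ** (1 - y)) * norm.pdf(log_rt, mu_rt_fake, sd_rt)
    lik_trait = (p_trait ** y) * ((1 - p_trait) ** (1 - y)) * norm.pdf(log_rt, mu_rt_trait, sd_rt)
    return pi_fake * lik_fake / (pi_fake * lik_fake + (1 - pi_fake) * lik_trait)

# A desirable response given unusually quickly looks more like faking.
print(posterior_faking(y=1, log_rt=np.log(2.0), p_trait=.55, p_fake=.95,
                       mu_rt_trait=np.log(5.0), mu_rt_fake=np.log(2.5),
                       sd_rt=.4, pi_fake=.30))
```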
Conditional Dependencies Between Response Time and Item Discrimination: An Item-Level Meta-Analysis
Pub Date: 2026-03-17 | DOI: 10.1177/00131644261426972
Joshua B Gilbert, William S Young, Zachary Himmelsbach, Esther Ulitzsch, Benjamin W Domingue
The use of process data, such as response time (RT), in psychometrics has generally focused on the relationship between speed and accuracy. The potential relationships between RT and item discrimination remain less explored. In this study, we propose a model for simultaneously estimating the relationships between RT and item discrimination at the person, item, and person-by-item (residual) levels and illustrate our approach through an item-level meta-analysis of 40 empirical data sets comprising 1.84 million item responses. We find no evidence of average differences in item discrimination between items of different time intensity or persons of different average RT, while residual RT strongly and negatively predicts item discrimination (pooled coef. = -.27% per 1% difference in RT, SE = .04, τ = .17). While heterogeneity is high, we find little evidence of moderation by overall data set characteristics. Flexible generalized additive models show that the relationship between residual RT and item discrimination is generally curvilinear, with discrimination maximized just below average RT and minimized at the extremes. Our results suggest that RT data can provide insights into the measurement properties of educational and psychological assessments, but that the relationships between RT and item discrimination are highly variable.
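The person-by-item (residual) RT component referred to above is commonly obtained by double-centering log response times, removing each person's overall speed and each item's time intensity. The sketch below shows only this step with simulated data; the function name is an illustrative assumption, and the article's full model additionally links residual RT to item discrimination.

```python
import numpy as np

def residual_log_rt(rt):
    """Person-by-item residual log response times for a persons x items array.

    Double-centering removes each person's overall speed and each item's time
    intensity; what remains is the residual RT component discussed above.
    Missing responses would require masked arrays and are omitted here.
    """
    log_rt = np.log(rt)
    person_mean = log_rt.mean(axis=1, keepdims=True)   # person speed
    item_mean = log_rt.mean(axis=0, keepdims=True)     # item time intensity
    return log_rt - person_mean - item_mean + log_rt.mean()

rng = np.random.default_rng(1)
rt = rng.lognormal(mean=3.0, sigma=0.5, size=(200, 20))  # 200 persons x 20 items, in seconds
resid = residual_log_rt(rt)
print(resid.mean(axis=0).round(3))  # item means are ~0 by construction
```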
Estimating Trends With Differential Item Functioning: A Comparison of Five IRT-Based Approaches
Pub Date: 2026-03-13 | DOI: 10.1177/00131644251408818
Oskar Engels, Oliver Lüdtke, Alexander Robitzsch
In longitudinal assessments, tests are frequently used to estimate trends over time. However, when item parameters lack invariance, time-point comparisons can be distorted, necessitating appropriate statistical methods to achieve accurate estimation. This study compares trend estimates using the two-parameter logistic (2PL) model under item parameter drift (IPD) across five trend-estimation approaches for two time points: First, concurrent calibration, which jointly estimates item parameters across multiple time points. Second, fixed calibration, which estimates item parameters at a single time point and fixes them at the other time point. Third, robust linking with Haberman and Haebara as linking methods with L_p or L_0 losses. Fourth, non-invariant items are detected using likelihood-ratio tests or the root mean square deviation statistic with fixed or data-driven cutoffs, and trend estimates are then recomputed using only the detected invariant items under partial invariance. Fifth, regularized estimation under a smooth Bayesian information criterion (SBIC) is applied, shrinking small or null IPD effects toward zero while estimating all others as nonzero. Bias and relative root mean square error (RMSE) were evaluated for the mean and SD at T2. An empirical example using synthetic longitudinal reading data, applying the trend-estimation approaches, is provided. The results indicate that the regularized estimation with SBIC performed best across conditions, maintaining low bias and RMSE, followed by robust linking methods. Specifically, Haberman linking with the L_0 loss function showed superior performance under unbalanced IPD, outperforming the partial invariance approaches. Concurrent and fixed calibration showed the poorest trend recovery under unbalanced IPD conditions.
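To convey the intuition behind robust linking with different losses, the following toy sketch estimates the trend (mean shift) from paired item difficulties by minimizing an L_p loss over a grid: p = 2 corresponds to an ordinary mean-mean link, while small p down-weights drifted items. It is a simplified stand-in for the Haberman/Haebara procedures compared in the article, with made-up item parameters.

```python
import numpy as np

def robust_shift(b_t1, b_t2, p=0.5):
    """Estimate the trend (latent mean shift) between two time points from paired
    2PL item difficulties by minimizing sum_j |b_t2[j] - b_t1[j] - shift|^p on a grid.

    p = 2 reproduces an ordinary mean-mean link; small p (< 1) down-weights items
    with large drift, which is the intuition behind robust L_p / L_0 linking.
    """
    grid = np.linspace(-3, 3, 6001)
    loss = (np.abs((b_t2 - b_t1)[:, None] - grid[None, :]) ** p).sum(axis=0)
    return grid[loss.argmin()]

rng = np.random.default_rng(7)
b1 = rng.normal(0, 1, 20)
b2 = b1 + 0.30                       # true trend of 0.30 logits
b2[:3] += rng.normal(1.0, 0.2, 3)    # three items with unbalanced positive drift
print(robust_shift(b1, b2, p=2.0))   # pulled away from 0.30 by the drifted items
print(robust_shift(b1, b2, p=0.5))   # close to the true trend
```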
Discriminating Between Attribute, Item-Position, and Wording Effects by the Congeneric and Tau-Equivalent Confirmatory Factor Analysis Models
Pub Date: 2026-03-11 | DOI: 10.1177/00131644261419028
Karl Schweizer, Xuezhu Ren, Tengfei Wang
The capability of confirmatory factor analysis to discriminate common systematic variation of attribute, item-position, and wording effects was investigated using the congeneric and tau-equivalent models. The simulated data, generated according to four approaches, included gradually increasing amounts of item-position or wording effect variation while the amount of attribute variation was kept constant. The congeneric model always indicated good model fit independently of the type and amount of additional common systematic variation, that is, there was no discrimination. In applications of the tau-equivalent model, the increase of the item-position or wording effect variation led to a change from good to bad model fit, that is, there was negative discrimination. In contrast, the additionally considered two-factor tau model discriminated positively. As a consequence of these results, we recommend the pre-screening of data for method effects.
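For readers who want to reproduce the general setup, the snippet below sketches one way to generate data with an item-position effect whose loadings grow over the course of the test. The generation scheme and all parameter values are illustrative assumptions, not the article's exact simulation design.

```python
import numpy as np

def simulate_position_effect(n=500, p=10, attr_load=0.6, pos_max=0.4, seed=0):
    """Generate continuous item scores from one attribute factor plus an
    item-position factor whose loadings grow linearly across the test.
    All values are illustrative, not the article's simulation design.
    """
    rng = np.random.default_rng(seed)
    attribute = rng.normal(size=(n, 1))
    position = rng.normal(size=(n, 1))
    pos_load = np.linspace(0.0, pos_max, p)   # position effect builds up over items
    return attribute * attr_load + position * pos_load + rng.normal(size=(n, p))

data = simulate_position_effect()
# Correlations among later items are inflated by the shared position factor.
print(np.corrcoef(data, rowvar=False).round(2))
```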
Estimation of Conditional Standard Errors of Measurement for MLE Scores in MST
Pub Date: 2026-02-25 | DOI: 10.1177/00131644261420391
Yuanyuan J Stirn, Won-Chan Lee
This paper proposes an information-based analytic method for calculating the conditional standard error of measurement (CSEM) in multistage testing (MST) using maximum likelihood estimation. The accuracy of the proposed method was evaluated by comparing CSEMs computed using the analytic method with those obtained from simulation across four MST designs. The results show that analytic and simulation-based CSEMs converge as test length increases, indicating that the proposed method provides a reliable approximation for longer tests. However, shorter tests and more complex MST designs require additional items to achieve comparable accuracy. The study also compared the proposed method with Park et al.'s analytic approach. Practical implications of the proposed method are discussed.
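The information-based logic can be illustrated at a single ability point: under the 2PL, the CSEM of the maximum likelihood estimate is the inverse square root of the test information accumulated over the items an examinee actually received along their MST route. The sketch below uses made-up item parameters and ignores routing probabilities across modules, which the proposed analytic method accounts for.

```python
import numpy as np

def csem_2pl(theta, a, b, D=1.702):
    """CSEM of the maximum likelihood ability estimate under the 2PL, given the
    items an examinee actually saw along their MST route:
    CSEM(theta) = 1 / sqrt(test information at theta)."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    info = np.sum((D * a) ** 2 * p * (1 - p))
    return 1.0 / np.sqrt(info)

# Routing module plus the second-stage module reached by an examinee at theta = 0.5
# (all item parameters are made up for illustration).
a = np.array([1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.2])
b = np.array([-0.5, 0.0, 0.3, 0.6, 0.4, 0.2, 0.8, 1.0])
print(round(csem_2pl(0.5, a, b), 3))
```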
Misclassification Produced by Rapid-Guessing Identification Methods and Their Suitability Under Various Conditions
Pub Date: 2026-02-23 | DOI: 10.1177/00131644261419426
Santeri Holopainen, Jari Metsämuuronen, Mikko-Jussi Laakso, Janne Kujala
Response Time Threshold Methods (RTTMs) are widely used to identify rapid-guessing behavior (RG) in low-stakes assessments, yet face two key challenges: (a) inevitable misclassifications due to overlapping response time distributions of engaged and disengaged responses, and (b) lack of agreement on which method to use under varying conditions. This simulation study evaluated five RTTMs. Item responses and response times were generated from either a one-component model without RG or a two-component mixture model with RG in the population. Distribution, item, and person parameters were varied. Results showed that when the population contained RG, the mixture lognormal distribution-based method (MLN) was the most robust approach and estimated precise thresholds closest to the time points at which the misclassification rates were minimized, even when bimodality was more difficult to detect. The cumulative proportion method (CUMP) was less robust but also accurate when successful, though less precise. In addition, when the population did not include RG, CUMP was the only method to set thresholds for a notable proportion of cases. The methods were generally more conservative than liberal, though the mixture response time quantile method (MRTQ) was neither. The results are discussed in the light of prior RG research and the methods' characteristics, and future directions are suggested. Ultimately, for practical settings, we recommend a six-step process for RG identification that utilizes both a mixture modeling approach (MLN or MRTQ) and the CUMP method.
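A minimal version of the mixture-lognormal idea fits a two-component normal mixture to log response times and places the threshold where the faster (rapid-guessing) component stops being the more likely generator. The sketch below is a simplified illustration with simulated data, not the exact MLN implementation evaluated in the study.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def mln_threshold(rt):
    """Fit a two-component normal mixture to log response times and return the
    time below which the faster (rapid-guessing) component is the more likely
    generator, i.e., where the components' posterior probabilities cross 0.5."""
    log_rt = np.log(np.asarray(rt)).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(log_rt)
    fast = int(np.argmin(gm.means_.ravel()))
    grid = np.linspace(log_rt.min(), log_rt.max(), 2000).reshape(-1, 1)
    post_fast = gm.predict_proba(grid)[:, fast]
    crossing = grid[np.argmin(np.abs(post_fast - 0.5)), 0]
    return float(np.exp(crossing))

rng = np.random.default_rng(3)
rt = np.concatenate([rng.lognormal(0.5, 0.3, 300),    # rapid guesses (median ~1.6 s)
                     rng.lognormal(2.5, 0.4, 1700)])  # engaged responses (median ~12 s)
print(round(mln_threshold(rt), 2))
```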
From Agreement to Epistemic Alignment: A Signal Detection-Theoretic Model of Inter-Rater Reliability
Pub Date: 2026-02-16 | DOI: 10.1177/00131644261417643
Irene Gianeselli
Inter-rater reliability is commonly assessed using chance-corrected agreement coefficients such as Cohen's κ, which summarize concordance among categorical judgments without modeling the inferential processes that generate them. As a result, κ is sensitive to prevalence imbalance, task difficulty, and heterogeneity in decision criteria and is often misinterpreted as a proxy for diagnostic accuracy or rater competence. This paper reframes inter-rater reliability within a signal detection-theoretic (SDT) framework in which categorical judgments arise from comparisons between latent continuous evidence and rater-specific decision thresholds. Within this generative model, κ can be interpreted as a bounded transformation of discrete strategic variance (i.e., the observable consequence of dispersion in latent decision criteria) rather than as a direct measure of epistemic alignment. To make this structure explicit, we introduce the Strategic Convergence Index (SCI), a normalized functional summarizing convergence in rater decision thresholds under an SDT generative process. SCI is not proposed as a standalone agreement coefficient but as a model-implied quantity whose interpretation depends on explicit assumptions about evidence distributions and decision rules. Monte Carlo simulations show that κ varies systematically with prevalence and perceptual discriminability even when decision-policy alignment is held constant, whereas SCI selectively tracks epistemic alignment and remains invariant to these factors. Supplementary model-based analyses further illustrate that SCI can be recovered as a stable system-level property even under latent-truth uncertainty, whereas individual thresholds may be weakly identified. Together, these results clarify the epistemic meaning of κ and motivate a decomposition of inter-rater reliability into outcome-level agreement and process-level alignment. By linking classical agreement statistics to an explicit generative model of judgment, the Strategic Convergence framework advances reliability assessment from description toward explanation.
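The sensitivity of κ to prevalence under fixed decision policies can be demonstrated with a small SDT simulation: two raters apply constant thresholds to independently noisy evidence, yet κ shifts as the base rate changes. The parameter values below are arbitrary, and the snippet does not compute the SCI itself.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def simulate_kappa(prevalence, d_prime=1.5, thresholds=(0.6, 0.9), n=20000, seed=0):
    """Two raters classify the same cases by comparing independently noisy evidence
    to fixed, rater-specific thresholds (equal-variance SDT). Decision policies are
    held constant, yet Cohen's kappa changes with the base rate."""
    rng = np.random.default_rng(seed)
    truth = rng.random(n) < prevalence
    signal = np.where(truth, d_prime, 0.0)
    r1 = signal + rng.normal(0.0, 1.0, n) > thresholds[0]
    r2 = signal + rng.normal(0.0, 1.0, n) > thresholds[1]
    return cohen_kappa_score(r1, r2)

for prev in (0.5, 0.2, 0.05):
    print(prev, round(simulate_kappa(prev), 3))
```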
On the Consistency of Automatic Scoring with Large Language Models
Pub Date: 2026-02-16 | DOI: 10.1177/00131644261418138
Mingfeng Xue, Xingyao Xiao, Yunting Liu, Mark Wilson
Large language models (LLMs) have shown great potential in automatic scoring. However, due to model characteristics and variation in training materials and pipelines, scoring inconsistency can exist within an LLM and across LLMs when rating the same response multiple times. This study investigates the intra-LLM and inter-LLM consistency in scoring with five LLMs (i.e., Claude, DeepSeek, Gemini, GPT, and Qwen), variability under different temperatures, and their relationship with scoring accuracy. Moreover, a voting strategy that assembles information from different LLMs was proposed to address inconsistent scoring. Using constructed-response items from a science education assessment and open-source data from the Automated Student Assessment Prize (ASAP), we find that: (a) LLMs generally exhibited almost perfect intra-LLM consistency regardless of temperature; (b) inter-LLM consistency was moderate, with higher agreement observed for items that were easier to score; (c) intra-LLM consistency consistently exceeded inter-LLM consistency, supporting the expectation that within-model consistency represents an upper bound for cross-model agreement; (d) intra-LLM consistency was not associated with scoring accuracy, whereas inter-LLM consistency showed a strong positive relationship with accuracy; and (e) majority voting across LLMs improved scoring accuracy by leveraging complementary strengths of different models.
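The voting strategy can be illustrated with a small helper that takes the item scores assigned by several LLMs and returns the majority label; the median tie-break shown here is an assumption for illustration, not necessarily the rule used in the study.

```python
from collections import Counter
from statistics import median

def vote(scores):
    """Aggregate the scores several LLMs assigned to one response by majority vote;
    break ties with the median so the result stays on the rubric scale."""
    top = Counter(scores).most_common()
    if len(top) == 1 or top[0][1] > top[1][1]:
        return top[0][0]
    return int(round(median(scores)))

print(vote([2, 2, 3, 2, 1]))  # clear majority -> 2
print(vote([1, 2, 2, 3, 3]))  # tie between 2 and 3 -> median tie-break -> 2
```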
Comparing Different Approaches of (Not) Accounting for Rapid Guessing in Plausible Values Estimation
Pub Date: 2026-01-13 | DOI: 10.1177/00131644251395590
Jana Welling, Eva Zink, Timo Gnambs
Educational large-scale assessments provide information on ability differences between groups, informing policies and shaping educational decisions. However, some of these differences might partly reflect variations in test-taking motivation rather than in actual abilities. Existing approaches for mitigating the distorting effects of rapid guessing focus mainly on point estimates of abilities, although research questions often refer to latent variables. The present study seeks to (a) determine the bias introduced by rapid guessing in group comparisons based on plausible value estimates and (b) introduce and evaluate different approaches of handling rapid guessing in the estimation of plausible values. In a simulation study, four models were compared: (1) a baseline model did not account for rapid guessing, (2) a person-level model incorporated rapid guessing as a respondent characteristic in the background model, (3) a response-level model filtered responses with item response times lower than a predetermined threshold, and (4) a combined model merged the person- and response-level approaches. Results show that the response-level and combined model performed best while accounting for rapid guessing on the person level did not suffice. An empirical example using data from a German large-scale assessment (N = 478) demonstrates the applicability of all approaches in practice. Recommendations for future research are given to improve ability estimation.
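The response-level approach amounts to treating responses faster than an item-specific threshold as not administered before plausible values are drawn. The sketch below shows only this filtering step, with hypothetical thresholds; the background model and the PV estimation itself are omitted.

```python
import numpy as np

def filter_rapid_guesses(responses, rt, thresholds):
    """Set responses faster than an item-specific RT threshold to missing so they
    are treated as not administered in the subsequent measurement model.

    responses, rt : persons x items arrays; thresholds : per-item cutoffs in seconds.
    """
    filtered = responses.astype(float)
    filtered[rt < np.asarray(thresholds)[None, :]] = np.nan
    return filtered

resp = np.array([[1, 0, 1], [1, 1, 0]])
rt = np.array([[12.4, 1.1, 9.8], [8.0, 7.5, 0.9]])
print(filter_rapid_guesses(resp, rt, thresholds=[3.0, 3.0, 3.0]))
```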
Consistent Factor Score Regression: A Better Alternative for Uncorrected Factor Score Regression?
Pub Date: 2026-01-04 | DOI: 10.1177/00131644251399588
Jasper Bogaert, Wen Wei Loh, Yves Rosseel
Researchers in the behavioral, educational, and social sciences often aim to analyze relationships among latent variables. Structural equation modeling (SEM) is widely regarded as the gold standard for this purpose. A straightforward alternative for estimating the structural model parameters is uncorrected factor score regression (UFSR), where factor scores are first computed and then employed in regression or path analysis. Unfortunately, the most commonly used factor scores (i.e., Regression and Bartlett factor scores) may yield biased estimates and invalid inferences when using this approach. In recent years, factor score regression (FSR) has enjoyed several methodological advancements to address this inconsistency. Despite these advancements, the use of FSR with correlation-preserving factor scores, here termed consistent factor score regression (cFSR), has received limited attention. In this paper, we revisit cFSR and compare its advantages and disadvantages relative to other recent FSR and SEM methods. We conducted an extensive simulation study comparing cFSR with other estimation approaches, assessing their performance in terms of convergence rate, bias, efficiency, and type I error rate. The findings indicate that cFSR outperforms UFSR while maintaining the conceptual simplicity of UFSR. We encourage behavioral, educational, and social science researchers to avoid UFSR and adopt cFSR as an alternative to SEM.
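For orientation, the two factor score types mentioned above have closed-form weight matrices given the loadings, unique variances, and latent covariances. The sketch below computes them for a made-up two-factor model; the additional transformation that makes scores correlation-preserving, as used in cFSR, is not shown.

```python
import numpy as np

def factor_score_weights(lam, psi, phi):
    """Weight matrices for regression and Bartlett factor scores (scores = Y @ W).

    lam : p x q loading matrix
    psi : length-p vector of unique (residual) variances
    phi : q x q latent covariance matrix
    """
    theta = np.diag(psi)
    sigma = lam @ phi @ lam.T + theta                 # model-implied indicator covariance
    w_regression = np.linalg.solve(sigma, lam @ phi)  # Sigma^{-1} Lambda Phi
    theta_inv = np.diag(1.0 / psi)
    w_bartlett = theta_inv @ lam @ np.linalg.inv(lam.T @ theta_inv @ lam)
    return w_regression, w_bartlett

lam = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .8], [0, .7], [0, .6]])
psi = 1 - (lam ** 2).sum(axis=1)
phi = np.array([[1.0, .4], [.4, 1.0]])
w_reg, w_bart = factor_score_weights(lam, psi, phi)
print(w_reg.round(3), w_bart.round(3), sep="\n\n")
```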