Using Item Scores and Response Times to Detect Item Compromise in Computerized Adaptive Testing.
Pub Date: 2025-09-14 | DOI: 10.1177/00131644251368335
Chansoon Lee, Kylie Gorney, Jianshen Chen
Sequential procedures have been shown to be effective methods for real-time detection of compromised items in computerized adaptive testing. In this study, we propose three item response theory-based sequential procedures that involve the use of item scores and response times (RTs). The first procedure requires that either the score-based statistic or the RT-based statistic be extreme, the second procedure requires that both the score-based statistic and the RT-based statistic be extreme, and the third procedure requires that a combined score and RT-based statistic be extreme. Results suggest that the third procedure is the most promising, providing a reasonable balance between the false-positive rate and the true-positive rate while also producing relatively short lag times across a wide range of simulation conditions.
{"title":"Using Item Scores and Response Times to Detect Item Compromise in Computerized Adaptive Testing.","authors":"Chansoon Lee, Kylie Gorney, Jianshen Chen","doi":"10.1177/00131644251368335","DOIUrl":"10.1177/00131644251368335","url":null,"abstract":"<p><p>Sequential procedures have been shown to be effective methods for real-time detection of compromised items in computerized adaptive testing. In this study, we propose three item response theory-based sequential procedures that involve the use of item scores and response times (RTs). The first procedure requires that either the score-based statistic or the RT-based statistic be extreme, the second procedure requires that both the score-based statistic and the RT-based statistic be extreme, and the third procedure requires that a combined score and RT-based statistic be extreme. Results suggest that the third procedure is the most promising, providing a reasonable balance between the false-positive rate and the true-positive rate while also producing relatively short lag times across a wide range of simulation conditions.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251368335"},"PeriodicalIF":2.3,"publicationDate":"2025-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12433998/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145074512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dimensionality Assessment in Forced-Choice Questionnaires: First Steps Toward an Exploratory Framework.
Pub Date: 2025-09-08 | DOI: 10.1177/00131644251358226
Diego F Graña, Rodrigo S Kreitchmann, Miguel A Sorrel, Luis Eduardo Garrido, Francisco J Abad
Forced-choice (FC) questionnaires have gained increasing attention as a strategy to reduce social desirability in self-reports, supported by advancements in confirmatory models that address the ipsativity of FC test scores. However, these models assume a known dimensionality and structure, which can be overly restrictive or fail to fit the data adequately. Consequently, exploratory models can be required, with accurate dimensionality assessment as a critical first step. FC questionnaires also pose unique challenges for dimensionality assessment, due to their inherently complex multidimensional structures. Despite this, no prior studies have systematically evaluated dimensionality assessment methods for FC data. To fill this gap, the present study examines five commonly used methods: the Kaiser Criterion, Empirical Kaiser Criterion, Parallel Analysis (PA), Hull Method, and Exploratory Graph Analysis. A Monte Carlo simulation study was conducted, manipulating key design features of FC questionnaires, such as the number of dimensions, items per dimension, response formats (e.g., binary vs. graded), and block composition (e.g., inclusion of heteropolar and unidimensional blocks), as well as factor loadings, inter-factor correlations, and sample size. Results showed that the Empirical Kaiser Criterion and PA methods outperformed the others, achieving higher accuracy and lower bias. Performance improved particularly when heteropolar or unidimensional blocks were included or when the questionnaire length increased. These findings emphasize the importance of thoughtful FC test design and provide practical recommendations for improving dimensionality assessment in this format.
{"title":"Dimensionality Assessment in Forced-Choice Questionnaires: First Steps Toward an Exploratory Framework.","authors":"Diego F Graña, Rodrigo S Kreitchmann, Miguel A Sorrel, Luis Eduardo Garrido, Francisco J Abad","doi":"10.1177/00131644251358226","DOIUrl":"10.1177/00131644251358226","url":null,"abstract":"<p><p>Forced-choice (FC) questionnaires have gained increasing attention as a strategy to reduce social desirability in self-reports, supported by advancements in confirmatory models that address the ipsativity of FC test scores. However, these models assume a known dimensionality and structure, which can be overly restrictive or fail to fit the data adequately. Consequently, exploratory models can be required, with accurate dimensionality assessment as a critical first step. FC questionnaires also pose unique challenges for dimensionality assessment, due to their inherently complex multidimensional structures. Despite this, no prior studies have systematically evaluated dimensionality assessment methods for FC data. To fill this gap, the present study examines five commonly used methods: the Kaiser Criterion, Empirical Kaiser Criterion, Parallel Analysis (PA), Hull Method, and Exploratory Graph Analysis. A Monte Carlo simulation study was conducted, manipulating key design features of FC questionnaires, such as the number of dimensions, items per dimension, response formats (e.g., binary vs. graded), and block composition (e.g., inclusion of heteropolar and unidimensional blocks), as well as factor loadings, inter-factor correlations, and sample size. Results showed that the Maximal Kaiser Criterion and PA methods outperformed the others, achieving higher accuracy and lower bias. Performance improved particularly when heteropolar or unidimensional blocks were included or when the questionnaire length increased. These findings emphasize the importance of thoughtful FC test design and provide practical recommendations for improving dimensionality assessment in this format.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251358226"},"PeriodicalIF":2.3,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12420653/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145039408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing Calibration Bias for Person Fit Assessment by Mixture Model Expansion.
Pub Date: 2025-09-06 | DOI: 10.1177/00131644251364252
Johan Braeken, Saskia van Laar
Measurement appropriateness concerns the question of whether the test or survey scale under consideration can provide a valid measure for a specific individual. An aberrant item response pattern would provide internal counterevidence against using the test/scale for this person, whereas a more typical item response pattern would imply a fit of the measure to the person. Traditional approaches, including the popular Lz person fit statistic, are hampered by their two-stage estimation procedure and the fact that the fit for the person is determined based on the model calibrated on data that include the misfitting persons. This calibration bias creates suboptimal conditions for person fit assessment. Solutions have been sought through the derivation of approximating bias-correction formulas and/or iterative purification procedures. Yet, here we discuss an alternative one-stage solution that involves calibrating a model expansion of the measurement model that includes a mixture component for target aberrant response patterns. A simulation study evaluates the approach under the most unfavorable and least-studied conditions for person fit indices: short polytomous survey scales similar to those found in large-scale educational assessments such as the Program for International Student Assessment or the Trends in International Mathematics and Science Study.
{"title":"Reducing Calibration Bias for Person Fit Assessment by Mixture Model Expansion.","authors":"Johan Braeken, Saskia van Laar","doi":"10.1177/00131644251364252","DOIUrl":"10.1177/00131644251364252","url":null,"abstract":"<p><p>Measurement appropriateness concerns the question of whether the test or survey scale under consideration can provide a valid measure for a specific individual. An aberrant item response pattern would provide internal counterevidence against using the test/scale for this person, whereas a more typical item response pattern would imply a fit of the measure to the person. Traditional approaches, including the popular Lz person fit statistic, are hampered by their two-stage estimation procedure and the fact that the fit for the person is determined based on the model calibrated on data that include the misfitting persons. This calibration bias creates suboptimal conditions for person fit assessment. Solutions have been sought through the derivation of approximating bias-correction formulas and/or iterative purification procedures. Yet, here we discuss an alternative one-stage solution that involves calibrating a model expansion of the measurement model that includes a mixture component for target aberrant response patterns. A simulation study evaluates the approach under the most unfavorable and least-studied conditions for person fit indices, short polytomous survey scales, similar to those found in large-scale educational assessments such as the Program for International Student Assessment or Trends in Mathematics and Science Study.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251364252"},"PeriodicalIF":2.3,"publicationDate":"2025-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12413990/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145023055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proportion Explained Component Variance in Second-Order Scales: A Note on a Latent Variable Modeling Approach.
Pub Date: 2025-08-23 | DOI: 10.1177/00131644251350536
Tenko Raykov, Christine DiStefano, Yusuf Ransome
A procedure is outlined for evaluating the proportion of component variance explained by the underlying trait in behavioral scales with a second-order structure. The resulting index of variance accounted for across all scale components is a useful and informative complement to the conventional omega-hierarchical coefficient as well as to the proportion of explained component correlation. A point and interval estimation method is described for the discussed index, which utilizes a confirmatory factor analysis approach within the latent variable modeling methodology. The procedure can be used with widely available software and is illustrated on data.
{"title":"Proportion Explained Component Variance in Second-Order Scales: A Note on a Latent Variable Modeling Approach.","authors":"Tenko Raykov, Christine DiStefano, Yusuf Ransome","doi":"10.1177/00131644251350536","DOIUrl":"https://doi.org/10.1177/00131644251350536","url":null,"abstract":"<p><p>A procedure for evaluation of the proportion explained component variance by the underlying trait in behavioral scales with second-order structure is outlined. The resulting index of accounted for variance over all scale components is a useful and informative complement to the conventional omega-hierarchical coefficient as well as the proportion of explained component correlation. A point and interval estimation method is described for the discussed index, which utilizes a confirmatory factor analysis approach within the latent variable modeling methodology. The procedure can be used with widely available software and is illustrated on data.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251350536"},"PeriodicalIF":2.3,"publicationDate":"2025-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12374956/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144946890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to Improve the Regression Factor Score Predictor When Individuals Have Different Factor Loadings.
Pub Date: 2025-08-15 | DOI: 10.1177/00131644251347530
André Beauducel, Norbert Hilger, Anneke C Weide
Previous research has shown that ignoring individual differences in factor loadings in conventional factor models may reduce the determinacy of factor score predictors. Therefore, the aim of the present study is to propose a heterogeneous regression factor score predictor (HRFS) with larger determinacy than the conventional regression factor score predictor (RFS) when individuals have different factor loadings. First, a method for the estimation of individual loadings is proposed. The individual loading estimates are used to compute the HRFS. Then, a binomial test for loading heterogeneity of a factor is proposed to compute the HRFS only when the test is significant. Otherwise, the conventional RFS should be used. A simulation study reveals that the HRFS has larger determinacy than the conventional RFS in populations with substantial loading heterogeneity. An empirical example based on subsamples drawn randomly from a large sample of Big Five Markers indicates that the determinacy can be improved for the factor emotional stability when the HRFS is computed.
{"title":"How to Improve the Regression Factor Score Predictor When Individuals Have Different Factor Loadings.","authors":"André Beauducel, Norbert Hilger, Anneke C Weide","doi":"10.1177/00131644251347530","DOIUrl":"10.1177/00131644251347530","url":null,"abstract":"<p><p>Previous research has shown that ignoring individual differences of factor loadings in conventional factor models may reduce the determinacy of factor score predictors. Therefore, the aim of the present study is to propose a heterogeneous regression factor score predictor (HRFS) with larger determinacy than the conventional regression factor score predictor (RFS) when individuals have different factor loadings. First, a method for the estimation of individual loadings is proposed. The individual loading estimates are used to compute the HRFS. Then, a binomial test for loading heterogeneity of a factor is proposed to compute the HRFS only when the test is significant. Otherwise, the conventional RFS should be used. A simulation study reveals that the HRFS has larger determinacy than the conventional RFS in populations with substantial loading heterogeneity. An empirical example based on subsamples drawn randomly from a large sample of Big Five Markers indicates that the determinacy can be improved for the factor emotional stability when the HRFS is computed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251347530"},"PeriodicalIF":2.3,"publicationDate":"2025-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356820/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144872005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Comparison of LTA Models with and Without Residual Correlation in Estimating Transition Probabilities.
Pub Date: 2025-08-14 | DOI: 10.1177/00131644251358530
Na Yeon Lee, Sojin Yoon, Sehee Hong
In longitudinal mixture models such as latent transition analysis (LTA), identical items are often measured repeatedly across multiple time points to define latent classes, and individuals' similar response patterns across those time points give rise to residual correlations. Therefore, this study hypothesized that an LTA model assuming residual correlations among indicator variables measured repeatedly across multiple time points would provide more accurate estimates of transition probabilities than a traditional LTA model. To test this hypothesis, a Monte Carlo simulation was conducted to generate data both with and without specified residual correlations among the repeatedly measured indicator variables, and the two LTA models (one that accounted for residual correlations and one that did not) were compared. This study included transition probabilities, numbers of indicator variables, sample sizes, and levels of residual correlations as the simulation conditions. The estimation performances were compared based on parameter estimate bias, mean squared error, and coverage. The results demonstrate that LTA with residual correlations outperforms traditional LTA in estimating transition probabilities, and the differences between the two models become prominent when the residual correlation is .3 or higher. This research integrates the characteristics of longitudinal data in an LTA simulation study and suggests an improved version of LTA estimation.
{"title":"A Comparison of LTA Models with and Without Residual Correlation in Estimating Transition Probabilities.","authors":"Na Yeon Lee, Sojin Yoon, Sehee Hong","doi":"10.1177/00131644251358530","DOIUrl":"10.1177/00131644251358530","url":null,"abstract":"<p><p>In longitudinal mixture models like latent transition analysis (LTA), identical items are often repeatedly measured across multiple time points to define latent classes and individuals' similar response patterns across multiple time points, which attributes to residual correlations. Therefore, this study hypothesized that an LTA model assuming residual correlations among indicator variables measured repeatedly across multiple time points would provide more accurate estimates of transition probabilities than a traditional LTA model. To test this hypothesis, a Monte Carlo simulation was conducted to generate data both with and without specified residual correlations among the repeatedly measured indicator variables, and the two LTA models-one that accounted for residual correlations and one that did not-were compared. This study included transition probabilities, numbers of indicator variables, sample sizes, and levels of residual correlations as the simulation conditions. The estimation performances were compared based on parameter estimate bias, mean squared error, and coverage. The results demonstrate that LTA with residual correlations outperforms traditional LTA in estimating transition probabilities, and the differences between the two models become prominent when the residual correlation is .3 or higher. This research integrates the characteristics of longitudinal data in an LTA simulation study and suggests an improved version of LTA estimation.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251358530"},"PeriodicalIF":2.3,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356818/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144872004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Dominant Trait Profile Method of Scoring Multidimensional Forced-Choice Questionnaires.
Pub Date: 2025-08-14 | DOI: 10.1177/00131644251360386
Dimiter M Dimitrov
Proposed is a new method of scoring multidimensional forced-choice (MFC) questionnaires referred to as the dominant trait profile (DTP) method. The DTP method identifies a dominant response vector (DRV) for each trait: a vector of binary scores for preferences in item pairs within MFC blocks from the perspective of a respondent for whom the trait under consideration dominates over the other traits being measured. The respondents' observed response vectors are matched to the DRV for each trait to produce (1/0) matching scores that are then analyzed via latent trait modeling, with scaling options (a) bounded D-scale (from 0 to 1), or (b) item response theory logit scale. The DTP method allows for the comparison of individuals on a trait of interest, as well as their standing in relation to a dominant trait "standard" (criterion). The study results indicate that DTP-based trait estimates are highly correlated with those produced by the popular Thurstonian item response theory model and the Zinnes and Griggs pairwise preference item response theory model, while avoiding the complexity of their designs and some computational issues.
{"title":"The Dominant Trait Profile Method of Scoring Multidimensional Forced-Choice Questionnaires.","authors":"Dimiter M Dimitrov","doi":"10.1177/00131644251360386","DOIUrl":"10.1177/00131644251360386","url":null,"abstract":"<p><p>Proposed is a new method of scoring multidimensional forced-choice (MFC) questionnaires referred to as the dominant trait profile (DTP) method. The DTP method identifies a dominant response vector (DRV) for each trait-a vector of binary scores for preferences in item pairs within MFC blocks from the perspective of a respondent for whom the trait under consideration dominates over the other traits being measured. The respondents' observed response vectors are matched to the DRV for each trait to produce (1/0) matching scores that are then analyzed via latent trait modeling, with scaling options (a) bounded D-scale (from 0 to 1), or (b) item response theory logit scale. The DTP method allows for the comparison of individuals on a trait of interest, as well as their standing in relation to a dominant trait \"standard\" (criterion). The study results indicate that DTP-based trait estimates are highly correlated with those produced by the popular Thurstonian item response theory model and the Zinnes and Griggs pairwise preference item response theory model, while avoiding the complexity of their designs and some computations issues.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251360386"},"PeriodicalIF":2.3,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356822/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144872007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human Expertise and Large Language Model Embeddings in the Content Validity Assessment of Personality Tests.
Pub Date: 2025-08-14 | DOI: 10.1177/00131644251355485
Nicola Milano, Michela Ponticorvo, Davide Marocco
In this article, we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
{"title":"Human Expertise and Large Language Model Embeddings in the Content Validity Assessment of Personality Tests.","authors":"Nicola Milano, Michela Ponticorvo, Davide Marocco","doi":"10.1177/00131644251355485","DOIUrl":"10.1177/00131644251355485","url":null,"abstract":"<p><p>In this article, we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251355485"},"PeriodicalIF":2.3,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356817/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144872006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The One-Parameter Logistic Model Can Be True With Zero Probability for a Unidimensional Measuring Instrument: How One Could Go Wrong Removing Items Not Satisfying the Model.
Pub Date: 2025-08-06 | DOI: 10.1177/00131644251345120
Tenko Raykov, Bingsheng Zhang
This note is concerned with the chance of the one-parameter logistic (1PL) model or the Rasch model being true for a unidimensional multi-item measuring instrument. It is pointed out that if a single dimension underlies a scale consisting of dichotomous items, then the probability of either model being correct for that scale can be zero. The question is then addressed of what the consequences could be of removing items that do not follow these models. Using a large number of simulated data sets, a pair of empirically relevant settings is presented where such item elimination can be problematic. Specifically, dropping items from a unidimensional instrument because they do not satisfy the 1PL model, or the Rasch model, can yield potentially seriously misleading ability estimates with increased standard errors and prediction error with respect to the latent trait. Implications for educational and behavioral research are discussed.
{"title":"The One-Parameter Logistic Model Can Be True With Zero Probability for a Unidimensional Measuring Instrument: How One Could Go Wrong Removing Items Not Satisfying the Model.","authors":"Tenko Raykov, Bingsheng Zhang","doi":"10.1177/00131644251345120","DOIUrl":"10.1177/00131644251345120","url":null,"abstract":"<p><p>This note is concerned with the chance of the one-parameter logistic (1PL-) model or the Rasch model being true for a unidimensional multi-item measuring instrument. It is pointed out that if a single dimension underlies a scale consisting of dichotomous items, then the probability of either model being correct for that scale can be zero. The question is then addressed, what the consequences could be of removing items not following these models. Using a large number of simulated data sets, a pair of empirically relevant settings is presented where such item elimination can be problematic. Specifically, dropping items from a unidimensional instrument due to them not satisfying the 1PL-model, or the Rasch model, can yield potentially seriously misleading ability estimates with increased standard errors and prediction error with respect to the latent trait. Implications for educational and behavioral research are discussed.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251345120"},"PeriodicalIF":2.3,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12328337/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144816062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model-Based Person Fit Statistics Applied to the Wechsler Adult Intelligence Scale IV.
Pub Date: 2025-08-03 | DOI: 10.1177/00131644251339444
Jared M Block, Steven P Reise, Keith F Widaman, Amanda K Montoya, David W Loring, Laura Glass Umfleet, Russell M Bauer, Joseph M Gullett, Brittany Wolff, Daniel L Drane, Kristen Enriquez, Robert M Bilder
An important task in clinical neuropsychology is to evaluate whether scores obtained on a test battery, such as the Wechsler Adult Intelligence Scale Fourth Edition (WAIS-IV), can be considered "credible" or "valid" for a particular patient. Such evaluations are typically made based on responses to performance validity tests (PVTs). As a complement to PVTs, we propose that WAIS-IV profiles also be evaluated using a residual-based M-distance (d_ri^2) person fit statistic. Large d_ri^2 values flag profiles that are inconsistent with the factor analytic model underlying the interpretation of test scores. We first established a well-fitting model with four correlated factors for 10 core WAIS-IV subtests derived from the standardization sample. Based on this model, we then performed a Monte Carlo simulation to evaluate whether a hypothesized sampling distribution for d_ri^2 was accurate and whether d_ri^2 was computable, under different degrees of missing subtest scores. We found that when the number of subtests administered was less than 8, d_ri^2 could not be computed around 25% of the time. When computable, d_ri^2 conformed to a χ² distribution with degrees of freedom equal to the number of tests minus the number of factors. Demonstration of the d_ri^2 index in a large sample of clinical cases was also provided. Findings highlight the potential utility of the d_ri^2 index as an adjunct to PVTs, offering clinicians an additional method to evaluate WAIS-IV test profiles and improve the accuracy of neuropsychological evaluations.
{"title":"Model-Based Person Fit Statistics Applied to the Wechsler Adult Intelligence Scale IV.","authors":"Jared M Block, Steven P Reise, Keith F Widaman, Amanda K Montoya, David W Loring, Laura Glass Umfleet, Russell M Bauer, Joseph M Gullett, Brittany Wolff, Daniel L Drane, Kristen Enriquez, Robert M Bilder","doi":"10.1177/00131644251339444","DOIUrl":"10.1177/00131644251339444","url":null,"abstract":"<p><p>An important task in clinical neuropsychology is to evaluate whether scores obtained on a test battery, such as the Wechsler Adult Intelligence Scale Fourth Edition (WAIS-IV), can be considered \"credible\" or \"valid\" for a particular patient. Such evaluations are typically made based on responses to performance validity tests (PVTs). As a complement to PVTs, we propose that WAIS-IV profiles also be evaluated using a residual-based M-distance ( <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> ) person fit statistic. Large <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> values flag profiles that are inconsistent with the factor analytic model underlying the interpretation of test scores. We first established a well-fitting model with four correlated factors for 10 core WAIS-IV subtests derived from the standardization sample. Based on this model, we then performed a Monte Carlo simulation to evaluate whether a hypothesized sampling distribution for <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> was accurate and whether <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> was computable, under different degrees of missing subtest scores. We found that when the number of subtests administered was less than 8, <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> could not be computed around 25% of the time. When computable, <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> conformed to a <math> <mrow> <msup><mrow><mi>χ</mi></mrow> <mrow><mn>2</mn></mrow> </msup> </mrow> </math> distribution with degrees of freedom equal to the number of tests minus the number of factors. Demonstration of the <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> index in a large sample of clinical cases was also provided. 
Findings highlight the potential utility of the <math> <mrow> <msubsup><mrow><mi>d</mi></mrow> <mrow><mi>ri</mi></mrow> <mrow><mn>2</mn></mrow> </msubsup> </mrow> </math> index as an adjunct to PVTs, offering clinicians an additional method to evaluate WAIS-IV test profiles and improve the accuracy of neuropsychological evaluations.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251339444"},"PeriodicalIF":2.3,"publicationDate":"2025-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12321812/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144793789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
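A generic residual-based M-distance can be sketched as follows: form Bartlett-type factor scores, take the profile residuals, and standardize them by their (singular) covariance, which under the model yields a chi-square with p - k degrees of freedom, matching the 10-subtests-minus-4-factors value mentioned above. The loading structure is invented, and the exact d_ri^2 defined in the paper may differ in detail from this construction.

```python
import numpy as np

rng = np.random.default_rng(6)
p, k = 10, 4                                 # 10 subtests, 4 factors (as in WAIS-IV)
lam = np.zeros((p, k))
lam[np.arange(p), np.arange(p) % k] = 0.7    # simple-structure loadings (illustrative)
phi = np.full((k, k), 0.5); np.fill_diagonal(phi, 1.0)
sigma = lam @ phi @ lam.T
psi = 1.0 - np.diag(sigma)                   # uniquenesses of standardized subtests
sigma += np.diag(psi)

def m_distance(x):
    """Residual-based Mahalanobis distance; ~ chi2(p - k) under the model."""
    w = np.linalg.inv(lam.T @ (lam / psi[:, None])) @ (lam / psi[:, None]).T
    m = np.eye(p) - lam @ w                  # idempotent residual-maker, rank p - k
    r = m @ x                                # profile residuals
    cov_r = m @ sigma @ m.T                  # singular residual covariance
    return r @ np.linalg.pinv(cov_r) @ r

x_ok = rng.multivariate_normal(np.zeros(p), sigma)   # model-consistent profile
x_odd = 2.0 * rng.standard_normal(p)                 # profile ignoring the model
print(round(m_distance(x_ok), 2), round(m_distance(x_odd), 2))  # vs chi2(6)
```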