Daniel F. McCaffrey, Jodi M. Casabianca, Matthew S. Johnson
Use of artificial intelligence (AI) to score responses is growing in popularity and likely to increase. Evidence for the validity of AI scores commonly relies on quadratic weighted kappa (QWK) to demonstrate agreement between AI scores and human ratings. QWK is a measure of agreement that accounts for chance agreement and the ordinality of the data by giving greater weight to larger disagreements. It has known shortcomings, including sensitivity to the reliability of the human ratings. The proportional reduction in mean squared error (PRMSE) measures agreement between predictions and their target while accounting for measurement error in the target; for example, it can quantify the accuracy of an automated scoring model with respect to predicting the human true scores rather than the observed ratings. Extensive simulation study results show that PRMSE is robust to many factors to which QWK is sensitive, such as human rater reliability, skew in the data, and the number of score points. Analysis of operational test data demonstrates that QWK and PRMSE can lead to different conclusions about AI scores. We investigate sample size requirements for accurate estimation of PRMSE in the context of AI scoring, although the results could apply more generally to measures with distributions similar to those tested in our study.
Measuring the Accuracy of True Score Predictions for AI Scoring Evaluation. Journal of Educational Measurement, 62(4), 763–786. https://doi.org/10.1111/jedm.70011 (published October 12, 2025).
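The two agreement measures named in this abstract can be made concrete with a short sketch. The QWK formula below is standard; the PRMSE estimator is a simplified version that assumes two interchangeable human ratings per response, and it is not necessarily the estimator used in the article.

```python
import numpy as np

def quadratic_weighted_kappa(ai, human, n_cats):
    """QWK between AI scores and human ratings coded as integers 0..n_cats-1."""
    ai, human = np.asarray(ai, int), np.asarray(human, int)
    obs = np.zeros((n_cats, n_cats))
    for a, h in zip(ai, human):
        obs[a, h] += 1
    obs /= obs.sum()
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0))  # chance agreement
    i, j = np.indices((n_cats, n_cats))
    w = (i - j) ** 2                                        # quadratic disagreement weights
    return 1.0 - (w * obs).sum() / (w * expected).sum()

def prmse_true_score(ai, h1, h2):
    """Simplified PRMSE of AI scores as predictors of the human true score,
    assuming two interchangeable human ratings (h1, h2) per response."""
    ai, h1, h2 = map(np.asarray, (ai, h1, h2))
    h_bar = (h1 + h2) / 2
    var_err = np.mean((h1 - h2) ** 2) / 2                 # rater error variance
    var_true = np.var(h_bar) - var_err / 2                # variance of the human true score
    mse_true = np.mean((ai - h_bar) ** 2) - var_err / 2   # MSE against the true score
    return 1.0 - mse_true / var_true
```

Because the rater error variance is subtracted from both the mean squared error and the target variance, PRMSE evaluates the AI scores against the (unobserved) true score rather than against noisy individual ratings.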
Jing Huang, Yuxiao Zhang, Jason W. Morphew, Jayson M. Nissen, Ben Van Dusen, Hua Hua Chang
Online calibration estimates new item parameters alongside previously calibrated items, supporting efficient item replenishment. However, most existing online calibration procedures for Cognitive Diagnostic Computerized Adaptive Testing (CD-CAT) lack mechanisms to ensure content balance during live testing. This limitation can lead to uneven content coverage, potentially undermining the alignment with instructional goals. This research extends the current calibration framework by integrating a two-phase test design with a content-balancing item selection method into the online calibration procedure. Simulation studies evaluated item parameter recovery and attribute profile estimation accuracy under the proposed procedure. Results indicated that the developed procedure yielded more accurate new item parameter estimates. The procedure also maintained content representativeness under both balanced and unbalanced constraints. Attribute profile estimation was sensitive to item parameter values. Accuracy declined when items had larger parameter values. Calibration improved with larger sample sizes and smaller parameter values. Longer test lengths contributed more to profile estimation than to new item calibration. These findings highlight design trade-offs in adaptive item replenishment and suggest new directions for hybrid calibration methods.
{"title":"Two-Phase Content-Balancing CD-CAT Online Item Calibration","authors":"Jing Huang, Yuxiao Zhang, Jason W. Morphew, Jayson M. Nissen, Ben Van Dusen, Hua Hua Chang","doi":"10.1111/jedm.70012","DOIUrl":"https://doi.org/10.1111/jedm.70012","url":null,"abstract":"<p>Online calibration estimates new item parameters alongside previously calibrated items, supporting efficient item replenishment. However, most existing online calibration procedures for Cognitive Diagnostic Computerized Adaptive Testing (CD-CAT) lack mechanisms to ensure content balance during live testing. This limitation can lead to uneven content coverage, potentially undermining the alignment with instructional goals. This research extends the current calibration framework by integrating a two-phase test design with a content-balancing item selection method into the online calibration procedure. Simulation studies evaluated item parameter recovery and attribute profile estimation accuracy under the proposed procedure. Results indicated that the developed procedure yielded more accurate new item parameter estimates. The procedure also maintained content representativeness under both balanced and unbalanced constraints. Attribute profile estimation was sensitive to item parameter values. Accuracy declined when items had larger parameter values. Calibration improved with larger sample sizes and smaller parameter values. Longer test lengths contributed more to profile estimation than to new item calibration. These findings highlight design trade-offs in adaptive item replenishment and suggest new directions for hybrid calibration methods.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 4","pages":"787-808"},"PeriodicalIF":1.6,"publicationDate":"2025-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.70012","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145761479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
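As a rough illustration of content-balanced item selection (not the two-phase procedure developed in the article), the sketch below restricts each selection to the content area currently furthest below its target proportion and then picks the most informative candidate in that area; all function and variable names are illustrative.

```python
def select_next_item(selection_index, item_area, administered_areas, target_props):
    """Pick the next item from the most under-represented content area.
    selection_index: dict item_id -> informativeness (e.g., a posterior-weighted index)
    item_area:       dict item_id -> content area label
    administered_areas: list of area labels for items already administered
    target_props:    dict area -> target proportion of the test
    """
    n_given = max(len(administered_areas), 1)
    deficit = {area: target - administered_areas.count(area) / n_given
               for area, target in target_props.items()}      # positive = under-represented
    for area in sorted(deficit, key=deficit.get, reverse=True):
        in_area = {i: v for i, v in selection_index.items() if item_area[i] == area}
        if in_area:                                            # most informative item in that area
            return max(in_area, key=in_area.get)
    return max(selection_index, key=selection_index.get)       # fallback: best overall
```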
This study considers the estimation of marginal reliability and conditional accuracy measures using a generalized recursion procedure with several IRT-based ability and score estimators. The estimators include MLE, TCC, and EAP abilities, and corresponding test scores obtained with different weightings of the item scores. We consider reliability estimates for 1-, 2-, and 3-parameter logistic IRT models (1PL, 2PL, and 3PL) for tests of dichotomously scored items, using IRT calibrations from two datasets. The generalized recursion procedure is shown to produce conditional probability distributions for the considered IRT estimators that can be used in the estimation of marginal reliabilities and conditional accuracies (biases and CSEMs). These reliabilities and conditional accuracies are shown to have less extreme and more plausible values than those from theoretical approaches based on test information. The proposed recursion procedure for the estimation of reliability and other accuracy measures is demonstrated for testing situations involving different test lengths, IRT models, and different types of IRT parameter inaccuracies.
Tim Moses, YoungKoung Kim. IRT Scoring and Recursion for Estimating Reliability and Other Accuracy Indices. Journal of Educational Measurement, 62(4), 718–739. https://doi.org/10.1111/jedm.70008 (published September 28, 2025).
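The conditional score distributions described in the abstract can be illustrated with the classic Lord-Wingersky recursion that generalized procedures of this kind build on. The sketch below assumes dichotomous 2PL items with integer score weights and computes the conditional distribution of the weighted summed score at a given theta, from which a conditional standard error of measurement (CSEM) follows; the item parameters shown are illustrative.

```python
import numpy as np

def conditional_score_dist(theta, a, b, w):
    """Lord-Wingersky recursion: P(weighted summed score = s | theta) for
    dichotomous 2PL items with discriminations a, difficulties b, integer weights w."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))    # P(correct | theta) per item
    max_score = int(np.sum(w))
    f = np.zeros(max_score + 1)
    f[0] = 1.0                                    # before any item, the score is 0
    for pk, wk in zip(p, np.asarray(w, int)):
        g = f * (1.0 - pk)                        # incorrect response: score unchanged
        g[wk:] += f[:f.size - wk] * pk            # correct response: score increases by wk
        f = g
    return f

# Conditional mean and CSEM of a unit-weighted summed score at theta = 0, three items
a = np.array([1.0, 1.5, 0.8]); b = np.array([-0.5, 0.0, 0.5]); w = np.array([1, 1, 1])
f = conditional_score_dist(0.0, a, b, w)
scores = np.arange(f.size)
mean = (scores * f).sum()
csem = np.sqrt(((scores - mean) ** 2 * f).sum())
```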
Inadequate test-taking effort poses a significant challenge, particularly when low-stakes test results inform high-stakes policy and psychometric decisions. We examined how rapid guessing (RG), a common form of low test-taking effort, biases item parameter estimates, particularly the discrimination and difficulty parameters. Previous research reported conflicting findings on the direction of bias and what contributes to it. Using simulated data that replicate real-world, low-stakes testing conditions, this study reconciles the inconsistencies by identifying the conditions under which item parameters are over- or underestimated. Bias is influenced by item-related factors (true parameter values and the number of RG responses the items receive) and examinee-related factors (proficiency differences between rapid guessers and non-rapid guessers, the variability in RG behavior among rapid guessers, and the pattern of RG responses throughout the test). The findings highlight that ignoring RG not only distorts proficiency estimates but may also impact broader test operations, including adaptive testing, equating, and standard setting. By demonstrating the potential far-reaching effects of RG, we underline the need for testing professionals to implement methods that mitigate RG's impact (such as motivation filtering) to protect the integrity of their psychometric work.
Sarah Alahmadi, Christine E. DeMars. From Item Estimates to Test Operations: The Cascading Effect of Rapid Guessing. Journal of Educational Measurement, 62(4), 740–762. https://doi.org/10.1111/jedm.70010 (published September 28, 2025; open access: https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.70010).
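Rapid guessing is commonly operationalized with response-time thresholds, and motivation filtering then treats flagged responses as missing before calibration. The sketch below uses a simple normative threshold (a fixed fraction of each item's median response time); the threshold fraction and function names are illustrative choices, not those of the article.

```python
import numpy as np

def flag_rapid_guesses(rt, threshold_frac=0.10):
    """Flag a response as a rapid guess when its time falls below threshold_frac
    of the item's median response time. rt: (n_examinees, n_items) array."""
    item_thresholds = threshold_frac * np.nanmedian(rt, axis=0)   # one threshold per item
    return rt < item_thresholds                                   # boolean RG flags

def motivation_filter(responses, rg_flags):
    """Treat flagged responses as missing so they are excluded from calibration."""
    filtered = responses.astype(float).copy()
    filtered[rg_flags] = np.nan
    return filtered
```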
Peter van Rijn, Francesco Avvisati. Special Issue: Adaptive Testing in Large-Scale Assessments. Journal of Educational Measurement, 62(3), 385–391. https://doi.org/10.1111/jedm.70009 (published September 25, 2025).
Joseph H. Grochowalski, Lei Wan, Lauren Molin, Amy H. Hendrickson
The Beuk standard setting method derives cut scores through expert judgment that balances content and normative perspectives. This study developed a method to estimate confidence intervals for Beuk cut scores and assessed their accuracy via simulations. The simulations varied subject matter expert (SME) panel size, expert agreement, cut score location, score distribution, and decision alignment. Panels with 20 or more participants provided precise and accurate cut score estimates when expert agreement was strong, and larger panels did not substantially improve precision. Cut score location influenced confidence interval widths, highlighting its importance in planning. Real data showed that SME disagreement increased the bias and variance of Beuk estimates. Beuk cut scores should be used cautiously with small panels, flat score distributions, or substantial expert disagreement.
The Precision and Bias of Cut Score Estimates from the Beuk Standard Setting Method. Journal of Educational Measurement, 62(4), 687–717. https://doi.org/10.1111/jedm.70007 (published September 9, 2025).
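For readers unfamiliar with the mechanics, the sketch below computes a Beuk compromise cut score as commonly described (intersecting the line through the mean judged point, with slope given by the ratio of judge standard deviations, with the empirical pass-rate curve) and adds a percentile bootstrap over judges as one simple way to form an interval. Both pieces are illustrative assumptions; the bootstrap in particular is not necessarily the interval estimator developed in the article.

```python
import numpy as np

def beuk_cut_score(judge_cuts, judge_pass_rates, observed_scores):
    """Beuk compromise cut score: intersect the line through (x_bar, y_bar),
    with slope -s_y/s_x, with the empirical pass rate as a function of cut score."""
    x = np.asarray(judge_cuts, float)            # judged minimum passing scores
    y = np.asarray(judge_pass_rates, float)      # judged acceptable pass rates (0-1)
    slope = -np.std(y, ddof=1) / max(np.std(x, ddof=1), 1e-12)
    candidates = np.arange(observed_scores.min(), observed_scores.max() + 1)
    pass_rate = np.array([(observed_scores >= c).mean() for c in candidates])
    line = y.mean() + slope * (candidates - x.mean())
    return candidates[np.argmin(np.abs(pass_rate - line))]

def beuk_bootstrap_ci(judge_cuts, judge_pass_rates, observed_scores,
                      n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over judges (an illustrative interval only)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(judge_cuts), np.asarray(judge_pass_rates)
    cuts = []
    for _ in range(n_boot):
        idx = rng.integers(0, x.size, x.size)    # resample judges with replacement
        cuts.append(beuk_cut_score(x[idx], y[idx], observed_scores))
    return np.quantile(cuts, [alpha / 2, 1 - alpha / 2])
```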
Traditional methods for detecting cheating on assessments tend to focus on either identifying cheaters or compromised items in isolation, overlooking their interconnection. In this study, we present a novel biclustering approach that simultaneously detects both cheaters and compromised items by identifying coherent subgroups of examinees and items exhibiting suspicious response patterns. To identify these patterns, our method leverages response accuracy, response time, and distractor choice data. We evaluated the approach on real datasets and compared its performance with existing detection approaches. Additionally, a comprehensive simulation study was conducted, modeling a variety of realistic cheating scenarios such as answer copying, pre-knowledge of test items, and distinct forms of rapid guessing. Our findings revealed that the biclustering method outperformed previous methods in simultaneously distinguishing cheating and non-cheating behaviors within the empirical study. The simulation analyses further revealed the conditions under which the biclustering approach was most effective in both regards. Overall, the findings underscore the flexibility of biclustering and its adaptability in enhancing test security within diverse testing environments.
Hyeryung Lee, Walter P. Vispoel. Simultaneous Detection of Cheaters and Compromised Items Using a Biclustering Approach. Journal of Educational Measurement, 62(4), 608–638. https://doi.org/10.1111/jedm.70004 (published September 8, 2025).
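The article combines response accuracy, response time, and distractor choices in its biclustering. As a generic illustration only, and not the authors' algorithm, the sketch below builds a simple examinee-by-item anomaly matrix (correct and unusually fast responses) and applies off-the-shelf spectral co-clustering; the anomaly definition and thresholds are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

def anomaly_matrix(correct, rt, rt_frac=0.25):
    """1 where a response is both correct and unusually fast (below rt_frac of the
    item's median time), else 0. correct, rt: (n_examinees, n_items) arrays."""
    fast = rt < rt_frac * np.median(rt, axis=0)
    return (correct.astype(bool) & fast).astype(float)

def bicluster_examinees_and_items(correct, rt, n_clusters=2, seed=0):
    """Co-cluster examinees and items on the anomaly matrix; the densest bicluster
    is a candidate group of cheaters and compromised items."""
    X = anomaly_matrix(correct, rt) + 1e-6       # jitter keeps all rows/columns nonzero
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=seed)
    model.fit(X)
    return model.row_labels_, model.column_labels_
```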
This study investigates the estimation of classification consistency and accuracy indices for composite summed and theta scores within the simple structure multidimensional IRT (SS-MIRT) framework, using five popular approaches: the Lee, Rudner, Guo, Bayesian EAP, and Bayesian MCMC approaches. The procedures are illustrated through analyses of two real datasets and further evaluated via a simulation study under various conditions. Overall, results indicated that all five approaches performed well, producing classification index estimates that were highly consistent in both magnitude and pattern. However, the results also indicated that factors such as the ability estimator, score metric, and cut score location can substantially influence estimation outcomes. Consequently, these considerations should guide practitioners in selecting the most appropriate estimation approach for their specific assessment context.
Huan Liu, Won-Chan Lee. Classification Consistency and Accuracy Indices for Simple Structure MIRT Model. Journal of Educational Measurement, 62(4), 663–686. https://doi.org/10.1111/jedm.70006 (published September 4, 2025; open access: https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.70006).
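One of the five approaches, the Rudner method, is easy to illustrate in a simplified unidimensional setting: each examinee's true ability is approximated as normal around the ability estimate, and accuracy is the average probability that the true ability falls in the category assigned by the estimate. The sketch below shows only that simplified case; the SS-MIRT composite case in the article generalizes it.

```python
import numpy as np
from scipy.stats import norm

def rudner_classification_accuracy(theta_hat, se, cut_points):
    """Rudner-style expected classification accuracy for cuts on the theta scale:
    average probability that an examinee's true theta, approximated as
    N(theta_hat, se^2), falls in the category assigned by theta_hat."""
    theta_hat, se = np.asarray(theta_hat, float), np.asarray(se, float)
    cuts = np.concatenate(([-np.inf], np.sort(cut_points), [np.inf]))
    assigned = np.searchsorted(cuts, theta_hat, side="right") - 1   # observed category
    lower, upper = cuts[assigned], cuts[assigned + 1]
    p_assigned = norm.cdf(upper, theta_hat, se) - norm.cdf(lower, theta_hat, se)
    return p_assigned.mean()                                        # marginal accuracy estimate
```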
Yue Zhao, Yuerong Wu, Yanlou Liu, Tao Xin, Yiming Wang
Cognitive diagnosis models (CDMs) are widely used to assess individuals’ latent characteristics, offering detailed diagnostic insights for tailored instructional development. Maximum likelihood estimation using the expectation-maximization algorithm (MLE-EM) or its variants, such as the EM algorithm with monotonic constraints and Bayes modal estimation, typically uses a single set of initial values (SIV). The MLE-EM method is sensitive to initial values, especially when dealing with non-convex likelihood functions. This sensitivity implies that different initial values may converge to different local maximum likelihood solutions, but SIV does not guarantee a satisfactory local optimum. Thus, we introduced the multiple sets of initial values (MIV) method to reduce sensitivity to the choice of initial values. We compared MIV and SIV in terms of convergence, log-likelihood values of the converged solutions, parameter recovery, and time consumption under varying conditions of item quality, sample size, attribute correlation, number of initial sets, and convergence settings. The results showed that MIV outperformed SIV in terms of convergence. Applying the MIV method increased the probability of obtaining solutions with higher log-likelihood values. We also provide a detailed discussion of the small-sample conditions in which MIV performed worse than SIV.
Multiple Sets of Initial Values Method for MLE-EM and Its Variants in Cognitive Diagnosis Models. Journal of Educational Measurement, 62(4), 639–662. https://doi.org/10.1111/jedm.70005 (published September 1, 2025).
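The MIV idea reduces to a small wrapper: run the same EM routine from several random starting points, discard non-converged runs, and keep the solution with the highest log-likelihood. In the sketch below, em_fit is a hypothetical interface standing in for any CDM estimation routine; it is not part of an existing library, and the range of random starting values is an illustrative assumption.

```python
import numpy as np

def fit_with_multiple_starts(em_fit, data, n_starts=10, seed=0):
    """MIV wrapper: run an EM routine from several random starting points and keep
    the converged solution with the highest log-likelihood.

    em_fit(data, init, rng) is a hypothetical interface; it should return a dict
    with keys 'loglik' (float), 'converged' (bool), and 'params'."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        init = rng.uniform(0.05, 0.95)               # e.g., random guess/slip starting values
        fit = em_fit(data, init=init, rng=rng)
        if fit["converged"] and (best is None or fit["loglik"] > best["loglik"]):
            best = fit                               # retain the best local optimum found so far
    return best
```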
Test items with problematic options often require revision to improve their psychometric properties. When an option is identified as ambiguous or nonfunctioning, the traditional approach involves removing the option and conducting another field test to gather new response data—a process that, while effective, is resource-intensive. This study compares two methods for handling option removal: the Retesting method (administering modified items to new examinees) versus the Recalculating method (computationally removing options from existing response data). Through a controlled experiment with multiple-response and matrix-format items, we examined whether these methods produce equivalent item characteristics. Results show striking similarities between methods across multiple psychometric item properties. These findings suggest that the Recalculating method may offer an efficient alternative for items with sufficient option choices. We discuss implementation considerations and present our experimental design and analytical approach as a framework that other testing programs can adapt to evaluate whether the Recalculating method is appropriate for their specific contexts.
William Muntean, Joe Betts, Zhuoran Wang, Hao Jia. Comparing Data-Driven Methods for Removing Options in Assessment Items. Journal of Educational Measurement, 62(4), 588–607. https://doi.org/10.1111/jedm.70003 (published September 1, 2025; open access: https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.70003).
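A minimal sketch of the Recalculating idea for a multiple-response item: drop the flagged option from the stored responses and from the key, rescore, and compare classical statistics with the original scoring. The one-point-per-option scoring rule used here is an assumption for illustration and may differ from the operational rule in the study.

```python
import numpy as np

def rescore_without_option(selected, key, drop_option):
    """Recalculating sketch: remove one option from existing responses and rescore.
    selected: (n_examinees, n_options) 0/1 matrix of options each examinee marked.
    key:      (n_options,) 0/1 vector of keyed options.
    Scoring assumption: one point for each remaining option judged correctly."""
    keep = np.ones(key.size, dtype=bool)
    keep[drop_option] = False                           # drop the flagged option
    return (selected[:, keep] == key[keep]).sum(axis=1)

def compare_item_stats(before, after):
    """Classical statistics for the item before and after option removal."""
    return {"mean_before": before.mean(), "mean_after": after.mean(),
            "sd_before": before.std(ddof=1), "sd_after": after.std(ddof=1)}
```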