Evaluating General-Purpose Multimodal AI for Q-Matrix Generation from Math Items: A Cognitive Diagnostic Modeling Exploration
Kang Xue, James J. Appleton
Journal of Educational Measurement, 63(1). https://doi.org/10.1111/jedm.70028

Cognitive Diagnostic Models (CDMs) provide fine-grained diagnostic feedback, but their central component, the Q-matrix, remains costly and labor-intensive to construct. This study explores the automated generation of Q-matrices using general-purpose AI chatbots: ChatGPT-4o, Gemini-2.5-pro, and Claude-sonnet-4. We evaluated two prompting strategies (all-at-once and one-by-one) across the TIMSS 2007, TIMSS 2011, and PISA 2012 mathematics assessments. Results show that AI-generated Q-matrices approximate human baselines, with competitive model fit (AIC, BIC, log-likelihood, and SRMSR) and acceptable classification discrepancies. Although the AI-generated Q-matrices for the larger and more complex assessments (TIMSS 2007 and 2011) were generally sparser than the human-generated ones, they still achieved equal or better fit statistics under most CDMs. For the smaller and less complex PISA 2012 assessment, AI-generated Q-matrices matched the human baseline in both density and fit. Importantly, chatbot-human matching accuracy remained high across models: Gemini benefited from all-at-once prompting, ChatGPT-4o performed stably under both strategies, and Claude was sensitive to prompt structure. These findings highlight both the promise and the current limitations of automated Q-matrix generation and underscore opportunities for integrating LLMs into scalable diagnostic assessment practices.
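As a rough illustration of two of the quantities compared in the study, Q-matrix density (sparsity) and chatbot-human cell-wise agreement, the sketch below uses small hypothetical binary item-by-attribute matrices; the matrices, attribute counts, and values are illustrative assumptions, not data from the article.

```python
import numpy as np

# Minimal sketch (not from the article): comparing a hypothetical AI-generated
# Q-matrix against a human-constructed baseline. Rows are items, columns are
# attributes; entries of 1 mean the item is coded as measuring that attribute.
q_human = np.array([[1, 0, 0],
                    [1, 1, 0],
                    [0, 1, 1],
                    [0, 0, 1]])
q_ai = np.array([[1, 0, 0],
                 [1, 0, 0],
                 [0, 1, 1],
                 [0, 0, 1]])

# Density: proportion of item-attribute cells coded 1 (a sparser AI matrix has lower density).
density_human = q_human.mean()
density_ai = q_ai.mean()

# Cell-wise agreement between the two specifications.
agreement = (q_human == q_ai).mean()

print(f"human density={density_human:.2f}, AI density={density_ai:.2f}, agreement={agreement:.2f}")
```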
Correction to "Using GPT-4 to Augment Imbalanced Data for Automatic Scoring"
Journal of Educational Measurement, 63(1). https://doi.org/10.1111/jedm.70033

Fang, L., Lee, G., & Zhai, X. (2025). Using GPT-4 to augment imbalanced data for automatic scoring. Journal of Educational Measurement, 62, 959–995. https://doi.org/10.1111/jedm.70020
Gyeonggeon Lee carried out his work for this article while at the University of Georgia and continued it after moving to Nanyang Technological University, Singapore, where he is now affiliated. The original article erroneously indicated that the work was conducted solely at the University of Georgia.
We apologize for this error.
AI and Measurement Concerns: Dealing with Imbalanced Data in Autoscoring
Yunting Liu, Yijun Xiang, Xutao Feng, Mark Wilson
Journal of Educational Measurement, 63(1). https://doi.org/10.1111/jedm.70031

Unbiasedness of proficiency estimates is important for autoscoring engines, since the outcomes may be used for future learning or placement decisions. Imbalanced training data can bias classification algorithms and lower their prediction accuracy. In this article, we investigated several data augmentation methods for reducing the negative effects of imbalanced data in measurement settings. Four approaches were examined: (1) resampling methods, either oversampling or undersampling; (2) active resampling methods, in which the resampling weight is based on representativeness in the training set; (3) data expansion methods using synonym replacement, which slightly change the meaning or semantics of the original answers; and (4) content recreation methods using generative AI (e.g., ChatGPT) to create responses for sparsely populated score levels. We compared performance metrics (e.g., accuracy, QWK, F1) as well as a distance metric for different combinations of these methods. Two datasets with different imbalanced distributions were used. Results show that all four methods can help mitigate the bias issue; their efficacy was influenced by the level of imbalance, the representativeness of the original data, and the increase in response variety (i.e., lexical diversity). In general, resampling and generative AI combined with active resampling showed the best overall performance.
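A minimal sketch of one of the simpler approaches named in the abstract, plain random oversampling of minority score levels, together with quadratic weighted kappa (QWK) as an agreement metric. The responses, scores, and class sizes are hypothetical; this is not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

def oversample(texts, scores):
    """Randomly resample responses within each score level until every level
    matches the size of the largest one (plain random oversampling)."""
    scores = np.asarray(scores)
    counts = {s: np.sum(scores == s) for s in np.unique(scores)}
    target = max(counts.values())
    idx = []
    for s in counts:
        pool = np.flatnonzero(scores == s)
        idx.extend(rng.choice(pool, size=target, replace=True))
    idx = np.array(idx)
    return [texts[i] for i in idx], scores[idx]

# Illustrative imbalanced data: score level 2 is rare.
texts = ["resp_a", "resp_b", "resp_c", "resp_d", "resp_e", "resp_f"]
scores = [0, 0, 0, 1, 1, 2]
bal_texts, bal_scores = oversample(texts, scores)

# Quadratic weighted kappa (QWK), one of the agreement metrics reported in the study,
# here computed between hypothetical human and machine scores.
human = [0, 1, 2, 1, 0, 2]
machine = [0, 1, 1, 1, 0, 2]
print(cohen_kappa_score(human, machine, weights="quadratic"))
```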
Generalizability Theory for Randomly Parallel Testing
Won-Chan Lee, Stella Y. Kim, Seungwon Shin
Journal of Educational Measurement, 63(1). https://doi.org/10.1111/jedm.70029

Advancements in artificial intelligence (AI) have brought significant changes to testing practices, including the emergence of randomly parallel testing (RPT), in which examinees receive different but psychometrically similar sets of items generated from templates or AI-based systems. This paper presents a generalizability theory (GT) framework for estimating conditional standard errors of measurement (CSEMs) and related reliability indices, with a particular focus on design structures commonly encountered in RPT within domain-referenced testing contexts. The proposed framework supports the evaluation of score precision across a variety of operational designs, including crossed, nested, and multivariate configurations. Several illustrative examples are provided to demonstrate the methodology in practical settings. The paper also addresses key psychometric and interpretive challenges associated with RPT and outlines promising directions for future research.
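For readers unfamiliar with conditional SEMs, the sketch below computes a conditional absolute-error SEM for each examinee in the simplest persons-crossed-with-items (p x I) design, that is, the standard error of the examinee's mean item score. It is only a toy version of the simplest design; the nested and multivariate configurations treated in the paper are not shown, and the data are hypothetical.

```python
import numpy as np

def conditional_sem(scores):
    """scores: persons x items array of item scores (e.g., 0/1).
    Returns one conditional SEM per person: the standard error of that
    person's mean item score in a simple p x I design."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    within_person_var = scores.var(axis=1, ddof=1)   # sample variance across the person's items
    return np.sqrt(within_person_var / n_items)      # SE of the person's mean score

# Hypothetical dichotomous responses for 3 examinees on 6 items.
X = np.array([[1, 1, 0, 1, 1, 0],
              [0, 0, 1, 0, 1, 0],
              [1, 1, 1, 1, 1, 1]])
print(conditional_sem(X))   # CSEM is 0 for the examinee with a perfect score
```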
Simultaneous Detection of Compromised Items and Examinees with Item Preknowledge in Online Assessments Using Response Time Data
Cengiz Zopluoglu
Journal of Educational Measurement, 63(1). https://doi.org/10.1111/jedm.70030

The rapid transition from traditional paper-and-pencil tests to computer-based testing systems has significantly altered the educational landscape, particularly during the COVID-19 pandemic. While online assessments offer numerous advantages, they also present unique challenges, with test security being paramount. This article addresses the critical issue of test fraud in digital assessments, specifically focusing on item preknowledge, where examinees have prior access to test items. Using response-time data, we propose a statistical framework for simultaneously identifying compromised items and examinees with item preknowledge in a single-step analysis. Unlike existing methods, our model does not require prior knowledge about the compromised status of items. Using a large-scale online certification exam dataset, we demonstrate the model's application in detecting significant signals in response times, identifying potentially compromised items, and examinees with potential item preknowledge.
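The article proposes a joint model for compromised items and examinees; as a much simpler illustration of the response-time signal such models exploit, the sketch below standardizes log response times within each item so that strongly negative values flag unusually fast responses. The data and the cutoff are hypothetical and not from the article.

```python
import numpy as np

def standardized_log_rt(rt):
    """rt: examinees x items matrix of response times in seconds.
    Returns item-standardized log response times; large negative values
    indicate responses much faster than is typical for that item."""
    log_rt = np.log(np.asarray(rt, dtype=float))
    mu = log_rt.mean(axis=0)            # per-item mean of log response time
    sd = log_rt.std(axis=0, ddof=1)     # per-item SD of log response time
    return (log_rt - mu) / sd

rt = np.array([[45.0, 60.0, 30.0],
               [50.0, 55.0, 35.0],
               [ 5.0,  6.0,  4.0]])     # third examinee is suspiciously fast everywhere
z = standardized_log_rt(rt)
flags = (z < -1.0)                      # illustrative cutoff, not a calibrated threshold
print(np.round(z, 2))
print(flags)
```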
Improving Ability Estimation Accuracy for Automated Item Generated Forms under Multistage Testing
Stella Y. Kim, Won-Chan Lee
Journal of Educational Measurement, 63(1). https://doi.org/10.1111/jedm.70027

The emergence of automated item generation (AIG) techniques has intensified discussions around their application in assessment development. Some testing companies have already begun developing software to construct exams using AIG. However, the current literature offers limited insights into the characteristics of items generated through AIG, particularly in the realm of multistage testing (MST). This study proposes a novel approach for adjusting template item parameters to enhance ability estimation accuracy under the MST context. A simulation study was conducted using two MST designs with varying numbers of stages and modules. Results demonstrated that the proposed method significantly improved the accuracy of person parameter estimates compared to a more practical, yet less precise, approach that assumes all item clones share identical parameters.
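The sketch below is not the authors' parameter-adjustment method; it only illustrates the underlying issue, namely that maximum-likelihood ability estimates under a 2PL model change depending on whether template (family-level) or clone-specific item parameters are plugged in. All item parameters and responses are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p_2pl(theta, a, b):
    """2PL item response probability."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b):
    """Maximum-likelihood ability estimate given item parameters a, b."""
    def neg_loglik(theta):
        p = np.clip(p_2pl(theta, a, b), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_loglik, bounds=(-4, 4), method="bounded").x

responses = np.array([1, 1, 0, 1, 0, 0])

# Template parameters treat all clones in an item family as identical...
a_template = np.array([1.0, 1.0, 1.0, 1.2, 1.2, 1.2])
b_template = np.array([-0.5, -0.5, -0.5, 0.5, 0.5, 0.5])

# ...whereas (hypothetical) clone-specific parameters allow some variation.
a_clone = np.array([0.9, 1.1, 1.0, 1.3, 1.1, 1.2])
b_clone = np.array([-0.6, -0.4, -0.5, 0.4, 0.6, 0.5])

print(estimate_theta(responses, a_template, b_template))
print(estimate_theta(responses, a_clone, b_clone))
```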
Parameter Estimation in Comparative Judgment Under Random and Adaptive Scheduling Schemes
Ian Hamilton, Nick Tawn
Journal of Educational Measurement, 63(1). https://doi.org/10.1111/jedm.70022

Comparative judgment is an assessment method in which item ratings are estimated from rankings of subsets of the items. These rankings are typically pairwise, with ratings taken to be the estimated parameters from fitting a Bradley-Terry model. Likelihood penalization is often employed to ensure that the estimates are finite, and adaptive scheduling of the comparisons can increase the efficiency of the assessment. We show that the most commonly used penalty in comparative judgment is not the best-performing penalty under adaptive scheduling and can lead to substantial bias in parameter estimation. We demonstrate this using simulated and real data and provide a theoretical explanation for the relative performance of the penalties considered, including identifying a preferred alternative. Further, we propose a novel approach based on a parametric bootstrap, which is found to produce better parameter estimates for adaptive schedules and to be robust to variations in the underlying strength distributions. This work allows for more efficient implementations of comparative judgment.
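A minimal sketch of penalized Bradley-Terry estimation from pairwise comparisons. The ridge (Gaussian) penalty used here is just one common way to keep the estimates finite and is not necessarily the penalty the authors analyze or recommend; the judgments are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(n_items, comparisons, penalty=0.1):
    """comparisons: list of (winner_index, loser_index) pairs.
    Returns centered Bradley-Terry strength parameters from a ridge-penalized fit."""
    comparisons = np.asarray(comparisons)

    def neg_penalized_loglik(beta):
        diff = beta[comparisons[:, 0]] - beta[comparisons[:, 1]]
        loglik = np.sum(diff - np.log1p(np.exp(diff)))   # sum of log P(winner beats loser)
        return -loglik + penalty * np.sum(beta ** 2)     # ridge penalty keeps estimates finite

    res = minimize(neg_penalized_loglik, x0=np.zeros(n_items), method="BFGS")
    beta = res.x
    return beta - beta.mean()   # identify the scale by centering the estimates

# Hypothetical judgments among 4 scripts: script 0 wins often, script 3 never wins.
judgments = [(0, 1), (0, 2), (0, 3), (1, 3), (2, 1), (1, 2), (2, 3)]
print(np.round(fit_bradley_terry(4, judgments), 2))
```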
Aberrant behaviors among test-takers in large-scale assessments are often more prevalent within specific groups or testing sites. While various techniques have been developed to detect individual-level test-takers' aberrant behaviors, research in detecting those behaviors at the group level is rare. We propose a group fit statistic