Machine learning-based automatic scoring faces challenges when student responses are imbalanced across scoring categories. To address this, we introduce a novel text data augmentation framework, tailored to imbalanced datasets in automatic scoring, that leverages GPT-4, a generative large language model. Our experimental dataset consisted of student-written responses to four science items. We crafted prompts for GPT-4 to generate additional responses, especially for the minority scoring classes, thereby enlarging the dataset. We then fine-tuned DistilBERT for automatic scoring on both the augmented and the original datasets. Model performance was assessed using accuracy, precision, recall, and F1. Our findings revealed that incorporating GPT-4-augmented data significantly improved model performance, particularly precision and F1. The extent of improvement varied with the specific dataset and the proportion of augmented data used; notably, a varying amount of augmented data (20%–40%) was required to achieve stable improvement in automatic scoring. Comparisons with models trained on additional student-written responses suggest that GPT-4-augmented models perform comparably to those trained on student data. This research highlights the potential and effectiveness of data augmentation with generative large language models such as GPT-4 for addressing imbalanced datasets in automatic assessment.
{"title":"Using GPT-4 to Augment Imbalanced Data for Automatic Scoring","authors":"Luyang Fang, Gyeonggeon Lee, Xiaoming Zhai","doi":"10.1111/jedm.70020","DOIUrl":"https://doi.org/10.1111/jedm.70020","url":null,"abstract":"<p>Machine learning-based automatic scoring faces challenges with imbalanced student responses across scoring categories. To address this, we introduce a novel text data augmentation framework that leverages GPT-4, a generative large language model specifically tailored for imbalanced datasets in automatic scoring. Our experimental dataset consisted of student-written responses to four science items. We crafted prompts for GPT-4 to generate responses, especially for minority scoring classes, enhancing the dataset. We then fine-tuned DistilBERT for automatic scoring based on the augmented and original datasets. Model performance was assessed using accuracy, precision, recall, and <i>F</i>1 metrics. Our findings revealed that incorporating GPT-4-augmented data significantly improved model performance, particularly in terms of precision and <i>F</i>1 scores. Interestingly, the extent of improvement varied depending on the specific dataset and the proportion of augmented data used. Notably, we found that a varying amount of augmented data (20%-40%) was required to achieve stable improvement in automatic scoring. Comparisons with models trained on additional student-written responses suggest that GPT-4 augmented models align with those trained on student data. This research highlights the potential and effectiveness of data augmentation techniques, utilizing generative large language models like GPT-4, in addressing imbalanced datasets within automatic assessment.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 4","pages":"959-995"},"PeriodicalIF":1.6,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.70020","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145761251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vertical scales are intended to establish a common metric for scores on test forms targeting different levels of development in a specified domain. They are often constructed using common-item, nonequivalent-groups designs that implicitly rely on the linking items being effectively free from differential item functioning (DIF), or on the DIF being symmetric, to produce unbiased linking constants. Moderated Nonlinear Factor Analysis (MNLFA) is a measurement model that can be used to understand both the presence of DIF among vertical scale common items and the extent to which that DIF may affect grade-to-grade score distributions. Monte Carlo simulation and synthetic data applications show that models that do and do not account for DIF in vertical scale common items can produce meaningfully different answers to the fundamental question of how much students grow from one grade to the next. When DIF is not present, however, MNLFA provides effectively identical growth estimates to traditional concurrent and characteristic curve approaches to vertical linking.
{"title":"Vertical Scaling with Moderated Nonlinear Factor Analysis","authors":"Sanford R. Student","doi":"10.1111/jedm.70019","DOIUrl":"https://doi.org/10.1111/jedm.70019","url":null,"abstract":"<p>Vertical scales are intended to establish a common metric for scores on test forms targeting different levels of development in a specified domain. They are often constructed using common item, nonequivalent group designs that implicitly rely on the linking items being effectively free from differential item functioning (DIF) or the DIF being symmetric to produce unbiased linking constants. Moderated Nonlinear Factor Analysis (MNLFA) is a measurement model that can be used to understand both the presence of DIF among vertical scale common items and the extent to which the presence of DIF may affect grade-to-grade score distributions. Monte Carlo simulation and synthetic data applications show how models that do and do not account for DIF in vertical scale common items can produce meaningfully different answers to the fundamental question of how much students grow from one grade to the next, but that when DIF is not present, MNLFA provides effectively identical growth estimates to traditional concurrent and characteristic curve approaches to vertical linking.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 4","pages":"929-958"},"PeriodicalIF":1.6,"publicationDate":"2025-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145761444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In U.S. colleges, admissions officers tend to use ACT-SAT concordant scores, also known as linked scores, as predictions of individual scores on tests not taken. The major problem in this situation is that linked scores are used without a thorough examination of their predictive utility (i.e., the degree to which they serve as predicted scores at the individual level). To address this problem, we developed a method, referred to as “predictive utility analysis,” for quantitatively evaluating the prediction accuracy and error properties of linked scores. A Monte Carlo simulation showed how the indices formulated in this paper behave with respect to the number of common examinees, the number of items, and the correlation between tests. Furthermore, we illustrated the predictive utility analysis in concordance and equating using results from an actual large-scale test, the Japan Law School Admission Test. In both examples, we found that linked scores obtained with the equipercentile or linear equating method could be used as predictions of individual scores. Our findings suggest that predictive utility analysis offers practical guidance for enhancing the use of linked scores and for supporting institutional accountability.
{"title":"A Quantitative Method for Evaluating the Predictive Utility of Linked Scores","authors":"Yoshikazu Sato, Tadashi Shibayama","doi":"10.1111/jedm.70018","DOIUrl":"https://doi.org/10.1111/jedm.70018","url":null,"abstract":"<p>In U.S. colleges, admissions officers tend to use ACT-SAT concordant scores, also known as linked scores, as predictions of individual scores for tests not taken. The major problem in this situation is the use of linked scores without thoroughly examining their predictive utility (i.e., the degree to which they serve as predicted scores at the individual level). To address this problem, we developed a method, referred to as the “predictive utility analysis,” for quantitatively evaluating the prediction accuracy and error properties of linked scores. A Monte Carlo simulation provided several findings on the behavior of the indices formulated in this paper regarding the number of common examinees, the number of items, and the correlation between tests. Furthermore, we illustrated the predictive utility analysis in concordance and equating with the results of an actual large-scale test, the Japan Law School Admission Test. In both examples, we found that the linked scores obtained by using the equipercentile or linear equating method could be used as predictions of individual scores. Our findings suggest that the predictive utility analysis offers practical guidance for enhancing the use of linked scores as well as supporting institutional accountability.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 4","pages":"907-928"},"PeriodicalIF":1.6,"publicationDate":"2025-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.70018","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145761421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated scoring systems provide multiple benefits but also pose challenges, notably potential bias. Various methods exist for evaluating these algorithms and their outputs for bias. Upon detecting bias, the next logical step is to investigate its cause, often by examining feature distributions. Recently, Johnson and McCaffrey proposed an exploratory approach to identify the features responsible for differential prediction bias. However, their approach applies only to linear additive prediction models, excluding many machine learning algorithms. In this paper, we propose the bias contribution measure, a statistic that extends Johnson and McCaffrey's approach to any prediction algorithm with partial derivatives and that can be implemented in any framework supporting automatic differentiation and matrix inversion. We demonstrated its application and effectiveness on synthetic and real-world data using multiple nonlinear prediction algorithms, including a single-layer feed-forward network (FFN), a support vector regressor, and a deep FFN with multiple hidden layers. In the synthetic data examples, the bias contribution measure successfully identified the feature responsible for the bias. When applied to a real-world dataset, it consistently identified the same set of features across all prediction algorithms considered.
{"title":"Identifying Features Contributing to Differential Prediction Bias of Automated Scoring Systems","authors":"Ikkyu Choi, Matthew S. Johnson","doi":"10.1111/jedm.70015","DOIUrl":"https://doi.org/10.1111/jedm.70015","url":null,"abstract":"<p>Automated scoring systems provide multiple benefits but also pose challenges, notably potential bias. Various methods exist to evaluate these algorithms and their outputs for bias. Upon detecting bias, the next logical step is to investigate its cause, often by examining feature distributions. Recently, Johnson and McCaffrey proposed an exploratory approach to identify features responsible for differential prediction bias. However, their approach applies only to linear additive prediction models, excluding many machine learning algorithms. In this paper, we propose the bias contribution measure, a statistic that expands Johnson and McCaffrey's approach to any prediction algorithms that have partial derivatives and that can be implemented in any framework that supports automatic differentiation and matrix inversion. We demonstrated its application and effectiveness on synthetic and real-word data using multiple nonlinear prediction algorithms, including a single-layer feed-forward network (FFN), a support vector regressor, and a deep FFN with multiple hidden layers. In the synthetic data examples, the bias contribution measure successfully identified the feature responsible for the bias. When applied to a real-world data set, the bias contribution measure consistently identified the same set of features across all considered prediction algorithms.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 4","pages":"838-861"},"PeriodicalIF":1.6,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145761332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We fine-tuned and compared several encoder-based Transformer large language models (LLMs) to predict differential item functioning (DIF) from item text. We then applied explainable artificial intelligence (XAI) methods to identify the specific words associated with the DIF prediction. The data included 42,180 items designed for English language arts and mathematics summative state assessments administered to students in grades 3 to 11. Prediction