"Toward Psychometric Learning Analytics: Augmenting the Urnings Algorithm with Response Times" — Bence Gergely, Maria Bolsinova. Journal of Educational Measurement, 63(1). doi:10.1111/jedm.70026 (published 2025-12-13).

Adaptive learning systems (ALS) aim to tailor educational material to the student's needs, ultimately improving learning outcomes. An ALS dynamically adjusts the level of practice based on the student's ability; therefore, obtaining accurate ability estimates is crucial. Since the number of responses available in a given timeframe is limited, high measurement precision is unattainable using accuracies alone, which calls for the inclusion of other data sources in the measurement. Here, we propose algorithms that can estimate abilities on the fly based on both accuracy and response times (RT). These are extensions of a rating system called the Urnings algorithm. Since the Urnings algorithm uses discrete updates, building on the difference between a single observed and simulated response, we combined accuracy and RT into a continuous score using a discretized version of the Signed Residual Time (SRT) scoring rule. Through simulation studies, we showed that by augmenting the algorithm with RT, a reliable ability measure and better ability tracking can be obtained while administering fewer items. By reanalyzing data from an existing ALS, we showed that the algorithms can be used even if the SRT scoring rule is not explicitly applied during measurement, providing better ability estimates and smaller prediction errors.
Real-time anomaly detection offers the opportunity to adopt mitigation strategies or interventions to yield credible test scores. Using the signed likelihood ratio test index in real-time to detect examinees with preknowledge (EWP), we propose a novel mitigation strategy that routes flagged examinees to secure content to reduce the effects of preknowledge. In a simulation study, we evaluated the inflation reduction in proficiency estimates by routing flagged examinees to highly secure items. We demonstrated the effects of preknowledge on proficiency estimates, illustrating potential scale drift. We showed a greater chance of flagging EWP of low proficiency early in the test, providing a better opportunity to correct score inflation. Using routing as a mitigation strategy significantly reduced the overall inflation in proficiency estimates for EWP, especially when routing began early in the test. As a result, routing proved to be an effective strategy to reduce overall scale drift due to preknowledge. Routing intervention significantly increased classification accuracy based on pass rates, positive predictive values, and percent agreement for EWP in a high-stakes licensure setting. We also noted that routing did not introduce any inflation into the proficiency estimates of null examinees (false positives).
{"title":"Reducing Score Inflation through Real-Time Routing of Examinees with Preknowledge","authors":"Merve Sarac, James A. Wollack","doi":"10.1111/jedm.70023","DOIUrl":"10.1111/jedm.70023","url":null,"abstract":"<p>Real-time anomaly detection offers the opportunity to adopt mitigation strategies or interventions to yield credible test scores. Using the signed likelihood ratio test index in real-time to detect examinees with preknowledge (EWP), we propose a novel mitigation strategy that routes flagged examinees to secure content to reduce the effects of preknowledge. In a simulation study, we evaluated the inflation reduction in proficiency estimates by routing flagged examinees to highly secure items. We demonstrated the effects of preknowledge on proficiency estimates, illustrating potential scale drift. We showed a greater chance of flagging EWP of low proficiency early in the test, providing a better opportunity to correct score inflation. Using routing as a mitigation strategy significantly reduced the overall inflation in proficiency estimates for EWP, especially when routing began early in the test. As a result, routing proved to be an effective strategy to reduce overall scale drift due to preknowledge. Routing intervention significantly increased classification accuracy based on pass rates, positive predictive values, and percent agreement for EWP in a high-stakes licensure setting. We also noted that routing did not introduce any inflation into the proficiency estimates of null examinees (false positives).</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"63 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146091257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Exploring the Influence of Response Time Allocation on Item Revisiting: Implications for Test-Taking Strategies in Cognitive Diagnostic Assessments" — Ziyuan Zhao, Jiwei Zhang, Jing Lu. Journal of Educational Measurement, 63(1). doi:10.1111/jedm.70021 (published 2025-11-27).

Computer-based assessments offer readily available process data for analysis to gain a deeper understanding of the response process. A common response strategy is item revisiting, which can reduce examinees' anxiety and improve their chances of answering questions correctly, and data on item revisiting are recorded automatically in system logs. The approach reported here is to combine two useful and easily accessible types of process data—item response times and item-revisiting data—with a cognitive diagnostic model to enhance accuracy, identify examinees' level of mastery in specific skills within a particular knowledge domain, and provide personalized diagnostic feedback. The modeling involves two monotonicity hypotheses: (1) examinees who engaged in more revisiting on previous items are more likely to revisit the current item; (2) a longer accumulated response time on previous items leaves less remaining time, reducing the likelihood of revisiting the current item. Unlike previous studies in which response time was modeled separately, the focus here is on examinees' revisiting behavior; thus, response time is included in the revisiting model as a covariate. This allows an in-depth investigation of how accumulated response time influences revisiting behavior, as well as an exploration of the relationship between response strategy (i.e., item revisiting) and time allocation. A Markov chain Monte Carlo approach is used for parameter estimation, and its effectiveness is evaluated using two Bayesian evaluation criteria based on posterior samples. Simulation results show that this method is effective for recovering parameters, and an example analysis verifies the proposed model.
"Using GPT-4 to Augment Imbalanced Data for Automatic Scoring" — Luyang Fang, Gyeonggeon Lee, Xiaoming Zhai. Journal of Educational Measurement, 62(4), 959-995. doi:10.1111/jedm.70020 (published 2025-11-19).

Machine learning-based automatic scoring faces challenges with imbalanced student responses across scoring categories. To address this, we introduce a novel text data augmentation framework, tailored to imbalanced datasets in automatic scoring, that leverages GPT-4, a generative large language model. Our experimental dataset consisted of student-written responses to four science items. We crafted prompts for GPT-4 to generate responses, especially for minority scoring classes, enhancing the dataset. We then fine-tuned DistilBERT for automatic scoring based on the augmented and original datasets. Model performance was assessed using accuracy, precision, recall, and F1 metrics. Our findings revealed that incorporating GPT-4-augmented data significantly improved model performance, particularly in terms of precision and F1 scores. Interestingly, the extent of improvement varied depending on the specific dataset and the proportion of augmented data used. Notably, we found that a varying amount of augmented data (20%-40%) was required to achieve stable improvement in automatic scoring. Comparisons with models trained on additional student-written responses suggest that GPT-4-augmented models align with those trained on student data. This research highlights the potential and effectiveness of data augmentation techniques, utilizing generative large language models like GPT-4, in addressing imbalanced datasets within automatic assessment.
"Vertical Scaling with Moderated Nonlinear Factor Analysis" — Sanford R. Student. Journal of Educational Measurement, 62(4), 929-958. doi:10.1111/jedm.70019 (published 2025-11-09).

Vertical scales are intended to establish a common metric for scores on test forms targeting different levels of development in a specified domain. They are often constructed using common item, nonequivalent group designs that implicitly rely on the linking items being effectively free from differential item functioning (DIF) or the DIF being symmetric to produce unbiased linking constants. Moderated Nonlinear Factor Analysis (MNLFA) is a measurement model that can be used to understand both the presence of DIF among vertical scale common items and the extent to which the presence of DIF may affect grade-to-grade score distributions. Monte Carlo simulation and synthetic data applications show how models that do and do not account for DIF in vertical scale common items can produce meaningfully different answers to the fundamental question of how much students grow from one grade to the next, but that when DIF is not present, MNLFA provides effectively identical growth estimates to traditional concurrent and characteristic curve approaches to vertical linking.
"A Quantitative Method for Evaluating the Predictive Utility of Linked Scores" — Yoshikazu Sato, Tadashi Shibayama. Journal of Educational Measurement, 62(4), 907-928. doi:10.1111/jedm.70018 (published 2025-11-04).

In U.S. colleges, admissions officers tend to use ACT-SAT concordant scores, also known as linked scores, as predictions of individual scores for tests not taken. The major problem in this situation is the use of linked scores without thoroughly examining their predictive utility (i.e., the degree to which they serve as predicted scores at the individual level). To address this problem, we developed a method, referred to as the "predictive utility analysis," for quantitatively evaluating the prediction accuracy and error properties of linked scores. A Monte Carlo simulation provided several findings on the behavior of the indices formulated in this paper regarding the number of common examinees, the number of items, and the correlation between tests. Furthermore, we illustrated the predictive utility analysis in concordance and equating with the results of an actual large-scale test, the Japan Law School Admission Test. In both examples, we found that the linked scores obtained by using the equipercentile or linear equating method could be used as predictions of individual scores. Our findings suggest that the predictive utility analysis offers practical guidance for enhancing the use of linked scores as well as supporting institutional accountability.
"Identifying Features Contributing to Differential Prediction Bias of Automated Scoring Systems" — Ikkyu Choi, Matthew S. Johnson. Journal of Educational Measurement, 62(4), 838-861. doi:10.1111/jedm.70015 (published 2025-11-03).

Automated scoring systems provide multiple benefits but also pose challenges, notably potential bias. Various methods exist to evaluate these algorithms and their outputs for bias. Upon detecting bias, the next logical step is to investigate its cause, often by examining feature distributions. Recently, Johnson and McCaffrey proposed an exploratory approach to identify features responsible for differential prediction bias. However, their approach applies only to linear additive prediction models, excluding many machine learning algorithms. In this paper, we propose the bias contribution measure, a statistic that extends Johnson and McCaffrey's approach to any prediction algorithm that has partial derivatives and that can be implemented in any framework that supports automatic differentiation and matrix inversion. We demonstrated its application and effectiveness on synthetic and real-world data using multiple nonlinear prediction algorithms, including a single-layer feed-forward network (FFN), a support vector regressor, and a deep FFN with multiple hidden layers. In the synthetic data examples, the bias contribution measure successfully identified the feature responsible for the bias. When applied to a real-world data set, the bias contribution measure consistently identified the same set of features across all considered prediction algorithms.
We fine-tuned and compared several encoder-based Transformer large language models (LLMs) to predict differential item functioning (DIF) from the item text. We then applied explainable artificial intelligence (XAI) methods to identify specific words associated with the DIF prediction. The data included 42,180 items designed for English language arts and mathematics summative state assessments among students in grades 3 to 11. Prediction