Automatic Prompt Engineering for Automatic Scoring
Mingfeng Xue, Yunting Liu, Xingyao Xiao, Mark Wilson
Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials. APE was then applied to optimize prompts for each item. We found that, on average, APE increased scoring accuracy by 9%; few-shot learning (i.e., providing multiple labeled examples related to the goal) increased APE performance by 2%; a high temperature (i.e., the parameter controlling output randomness) was needed in at least part of the APE process to improve scoring accuracy; and Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Moreover, compared with the manual scoring instructions, APE tended to restate and reformat the scoring prompts, which could raise concerns about validity. Thus, the creative variability introduced by LLMs raises questions about the balance between innovation and adherence to scoring rubrics.
{"title":"Automatic Prompt Engineering for Automatic Scoring","authors":"Mingfeng Xue, Yunting Liu, Xingyao Xiao, Mark Wilson","doi":"10.1111/jedm.70002","DOIUrl":"https://doi.org/10.1111/jedm.70002","url":null,"abstract":"<p>Prompts play a crucial role in eliciting accurate outputs from large language models (LLMs). This study examines the effectiveness of an automatic prompt engineering (APE) framework for automatic scoring in educational measurement. We collected constructed-response data from 930 students across 11 items and used human scores as the true labels. A baseline was established by providing LLMs with the original human-scoring instructions and materials. APE was then applied to optimize prompts for each item. We found that on average, APE increased scoring accuracy by 9%; few-shot learning (i.e., giving multiple labeled examples related to the goal) increased APE performance by 2%; a high temperature (i.e., a parameter for output randomness) was needed in at least part of the APE to improve the scoring accuracy; Quadratic Weighted Kappa (QWK) showed a similar pattern. These findings support the use of APE in automatic scoring. Moreover, compared with the manual scoring instructions, APE tended to restate and reformat the scoring prompts, which could give rise to concerns about validity. Thus, the creative variability introduced by LLMs raises considerations about the balance between innovation and adherence to scoring rubrics.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 4","pages":"559-587"},"PeriodicalIF":1.6,"publicationDate":"2025-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145761349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Topic Testlet Model for Calibrating Testlet Constructed Responses
Jiawei Xiong, Huan (Hailey) Kuang, Cheng Tang, Qidi Liu, Bowen Wang, George Engelhard Jr., Allan S. Cohen, Xinhui (Maggie) Xiong, Rufei Sheng
Constructed responses (CRs) within testlets are widely used to assess complex skills but can pose calibration challenges due to local item dependence. Some existing testlet models incorporate testlet-specific effects to address local dependence, but these effects are difficult to interpret and, because the models rely only on response or score patterns, they may not fully capture the complexities of CR items. We propose a Topic Testlet Model (TTM) that integrates topic modeling within a psychometric framework. The TTM uses latent topics extracted from students' written responses to adjust for local dependence, enable simultaneous calibration, and provide insights for evaluating student reasoning and writing in testlet CR items. Using empirical data from English Language Arts and Science assessments for grades 3-12, we compare the TTM with existing models in terms of ability estimates, item parameter estimates, and overall model fit. Simulation studies further demonstrate parameter recovery under various testing scenarios. Results show that the TTM effectively accounts for local dependence, improves the interpretability of testlet effects, and fits better than existing models. The TTM advances CR testlet calibration by leveraging additional information from students' written responses to improve the precision of assessment systems and the validity of test score use.
{"title":"A Topic Testlet Model for Calibrating Testlet Constructed Responses","authors":"Jiawei Xiong, Huan (Hailey) Kuang, Cheng Tang, Qidi Liu, Bowen Wang, George Engelhard Jr., Allan S. Cohen, Xinhui (Maggie) Xiong, Rufei Sheng","doi":"10.1111/jedm.70001","DOIUrl":"10.1111/jedm.70001","url":null,"abstract":"<p>Constructed responses (CRs) within testlets are widely used to assess complex skills but can pose calibration challenges due to local item dependence. A few current testlet models incorporate testlet-specific effects to address local dependence but struggle with interpreting these effects and may not fully capture the complexities of CR items because they rely only on response or score patterns. A Topic Testlet Model (TTM) integrates topic modeling within a psychometric framework was proposed. It uses latent topics from student written responses to adjust for local dependence, enable simultaneous calibration, and provide insights into evaluating student reasoning and writing in testlet CR items. Using empirical data from both English Language and Arts as well as Science assessments for grades 3-12, we compare the TTM with existing models in terms of ability estimates, item parameter estimates, and overall model fit. Simulation studies further demonstrate parameter recovery under various testing scenarios. Results show that the TTM effectively accounts for local dependence, improves testlet effect interpretability, and demonstrates a better fit than the existing models. TTM advances CR testlet calibration, leveraging additional information from student written responses to improve the precision of the assessment systems and validity of the use of test scores.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"63 1","pages":""},"PeriodicalIF":1.6,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.70001","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146096362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How Many Plausible Values?
Paul A. Jewsbury, Daniel F. McCaffrey, Yue Jia, Eugenio J. Gonzalez
Large-scale survey assessments (LSAs) such as NAEP, TIMSS, PIRLS, IELS, and NAPLAN produce plausible values of student proficiency for estimating population statistics. Plausible values are imputed values for latent proficiency variables. Although prominently used in LSAs, they are applicable to a wide range of latent variable modeling contexts, such as surveys about psychological dispositions or beliefs. Following the practice of multiple imputation, LSAs produce multiple sets of plausible values for each survey. The criteria used to determine the number of plausible values remain unresolved, and practice is inconsistent. We show analytically and via simulation that the number of plausible values determines the amount of Monte Carlo error in point estimates and standard errors as a function of the fraction of missing information. We derive expressions for the number of plausible values required to reach a given level of precision. We also analyze real data from an LSA to provide guidelines, supported by theory, simulation, and real data, on the number of plausible values. Finally, we illustrate the impact with a power analysis. Our results show that there is meaningful benefit to using more plausible values than LSAs currently generate.
{"title":"How Many Plausible Values?","authors":"Paul A. Jewsbury, Daniel F. McCaffrey, Yue Jia, Eugenio J. Gonzalez","doi":"10.1111/jedm.70000","DOIUrl":"https://doi.org/10.1111/jedm.70000","url":null,"abstract":"<p>Large-scale survey assessments (LSAs) such as NAEP, TIMSS, PIRLS, IELS, and NAPLAN produce plausible values of student proficiency for estimating population statistics. Plausible values are imputed values for latent proficiency variables. While prominently used for LSAs, they are applicable to a wide range of latent variable modelling contexts such as surveys about psychological dispositions or beliefs. Following the practice of multiple imputation, LSAs produce multiple sets of plausible values for each survey. The criteria used to determine the number of plausible values remains unresolved and is inconsistent in practice. We show analytically and via simulation that the number of plausible values used determines the amount of Monte Carlo error on point estimates and standard errors as a function of the fraction of missing information. We derive expressions to determine the number of plausible values required to reach a given level of precision. We analyze real data from a LSA to provide guidelines supported by theory, simulation, and real data on the number of plausible values. Finally, we illustrate the impact with a power analysis. Our results show there is meaningful benefit to the use of greater numbers of plausible values than currently generated by LSAs.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 4","pages":"531-558"},"PeriodicalIF":1.6,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145761177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While testlets have proven useful for assessing complex skills, the stem shared by multiple items often induces correlations among responses, leading to violations of local independence (LI), which can bias parameter and ability estimates. Diagnostic procedures for detecting testlet effects typically involve model comparisons that test for the inclusion of extra testlet parameters or, at the item level, tests of pairwise LI. Rosenbaum's adaptation of the Mantel-Haenszel (MH)