
Journal of Educational Measurement: Latest Articles

Using GPT-4 to Augment Imbalanced Data for Automatic Scoring
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-11-19 | DOI: 10.1111/jedm.70020
Luyang Fang, Gyeonggeon Lee, Xiaoming Zhai

Machine learning-based automatic scoring faces challenges when student responses are imbalanced across scoring categories. To address this, we introduce a novel text data augmentation framework, tailored to imbalanced datasets in automatic scoring, that leverages GPT-4, a generative large language model. Our experimental dataset consisted of student-written responses to four science items. We crafted prompts for GPT-4 to generate additional responses, especially for minority scoring classes, to enhance the dataset. We then fine-tuned DistilBERT for automatic scoring on the augmented and original datasets. Model performance was assessed using accuracy, precision, recall, and F1 metrics. Our findings revealed that incorporating GPT-4-augmented data significantly improved model performance, particularly precision and F1 scores. Interestingly, the extent of improvement varied with the specific dataset and the proportion of augmented data used. Notably, the amount of augmented data required to achieve stable improvement in automatic scoring varied across items, ranging from 20% to 40%. Comparisons with models trained on additional student-written responses suggest that GPT-4-augmented models perform comparably to those trained on student data. This research highlights the potential and effectiveness of data augmentation with generative large language models such as GPT-4 for addressing imbalanced datasets in automated assessment.
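
As a rough illustration of the workflow the abstract describes, the sketch below generates extra minority-class responses with GPT-4 and fine-tunes DistilBERT on the combined data. It assumes the openai and Hugging Face transformers/datasets packages; the prompt wording, score labels, and the load_original_responses helper are placeholders, not the authors' materials.

```python
# Sketch: augment minority scoring classes with GPT-4, then fine-tune DistilBERT.
# The prompt text, score labels, and data loader are illustrative placeholders.
from openai import OpenAI
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

client = OpenAI()

def augment_minority_class(item_stem, example_responses, score_label, n_new=20):
    """Ask GPT-4 for synthetic student-style responses at a given score level."""
    prompt = (
        f"Science item: {item_stem}\n"
        f"Here are example student responses scored {score_label}:\n"
        + "\n".join(f"- {r}" for r in example_responses)
        + f"\nWrite {n_new} new, varied responses that a student scoring "
        f"{score_label} might plausibly give. One response per line."
    )
    out = client.chat.completions.create(
        model="gpt-4", temperature=0.9,
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip("- ").strip()
            for line in out.choices[0].message.content.splitlines() if line.strip()]

# Original (imbalanced) data plus augmented minority-class responses.
texts, labels = load_original_responses()          # hypothetical loader
new_texts = augment_minority_class(ITEM_STEM, minority_examples, score_label=2)
texts += new_texts
labels += [2] * len(new_texts)

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length"),
            batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(set(labels)))
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scoring_model", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds,
)
trainer.train()
```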

Citations: 0
Vertical Scaling with Moderated Nonlinear Factor Analysis
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-11-09 | DOI: 10.1111/jedm.70019
Sanford R. Student

Vertical scales are intended to establish a common metric for scores on test forms targeting different levels of development in a specified domain. They are often constructed using common-item, nonequivalent-groups designs that implicitly rely on the linking items being effectively free of differential item functioning (DIF), or on the DIF being symmetric, to produce unbiased linking constants. Moderated Nonlinear Factor Analysis (MNLFA) is a measurement model that can be used to understand both the presence of DIF among vertical scale common items and the extent to which that DIF may affect grade-to-grade score distributions. Monte Carlo simulation and synthetic data applications show that models that do and do not account for DIF in vertical scale common items can produce meaningfully different answers to the fundamental question of how much students grow from one grade to the next. When DIF is not present, however, MNLFA provides effectively identical growth estimates to traditional concurrent and characteristic-curve approaches to vertical linking.
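
For concreteness, a generic MNLFA specification for a dichotomous linking item is sketched below; this is the standard formulation in which a covariate (for example, grade or group membership) moderates the item intercept, the loading, and the factor mean and variance, and it is not necessarily the exact parameterization used in the article.

```latex
% Generic MNLFA for a dichotomous common item i and examinee j with covariate x_j.
P(Y_{ij}=1 \mid \eta_j, x_j)
  = \operatorname{logit}^{-1}\!\big( \nu_i + \kappa_i x_j
      + (\lambda_i + \gamma_i x_j)\,\eta_j \big),
\qquad
\eta_j \mid x_j \sim N\!\big(\alpha_0 + \alpha_1 x_j,\;
      \psi_0 \exp(\beta_1 x_j)\big).
```

Here a nonzero κ_i indicates uniform (intercept) DIF on a linking item and a nonzero γ_i indicates nonuniform (loading) DIF, while the moderated factor mean and variance carry the grade-to-grade growth the vertical scale is meant to recover.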

Citations: 0
A Quantitative Method for Evaluating the Predictive Utility of Linked Scores
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-11-04 | DOI: 10.1111/jedm.70018
Yoshikazu Sato, Tadashi Shibayama

In U.S. colleges, admissions officers tend to use ACT-SAT concordant scores, also known as linked scores, as predictions of individual scores for tests not taken. The major problem in this situation is the use of linked scores without thoroughly examining their predictive utility (i.e., the degree to which they serve as predicted scores at the individual level). To address this problem, we developed a method, referred to as the “predictive utility analysis,” for quantitatively evaluating the prediction accuracy and error properties of linked scores. A Monte Carlo simulation provided several findings on how the indices formulated in this paper behave with respect to the number of common examinees, the number of items, and the correlation between tests. Furthermore, we illustrated the predictive utility analysis in both concordance and equating settings using results from an actual large-scale test, the Japan Law School Admission Test. In both examples, we found that linked scores obtained with the equipercentile or linear equating method could be used as predictions of individual scores. Our findings suggest that the predictive utility analysis offers practical guidance for enhancing the use of linked scores as well as supporting institutional accountability.
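
A toy numeric sketch of the general mechanics is given below: link test X to test Y by equipercentile concordance among common examinees, then summarize how well the linked scores predict observed Y scores. Plain RMSE and bias stand in for the paper's more refined indices, and the simulated score scales are arbitrary.

```python
# Toy sketch of checking the predictive utility of linked scores: build an
# equipercentile concordance from X to Y among common examinees, then summarize
# the prediction error of the linked scores with simple RMSE and bias.
import numpy as np

def equipercentile_link(x, y):
    """Map each observed X score point to the Y score at the same percentile rank."""
    xs = np.sort(np.unique(x))
    pr = np.array([np.mean(x <= s) for s in xs])   # percentile rank of each X score
    ys = np.quantile(y, pr)                        # matching Y quantiles
    return dict(zip(xs, ys))

rng = np.random.default_rng(0)
true_ability = rng.normal(size=2000)
x = np.clip(np.round(30 + 8 * true_ability + rng.normal(0, 3, 2000)), 0, 60)
y = np.clip(np.round(500 + 90 * true_ability + rng.normal(0, 40, 2000)), 200, 800)

link = equipercentile_link(x, y)
y_pred = np.array([link[s] for s in x])            # linked ("concordant") scores

rmse = np.sqrt(np.mean((y_pred - y) ** 2))
bias = np.mean(y_pred - y)
print(f"RMSE of linked scores as predictions: {rmse:.1f}")
print(f"Mean signed error (bias): {bias:.1f}")
```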

Citations: 0
Identifying Features Contributing to Differential Prediction Bias of Automated Scoring Systems
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-11-03 | DOI: 10.1111/jedm.70015
Ikkyu Choi, Matthew S. Johnson

Automated scoring systems provide multiple benefits but also pose challenges, notably potential bias. Various methods exist to evaluate these algorithms and their outputs for bias. Upon detecting bias, the next logical step is to investigate its cause, often by examining feature distributions. Recently, Johnson and McCaffrey proposed an exploratory approach to identify features responsible for differential prediction bias. However, their approach applies only to linear additive prediction models, excluding many machine learning algorithms. In this paper, we propose the bias contribution measure, a statistic that extends Johnson and McCaffrey's approach to any prediction algorithm with partial derivatives and that can be implemented in any framework supporting automatic differentiation and matrix inversion. We demonstrated its application and effectiveness on synthetic and real-world data using multiple nonlinear prediction algorithms, including a single-layer feed-forward network (FFN), a support vector regressor, and a deep FFN with multiple hidden layers. In the synthetic data examples, the bias contribution measure successfully identified the feature responsible for the bias. When applied to a real-world data set, the bias contribution measure consistently identified the same set of features across all considered prediction algorithms.
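
The sketch below illustrates only the general first-order idea behind feature-level attribution of a group prediction gap (average partial derivative times group difference in feature means); it is not the bias contribution measure as defined in the paper, and finite differences stand in for automatic differentiation.

```python
# Rough sketch of attributing a group-level prediction gap to individual
# features via average partial derivative x group mean difference. This is
# only the generic first-order idea, NOT the paper's bias contribution measure;
# numeric finite differences stand in for autodiff.
import numpy as np

def feature_attributions(predict, X_ref, X_focal, eps=1e-4):
    """predict: callable mapping an (n, p) feature array to predicted scores."""
    X_all = np.vstack([X_ref, X_focal])
    base = predict(X_all)
    p = X_all.shape[1]
    grads = np.empty((X_all.shape[0], p))
    for k in range(p):                       # numeric partial derivatives
        X_pert = X_all.copy()
        X_pert[:, k] += eps
        grads[:, k] = (predict(X_pert) - base) / eps
    mean_grad = grads.mean(axis=0)
    mean_diff = X_focal.mean(axis=0) - X_ref.mean(axis=0)
    contrib = mean_grad * mean_diff          # per-feature share of the gap
    gap = predict(X_focal).mean() - predict(X_ref).mean()
    return contrib, gap

# Example with a deliberately nonlinear prediction function.
rng = np.random.default_rng(1)
X_ref = rng.normal(0.0, 1.0, size=(500, 3))
X_focal = X_ref + np.array([0.5, 0.0, -0.2])   # focal group shifted on features 0 and 2
f = lambda X: 2 * X[:, 0] + np.tanh(X[:, 1]) + 0.5 * X[:, 2] ** 2
contrib, gap = feature_attributions(f, X_ref, X_focal)
print("per-feature contributions:", np.round(contrib, 3), " total gap:", round(gap, 3))
```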

Citations: 0
Finding Words Associated with DIF: Predicting Differential Item Functioning Using LLMs and Explainable AI
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-10-31 | DOI: 10.1111/jedm.70017
Hotaka Maeda PhD, Yikai Lu (EK) PhD

We fine-tuned and compared several encoder-based Transformer large language models (LLMs) to predict differential item functioning (DIF) from item text. We then applied explainable artificial intelligence (XAI) methods to identify specific words associated with the DIF predictions. The data included 42,180 items designed for English language arts and mathematics summative state assessments for students in grades 3 to 11. Prediction R² ranged from .04 to .32 across eight focal and reference group pairs. Our findings suggest that many words associated with DIF reflect minor subdomains included in the test blueprint by design, rather than construct-irrelevant content that may need to be removed from assessments. This may explain why qualitative reviews of DIF items often yield inconclusive results. Our approach can be used to (1) screen words associated with DIF during the item-writing process for immediate revision to reduce preventable adverse DIF, (2) assist traditional DIF item reviews by highlighting key words, or (3) provide DIF predictions as an alternative when obtaining a sufficient sample size for traditional DIF analyses is impossible. Extensions of this research can enhance assessment fairness, especially for programs that lack resources to build high-quality items and for smaller subpopulations with insufficient sample sizes for traditional DIF analyses. See source code here.
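
As a simplified illustration of the word-identification step, the sketch below assumes an encoder model already fine-tuned with a one-output regression head to predict a DIF statistic from item text, and scores each word by how much deleting it changes the predicted DIF. Occlusion is used as a stand-in for the XAI methods in the paper, and the model directory and item text are placeholders.

```python
# Sketch: rank the words in an item by how much removing each one changes the
# DIF value predicted by a fine-tuned encoder regressor. Occlusion is a simple
# stand-in for the XAI methods in the paper; MODEL_DIR and the item text are
# hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "dif-regressor"                 # hypothetical fine-tuned model (num_labels=1)
tok = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def predicted_dif(text: str) -> float:
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**enc).logits.item()

def word_attributions(item_text: str):
    words = item_text.split()
    base = predicted_dif(item_text)
    scores = []
    for i in range(len(words)):
        occluded = " ".join(words[:i] + words[i + 1:])   # drop one word
        scores.append((words[i], base - predicted_dif(occluded)))
    return base, sorted(scores, key=lambda s: abs(s[1]), reverse=True)

base, ranked = word_attributions(
    "A farmer sells 3 bushels of wheat at the county market for $12 each ...")
print(f"predicted DIF: {base:+.3f}")
for word, delta in ranked[:5]:
    print(f"{word:>10s}  contribution {delta:+.3f}")
```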

Citations: 0
An Investigation Into Item Calibration in Multidimensional Multistage Testing
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-10-26 | DOI: 10.1111/jedm.70016
Xi Wang, Catherine Welch

This study builds on prior research on adaptive testing by examining the performance of item calibration methods in the context of multidimensional multistage tests with within-item multidimensionality. Building on the adaptive module-level approach, where test-takers proceed through customized modules based on their initial performance, this research investigates how different calibration methods perform under certain conditions. Specifically, the study evaluates three calibration methods—concurrent calibration, fixed item parameter calibration, and concurrent calibration with multiple panels—within a multidimensional multistage test framework. Using computer simulations, the study assesses ability and item parameter recovery across various conditions, including sample size, correlations among dimensions, and routing stage length. Across 36 simulation conditions, each replicated 10 times, results show that although calibration methods exert minimal influence on item and ability parameter estimates, the correlation among dimensions plays a significant role in both item and ability estimation. Additionally, sample size and routing stage length notably impact the estimation of item discrimination parameters. This study lays the foundation for further research and practical advancements in multidimensional multistage testing, offering a starting point for refining and innovating testing practices.
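
As a minimal illustration of the two ingredients combined here, the sketch below shows a compensatory multidimensional 2PL item with within-item multidimensionality and a module-level routing rule driven by a provisional ability estimate; all parameters and cut points are arbitrary, and the crude provisional estimate stands in for proper EAP scoring.

```python
# Sketch of the building blocks: a compensatory multidimensional 2PL item that
# loads on both dimensions (within-item multidimensionality) and a simple
# module-level routing rule. Parameters and cut points are illustrative only.
import numpy as np

def m2pl_prob(theta, a, d):
    """P(correct) for a compensatory M2PL item; theta and a are length-2 vectors."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

def route(provisional_theta, cut=0.0):
    """Send the examinee to the easy or hard second-stage module."""
    return "hard_module" if provisional_theta.mean() > cut else "easy_module"

rng = np.random.default_rng(2)
theta = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]])   # correlated dimensions
routing_items = [(np.array([1.2, 0.4]), 0.0),   # a-vector loads on both dimensions
                 (np.array([0.3, 1.1]), -0.5),
                 (np.array([0.9, 0.9]),  0.5)]
responses = [rng.random() < m2pl_prob(theta, a, d) for a, d in routing_items]
# Crude provisional estimate from the routing stage (stands in for EAP scoring).
provisional = np.repeat(np.mean(responses) * 2 - 1, 2)
print("routing to:", route(provisional))
```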

Citations: 0
Automated Coding of Communications in Collaborative Problem-Solving Tasks Using ChatGPT
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-10-23 | DOI: 10.1111/jedm.70014
Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi, Lei Liu, Michael Flor

Collaborative problem solving is widely recognized as a critical 21st-century skill. Assessing collaborative problem solving depends on coding the communication data using a construct-relevant framework, and this process has long been a major bottleneck to scaling up such assessments. Based on five datasets and two coding frameworks, we demonstrate that ChatGPT can code communication data to a satisfactory level, though performance varies across ChatGPT models and depends on the coding framework and task characteristics. Interestingly, newer reasoning-focused models, such as GPT-o1-mini and GPT-o3-mini, do not necessarily yield better coding results. Additionally, we show that refining prompts based on feedback from miscoded cases can improve coding accuracy in some instances, though the effectiveness of this approach is not consistent across all tasks. These findings offer practical guidance for researchers and practitioners in developing scalable, efficient methods to analyze communication data in support of 21st-century skill assessment.
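
A minimal sketch of the coding step is shown below, assuming the openai Python client; the category list and prompt wording are illustrative placeholders rather than the coding frameworks used in the study.

```python
# Sketch: assign each chat utterance from a collaborative task to one category
# of a coding framework using a ChatGPT model. Categories and prompt wording
# are placeholders, not the frameworks used in the study.
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["sharing information", "negotiating", "regulating",
              "maintaining communication", "off-task"]

def code_utterance(utterance: str, context: list[str], model: str = "gpt-4o") -> str:
    prompt = (
        "You are coding chat messages from a collaborative problem-solving task.\n"
        f"Allowed codes: {', '.join(CATEGORIES)}.\n"
        "Recent context:\n" + "\n".join(context[-5:]) + "\n"
        f'Message to code: "{utterance}"\n'
        "Reply with exactly one code from the list."
    )
    resp = client.chat.completions.create(
        model=model, temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()

transcript = ["A: I think the circuit needs a second battery",
              "B: wait, what does the voltmeter say?"]
print(code_utterance("Let's try adding the battery and re-measure", transcript))
```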

Citations: 0
Measuring the Accuracy of True Score Predictions for AI Scoring Evaluation
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-10-12 | DOI: 10.1111/jedm.70011
Daniel F. McCaffrey, Jodi M. Casabianca, Matthew S. Johnson

Use of artificial intelligence (AI) to score responses is growing in popularity and likely to increase. Evidence for the validity of such scores commonly relies on quadratic weighted kappa (QWK) to demonstrate agreement between AI scores and human ratings. QWK is a measure of agreement that accounts for chance agreement and the ordinality of the data by giving greater weight to larger disagreements. It has known shortcomings, including sensitivity to human rating reliability. The proportional reduction in mean squared error (PRMSE) is a measure of agreement between predictions and their target that accounts for measurement error in the target; for example, it can quantify the accuracy of an automated scoring model in predicting human true scores rather than observed ratings. Extensive simulation study results show that PRMSE is robust to many factors to which QWK is sensitive, such as human rater reliability, skew in the data, and the number of score points. Analysis of operational test data demonstrates that QWK and PRMSE can lead to different conclusions about AI scores. We investigate sample size requirements for accurate estimation of PRMSE in the context of AI scoring, although the results may apply more generally to measures with distributions similar to those tested in our study.
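
For reference, one common way to estimate PRMSE when each response carries two human ratings is sketched below, following the general Haberman-style logic of treating the covariance of the two ratings as the true-score variance; it is a generic estimator for illustration and not necessarily the exact formulation used in the article.

```python
# Sketch: estimate PRMSE from double-scored responses. The covariance of the
# two human ratings estimates the true-score variance, and the error variance
# of the two-rating mean is removed from the machine-vs-human MSE. Generic
# estimator for illustration, not necessarily the article's exact formulation.
import numpy as np

def prmse(machine, h1, h2):
    """machine: AI scores; h1, h2: two independent human ratings per response."""
    h_bar = (h1 + h2) / 2.0
    var_true = np.cov(h1, h2)[0, 1]              # Cov(H1, H2) ~ Var(true score)
    var_err_single = np.var(h1 - h2, ddof=1) / 2.0
    var_err_mean = var_err_single / 2.0          # error variance of the 2-rating mean
    mse_vs_true = np.mean((machine - h_bar) ** 2) - var_err_mean
    return 1.0 - mse_vs_true / var_true

rng = np.random.default_rng(3)
true = rng.normal(3.0, 1.0, 5000)                        # latent true scores
h1 = true + rng.normal(0, 0.7, 5000)                     # two noisy human ratings
h2 = true + rng.normal(0, 0.7, 5000)
machine = 0.9 * true + 0.3 + rng.normal(0, 0.5, 5000)    # AI scores
print(f"PRMSE = {prmse(machine, h1, h2):.3f}")
```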

Citations: 0
Two-Phase Content-Balancing CD-CAT Online Item Calibration
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-10-08 | DOI: 10.1111/jedm.70012
Jing Huang, Yuxiao Zhang, Jason W. Morphew, Jayson M. Nissen, Ben Van Dusen, Hua Hua Chang

Online calibration estimates new item parameters alongside previously calibrated items, supporting efficient item replenishment. However, most existing online calibration procedures for Cognitive Diagnostic Computerized Adaptive Testing (CD-CAT) lack mechanisms to ensure content balance during live testing. This limitation can lead to uneven content coverage, potentially undermining the alignment with instructional goals. This research extends the current calibration framework by integrating a two-phase test design with a content-balancing item selection method into the online calibration procedure. Simulation studies evaluated item parameter recovery and attribute profile estimation accuracy under the proposed procedure. Results indicated that the developed procedure yielded more accurate new item parameter estimates. The procedure also maintained content representativeness under both balanced and unbalanced constraints. Attribute profile estimation was sensitive to item parameter values. Accuracy declined when items had larger parameter values. Calibration improved with larger sample sizes and smaller parameter values. Longer test lengths contributed more to profile estimation than to new item calibration. These findings highlight design trade-offs in adaptive item replenishment and suggest new directions for hybrid calibration methods.
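
As a small illustration of the content-balancing ingredient only (not the two-phase calibration design itself), the sketch below restricts item selection to the content area currently furthest below its target proportion and then picks the most informative candidate within it; the areas, targets, and information values are arbitrary.

```python
# Sketch of content-balanced item selection: pick the content area with the
# largest shortfall relative to its target proportion, then choose the most
# informative candidate item within it. Illustrates the balancing constraint
# only, not the paper's two-phase online calibration design.
import numpy as np

def select_item(candidates, administered, targets):
    """candidates: dicts with 'id', 'area', 'info'; targets: area -> target proportion."""
    n = max(len(administered), 1)
    counts = {a: sum(1 for it in administered if it["area"] == a) for a in targets}
    deficit = {a: targets[a] - counts[a] / n for a in targets}   # shortfall per area
    area = max(deficit, key=deficit.get)
    pool = [it for it in candidates if it["area"] == area] or candidates
    return max(pool, key=lambda it: it["info"])

targets = {"algebra": 0.4, "geometry": 0.3, "statistics": 0.3}
candidates = [{"id": 1, "area": "algebra", "info": 0.8},
              {"id": 2, "area": "geometry", "info": 1.1},
              {"id": 3, "area": "statistics", "info": 0.6}]
print(select_item(candidates, administered=[{"area": "algebra"}], targets=targets))
```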

Citations: 0
IRT Scoring and Recursion for Estimating Reliability and Other Accuracy Indices
IF 1.6 | CAS Quartile 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-09-28 | DOI: 10.1111/jedm.70008
Tim Moses, YoungKoung Kim

This study considers the estimation of marginal reliability and conditional accuracy measures using a generalized recursion procedure with several IRT-based ability and score estimators. The estimators include MLE, TCC, and EAP abilities, and corresponding test scores obtained with different weightings of the item scores. We consider reliability estimates for 1-, 2-, and 3-parameter logistic IRT models (1PL, 2PL, and 3PL) for tests of dichotomously scored items, using IRT calibrations from two datasets. The generalized recursion procedure is shown to produce conditional probability distributions for the considered IRT estimators that can be used in the estimation of marginal reliabilities and conditional accuracies (biases and CSEMs). These reliabilities and conditional accuracies are shown to have less extreme and more plausible values compared to theoretical approaches based on test information. The proposed recursion procedure for the estimation of reliability and other accuracy measures are demonstrated for testing situations involving different test lengths, IRT models, and different types of IRT parameter inaccuracies.
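
The classic Lord-Wingersky recursion for summed scores conveys the core computation: build the conditional score distribution at each ability point, then aggregate over a prior to obtain conditional error variances and a marginal reliability. The paper's generalized recursion extends this to other IRT-based score estimators; the sketch below covers only the standard dichotomous summed-score case with illustrative 3PL parameters.

```python
# Sketch: Lord-Wingersky recursion for the conditional distribution of summed
# scores given theta under a 3PL model, used to compute conditional error
# variances (CSEM^2) and a marginal reliability under a normal ability prior.
# Illustrative item parameters; the paper's generalized recursion covers other
# IRT score estimators beyond this standard summed-score case.
import numpy as np

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))

def lord_wingersky(probs):
    """probs: per-item P(correct) at one theta; returns P(summed score = k)."""
    dist = np.array([1.0])
    for p in probs:
        dist = (np.concatenate([dist * (1 - p), [0.0]])
                + np.concatenate([[0.0], dist * p]))
    return dist

rng = np.random.default_rng(4)                      # illustrative 20-item 3PL test
a, b, c = rng.uniform(0.8, 1.6, 20), rng.normal(0, 1, 20), rng.uniform(0.1, 0.25, 20)

nodes, weights = np.polynomial.hermite_e.hermegauss(41)   # N(0,1) quadrature
weights = weights / weights.sum()

scores = np.arange(21)
cond_mean, cond_var = [], []
for theta in nodes:
    dist = lord_wingersky(p3pl(theta, a, b, c))
    m = np.sum(scores * dist)
    cond_mean.append(m)
    cond_var.append(np.sum((scores - m) ** 2 * dist))   # conditional error variance

cond_mean, cond_var = np.array(cond_mean), np.array(cond_var)
marginal_mean = np.sum(weights * cond_mean)
var_true = np.sum(weights * (cond_mean - marginal_mean) ** 2)
var_obs = var_true + np.sum(weights * cond_var)
print(f"marginal reliability = {var_true / var_obs:.3f}")
print(f"CSEM at theta = 0    = {np.sqrt(cond_var[np.argmin(np.abs(nodes))]):.2f}")
```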

Citations: 0