
Latest Articles in the Journal of Educational Measurement

IRT Observed-Score Equating for Rater-Mediated Assessments Using a Hierarchical Rater Model
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-01-13 | DOI: 10.1111/jedm.12425
Tong Wu, Stella Y. Kim, Carl Westine, Michelle Boyer

While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. Differences in standard error between the methods were minimal under most conditions.
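The general mechanics of IRT observed-score equating can be illustrated with a short sketch. The snippet below is not the authors' HRM-based procedure: it uses dichotomous 2PL items and made-up item parameters to show the two core steps, building each form's model-implied observed-score distribution with the Lord-Wingersky recursion and then linking the forms by equipercentile equating.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def observed_score_dist(theta, a, b):
    """Lord-Wingersky recursion: P(X = x | theta) for dichotomous items."""
    probs = p_correct(theta, a, b)
    dist = np.array([1.0])                      # distribution for a zero-item test
    for p in probs:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1 - p)              # item answered incorrectly
        new[1:] += dist * p                     # item answered correctly
        dist = new
    return dist

def marginal_score_dist(a, b, grid, weights):
    """Average the conditional score distributions over an ability grid."""
    dists = np.array([observed_score_dist(t, a, b) for t in grid])
    return weights @ dists

def equipercentile(px, py):
    """Map each Form X score to the Form Y score with the same percentile rank."""
    Fx, Fy = np.cumsum(px), np.cumsum(py)
    return np.interp(Fx, Fy, np.arange(len(py)))

# Hypothetical item parameters for two 10-item forms (illustration only).
rng = np.random.default_rng(1)
aX, bX = rng.uniform(0.8, 1.6, 10), rng.normal(0.0, 1.0, 10)
aY, bY = rng.uniform(0.8, 1.6, 10), rng.normal(0.2, 1.0, 10)

grid = np.linspace(-4, 4, 81)
weights = np.exp(-0.5 * grid**2)
weights /= weights.sum()

px = marginal_score_dist(aX, bX, grid, weights)
py = marginal_score_dist(aY, bY, grid, weights)
print(np.round(equipercentile(px, py), 2))      # Form Y equivalents of Form X scores 0..10
```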

{"title":"IRT Observed-Score Equating for Rater-Mediated Assessments Using a Hierarchical Rater Model","authors":"Tong Wu,&nbsp;Stella Y. Kim,&nbsp;Carl Westine,&nbsp;Michelle Boyer","doi":"10.1111/jedm.12425","DOIUrl":"https://doi.org/10.1111/jedm.12425","url":null,"abstract":"<p>While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. Differences in standard error between the methods were minimal under most conditions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"145-171"},"PeriodicalIF":1.4,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Note on the Use of Categorical Subscores
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2025-01-07 | DOI: 10.1111/jedm.12423
Kylie Gorney, Sandip Sinharay

Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim by examining (a) the agreement between true and observed subscore classifications and (b) the agreement between subscore classifications across parallel forms of a test. Results show that the categorical subscores of Feinberg and von Davier are often inaccurate and/or inconsistent, pointing to a lack of justification for using them for remediation or instructional purposes.
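As a minimal illustration of the kind of agreement analysis described here, the sketch below cross-classifies simulated categorical subscores from two parallel forms and reports proportion agreement and Cohen's kappa. The cut scores, score scale, and data are hypothetical, not those used by the authors.

```python
import numpy as np

def classify(subscores, cuts):
    """Assign each numeric subscore to a category via cut scores (0, 1, 2 = low, medium, high)."""
    return np.digitize(subscores, cuts)

def agreement(c1, c2, n_cat):
    """Proportion agreement and Cohen's kappa between two classifications."""
    table = np.zeros((n_cat, n_cat))
    for i, j in zip(c1, c2):
        table[i, j] += 1
    table /= table.sum()
    po = np.trace(table)                                     # observed agreement
    pe = table.sum(axis=1) @ table.sum(axis=0)               # chance agreement
    return po, (po - pe) / (1 - pe)

# Simulated subscores on two parallel forms sharing a common true subscore (illustration only).
rng = np.random.default_rng(7)
true = rng.normal(size=2000)
form1 = true + rng.normal(scale=0.8, size=2000)              # noisy observed subscores, form 1
form2 = true + rng.normal(scale=0.8, size=2000)              # noisy observed subscores, form 2

cuts = [-0.5, 0.5]                                           # hypothetical cut scores
po, kappa = agreement(classify(form1, cuts), classify(form2, cuts), n_cat=3)
print(f"agreement across forms: {po:.2f}, kappa: {kappa:.2f}")
```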

{"title":"A Note on the Use of Categorical Subscores","authors":"Kylie Gorney,&nbsp;Sandip Sinharay","doi":"10.1111/jedm.12423","DOIUrl":"https://doi.org/10.1111/jedm.12423","url":null,"abstract":"<p>Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim by examining (a) the agreement between true and observed subscore classifications and (b) the agreement between subscore classifications across parallel forms of a test. Results show that the categorical subscores of Feinberg and von Davier are often inaccurate and/or inconsistent, pointing to a lack of justification for using them for remediation or instructional purposes.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"101-119"},"PeriodicalIF":1.4,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12423","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Exploratory Study Using Innovative Graphical Network Analysis to Model Eye Movements in Spatial Reasoning Problem Solving
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2024-12-20 | DOI: 10.1111/jedm.12421
Kaiwen Man, Joni M. Lakin

Eye-tracking procedures generate copious process data that could be valuable in establishing the response processes component of modern validity theory. However, there is a lack of tools for assessing and visualizing response processes using process data such as eye-tracking fixation sequences, especially those suitable for young children. This study, which explored student responses to a spatial reasoning task, employed eye tracking and social network analysis to model, examine, and visualize students' visual transition patterns while solving spatial problems to begin to elucidate these processes. Fifty students in Grades 2–8 completed a spatial reasoning task as eye movements were recorded. Areas of interest (AoIs) were defined within the task for each spatial reasoning question. Transition networks between AoIs were constructed and analyzed using selected network measures. Results revealed shared transition sequences across students as well as strategic differences between high and low performers. High performers demonstrated more integrated transitions between AoIs, while low performers considered information more in isolation. Additionally, age and the interaction of age and performance did not significantly impact these measures. The study demonstrates a novel modeling approach for investigating visual processing and provides initial evidence that high-performing students more deeply engage with visual information in solving these types of questions.
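A small sketch of the transition-network idea, assuming the networkx library is available: consecutive fixations on different AoIs are counted as directed transitions, and simple network measures are computed on the resulting weighted graph. The AoI labels and fixation sequence below are invented for illustration and do not come from the study.

```python
import networkx as nx
from collections import Counter

# Hypothetical fixation sequence over labeled areas of interest (AoIs) for one item.
fixations = ["stem", "optionA", "stem", "optionB", "optionB", "stem", "optionC", "optionA"]

# Count directed transitions between consecutive fixations on different AoIs.
transitions = Counter(
    (a, b) for a, b in zip(fixations, fixations[1:]) if a != b
)

# Build a weighted directed transition network.
G = nx.DiGraph()
for (a, b), count in transitions.items():
    G.add_edge(a, b, weight=count)

# A few simple network measures of the kind used to compare visual strategies.
print("density:", round(nx.density(G), 2))
print("weighted out-degree:", dict(G.out_degree(weight="weight")))
```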

{"title":"An Exploratory Study Using Innovative Graphical Network Analysis to Model Eye Movements in Spatial Reasoning Problem Solving","authors":"Kaiwen Man,&nbsp;Joni M. Lakin","doi":"10.1111/jedm.12421","DOIUrl":"https://doi.org/10.1111/jedm.12421","url":null,"abstract":"<p>Eye-tracking procedures generate copious process data that could be valuable in establishing the response processes component of modern validity theory. However, there is a lack of tools for assessing and visualizing response processes using process data such as eye-tracking fixation sequences, especially those suitable for young children. This study, which explored student responses to a spatial reasoning task, employed eye tracking and social network analysis to model, examine, and visualize students' visual transition patterns while solving spatial problems to begin to elucidate these processes. Fifty students in Grades 2–8 completed a spatial reasoning task as eye movements were recorded. Areas of interest (AoIs) were defined within the task for each spatial reasoning question. Transition networks between AoIs were constructed and analyzed using selected network measures. Results revealed shared transition sequences across students as well as strategic differences between high and low performers. High performers demonstrated more integrated transitions between AoIs, while low performers considered information more in isolation. Additionally, age and the interaction of age and performance did not significantly impact these measures. The study demonstrates a novel modeling approach for investigating visual processing and provides initial evidence that high-performing students more deeply engage with visual information in solving these types of questions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"710-739"},"PeriodicalIF":1.4,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Modeling Directional Testlet Effects on Multiple Open-Ended Questions
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2024-12-10 | DOI: 10.1111/jedm.12422
Kuan-Yu Jin, Wai-Lok Siu

Educational tests often have a cluster of items linked by a common stimulus (testlet). In such a design, the dependencies induced between items are called testlet effects. In particular, the directional testlet effect (DTE) refers to a recursive influence whereby responses to earlier items can positively or negatively affect the scores on later items. This study introduces an innovative measurement model to describe DTEs among multiple polytomously scored open-ended items. Through simulations, we found that (1) item and DTE parameters can be accurately recovered in Latent GOLD®, (2) ignoring positive (or negative) DTEs by fitting a standard item response theory model can result in the overestimation (or underestimation) of test reliability, (3) collapsing multiple items of a testlet into a super item is still effective in eliminating DTEs, (4) the popular multidimensional strategy of adding nuisance factors to describe item dependencies fails to account for DTEs adequately, and (5) fitting the proposed DTE model to testlet data involving nuisance factors detects positive DTEs but does not yield a better fit. Moreover, using the proposed model, we demonstrated the coexistence of positive and negative DTEs in a real history exam.
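The directional notion can be illustrated with a toy simulation that is not the authors' model: in the sketch below, success or failure on an earlier item in a testlet shifts the log-odds of success on the next item by a carry-over parameter, so the sign of that parameter plays the role of a positive or negative DTE. Items are dichotomous here for simplicity, unlike the polytomous open-ended items studied in the article, and all parameter values are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_testlet(theta, difficulties, dte):
    """Simulate dichotomous responses to a testlet in which success (failure) on an
    earlier item shifts the log-odds of success on the next item by +dte (-dte)."""
    responses, carry = [], 0.0
    for b in difficulties:
        logit = theta - b + carry
        y = rng.random() < 1.0 / (1.0 + np.exp(-logit))
        responses.append(int(y))
        carry = dte if y else -dte        # directional carry-over to the following item
    return responses

# Hypothetical 4-item testlet with a positive directional effect.
thetas = rng.normal(size=5000)
difficulties = [-0.5, 0.0, 0.3, 0.8]
data = np.array([simulate_testlet(t, difficulties, dte=0.6) for t in thetas])
print("item proportions correct:", data.mean(axis=0).round(2))
```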

{"title":"Modeling Directional Testlet Effects on Multiple Open-Ended Questions","authors":"Kuan-Yu Jin,&nbsp;Wai-Lok Siu","doi":"10.1111/jedm.12422","DOIUrl":"https://doi.org/10.1111/jedm.12422","url":null,"abstract":"<p>Educational tests often have a cluster of items linked by a common stimulus (<i>testlet</i>). In such a design, the dependencies caused between items are called <i>testlet effects</i>. In particular, the directional testlet effect (DTE) refers to a recursive influence whereby responses to earlier items can positively or negatively affect the scores on later items. This study aims to introduce an innovative measurement model to describe DTEs among multiple polytomouslyscored open-ended items. Through simulations, we found that (1) item and DTE parameters can be accurately recovered in Latent GOLD<sup>®</sup>, (2) ignoring positive (or negative) DTEs by fitting a standard item response theory model can result in the overestimation (or underestimation) of test reliability, (3) collapsing multiple items of a testlet into a super item is still effective in eliminating DTEs, (4) the popular multidimensional strategy of adding nuisance factors to describe item dependencies fails to account for DTE adequately, and (5) fitting the proposed model for DTE to testlet data involving nuisance factors will observe positive DTEs but will not have a better fit. Moreover, using the proposed model, we demonstrated the coexistence of positive and negative DTEs in a real history exam.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"81-100"},"PeriodicalIF":1.4,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Differences in Time Usage as a Competing Hypothesis for Observed Group Differences in Accuracy with an Application to Observed Gender Differences in PISA Data
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2024-11-01 | DOI: 10.1111/jedm.12419
Radhika Kapoor, Erin Fahle, Klint Kanopka, David Klinowski, Ana Trindade Ribeiro, Benjamin W. Domingue

Group differences in test scores are a key metric in education policy. Response time offers novel opportunities for understanding these differences, especially in low-stakes settings. Here, we describe how observed group differences in test accuracy can be attributed to group differences in latent response speed or group differences in latent capacity, where capacity is defined as expected accuracy for a given response speed. This article introduces a method for decomposing observed group differences in accuracy into these differences in speed versus differences in capacity. We first illustrate in simulation studies that this approach can reliably distinguish between group speed and capacity differences. We then use this approach to probe gender differences in science and reading fluency in PISA 2018 for 71 countries. In science, score differentials largely increase when males, who respond more rapidly, are the higher performing group and decrease when females, who respond more slowly, are the higher performing group. In reading fluency, score differentials decrease where females, who respond more rapidly, are the higher performing group. This method can be used to analyze group differences especially in low-stakes assessments where there are potential group differences in speed.
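A schematic version of the decomposition idea, not the authors' exact model, can be written in a few lines assuming scikit-learn is available: fit each group's accuracy as a function of log response time, define capacity as the predicted accuracy at a common speed profile, and attribute the remainder of the observed accuracy gap to differences in time usage. The groups, effect sizes, and data below are simulated purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)

def simulate_group(n, mean_logtime, capacity_shift):
    """Accuracy depends on response speed plus a group-specific capacity shift."""
    log_time = rng.normal(mean_logtime, 0.5, n)
    logit = 0.8 * (log_time - 3.0) + capacity_shift     # slower responses tend to be more accurate
    correct = rng.random(n) < 1 / (1 + np.exp(-logit))
    return log_time.reshape(-1, 1), correct.astype(int)

# Hypothetical groups: group B responds faster but has the same capacity curve.
XA, yA = simulate_group(4000, mean_logtime=3.2, capacity_shift=0.0)
XB, yB = simulate_group(4000, mean_logtime=2.8, capacity_shift=0.0)

modelA = LogisticRegression().fit(XA, yA)
modelB = LogisticRegression().fit(XB, yB)

observed_gap = yA.mean() - yB.mean()
# Capacity component: difference in predicted accuracy at a common (pooled) speed profile.
common_speed = np.vstack([XA, XB])
capacity_gap = (modelA.predict_proba(common_speed)[:, 1].mean()
                - modelB.predict_proba(common_speed)[:, 1].mean())
speed_gap = observed_gap - capacity_gap                 # remainder attributed to time usage

print(f"observed gap {observed_gap:.3f} = capacity {capacity_gap:.3f} + speed {speed_gap:.3f}")
```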

{"title":"Differences in Time Usage as a Competing Hypothesis for Observed Group Differences in Accuracy with an Application to Observed Gender Differences in PISA Data","authors":"Radhika Kapoor,&nbsp;Erin Fahle,&nbsp;Klint Kanopka,&nbsp;David Klinowski,&nbsp;Ana Trindade Ribeiro,&nbsp;Benjamin W. Domingue","doi":"10.1111/jedm.12419","DOIUrl":"https://doi.org/10.1111/jedm.12419","url":null,"abstract":"<p>Group differences in test scores are a key metric in education policy. Response time offers novel opportunities for understanding these differences, especially in low-stakes settings. Here, we describe how observed group differences in test accuracy can be attributed to group differences in latent response speed or group differences in latent capacity, where capacity is defined as expected accuracy for a given response speed. This article introduces a method for decomposing observed group differences in accuracy into these differences in speed versus differences in capacity. We first illustrate in simulation studies that this approach can reliably distinguish between group speed and capacity differences. We then use this approach to probe gender differences in science and reading fluency in PISA 2018 for 71 countries. In science, score differentials largely increase when males, who respond more rapidly, are the higher performing group and decrease when females, who respond more slowly, are the higher performing group. In reading fluency, score differentials decrease where females, who respond more rapidly, are the higher performing group. This method can be used to analyze group differences especially in low-stakes assessments where there are potential group differences in speed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"682-709"},"PeriodicalIF":1.4,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143247456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Correction to “Expanding the Lognormal Response Time Model Using Profile Similarity Metrics to Improve the Detection of Anomalous Testing Behavior”
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2024-10-23 | DOI: 10.1111/jedm.12418

Hurtz, G.M., & Mucino, R. (2024). Expanding the lognormal response time model using profile similarity metrics to improve the detection of anomalous testing behavior. Journal of Educational Measurement, 61, 458–485. https://doi.org/10.1111/jedm.12395

We apologize for this error.

{"title":"Correction to “Expanding the Lognormal Response Time Model Using Profile Similarity Metrics to Improve the Detection of Anomalous Testing Behavior”","authors":"","doi":"10.1111/jedm.12418","DOIUrl":"https://doi.org/10.1111/jedm.12418","url":null,"abstract":"<p>Hurtz, G.M., &amp; Mucino, R. (2024). Expanding the lognormal response time model using profile similarity metrics to improve the detection of anomalous testing behavior. <i>Journal of Educational Measurement, 61</i>, 458–485. https://doi.org/10.1111/jedm.12395</p><p>We apologize for this error.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"780"},"PeriodicalIF":1.4,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12418","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Subscores: A Practical Guide to Their Production and Consumption. Shelby Haberman, Sandip Sinharay, Richard Feinberg, and Howard Wainer. Cambridge, Cambridge University Press 2024, 176 pp. (paperback)
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2024-10-18 | DOI: 10.1111/jedm.12417
Gautam Puhan
{"title":"Subscores: A Practical Guide to Their Production and Consumption. Shelby Haberman, Sandip Sinharay, Richard Feinberg, and Howard Wainer. Cambridge, Cambridge University Press 2024, 176 pp. (paperback)","authors":"Gautam Puhan","doi":"10.1111/jedm.12417","DOIUrl":"https://doi.org/10.1111/jedm.12417","url":null,"abstract":"","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"763-772"},"PeriodicalIF":1.4,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Using Keystroke Behavior Patterns to Detect Nonauthentic Texts in Writing Assessments: Evaluating the Fairness of Predictive Models
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2024-10-18 | DOI: 10.1111/jedm.12416
Yang Jiang, Mo Zhang, Jiangang Hao, Paul Deane, Chen Li

The emergence of sophisticated AI tools such as ChatGPT, coupled with the transition to remote delivery of educational assessments in the COVID-19 era, has led to increasing concerns about academic integrity and test security. Using AI tools, test takers can produce high-quality texts effortlessly and use them to game assessments. It is thus critical to detect these nonauthentic texts to ensure test integrity. In this study, we leveraged keystroke logs—recordings of every keypress—to build machine learning (ML) detectors of nonauthentic texts in a large-scale writing assessment. We focused on investigating the fairness of the detectors across demographic subgroups to ensure that nongenuine writing can be predicted equally well across subgroups. Results indicated that keystroke dynamics were effective in identifying nonauthentic texts. While the ML models were slightly more likely to misclassify the original responses submitted by male test takers as consisting of nonauthentic texts than those submitted by females, the effect sizes were negligible. Furthermore, balancing demographic distributions and class labels did not consistently mitigate detector bias across predictive models. Findings of this study not only provide implications for using behavioral data to address test security issues, but also highlight the importance of evaluating the fairness of predictive models in educational contexts.
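The following sketch illustrates the general workflow rather than the study's actual features or models, assuming scikit-learn is available: simple keystroke-timing features are extracted, a classifier is trained to flag nonauthentic responses, and false-positive rates are then compared across a hypothetical demographic subgroup label as a basic fairness check. The data, features, and subgroup variable are all simulated.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)

def keystroke_features(intervals):
    """Summarize one response's inter-key intervals (seconds) into simple features."""
    intervals = np.asarray(intervals)
    return [intervals.mean(), intervals.std(), (intervals > 1.0).mean()]  # tempo, variability, long-pause rate

def simulate(n, nonauthentic):
    """Toy assumption: nonauthentic (e.g., transcribed) typing is faster and more uniform."""
    scale = 0.15 if nonauthentic else 0.35
    return [keystroke_features(rng.exponential(scale, size=200)) for _ in range(n)]

X = np.array(simulate(2000, False) + simulate(2000, True))
y = np.array([0] * 2000 + [1] * 2000)                  # 1 = nonauthentic
group = rng.integers(0, 2, size=len(y))                # hypothetical demographic subgroup label

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, group, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Fairness check: false-positive rate (authentic responses flagged as nonauthentic) by subgroup.
for g in (0, 1):
    mask = (g_te == g) & (y_te == 0)
    print(f"subgroup {g} false-positive rate: {(pred[mask] == 1).mean():.3f}")
```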

{"title":"Using Keystroke Behavior Patterns to Detect Nonauthentic Texts in Writing Assessments: Evaluating the Fairness of Predictive Models","authors":"Yang Jiang,&nbsp;Mo Zhang,&nbsp;Jiangang Hao,&nbsp;Paul Deane,&nbsp;Chen Li","doi":"10.1111/jedm.12416","DOIUrl":"https://doi.org/10.1111/jedm.12416","url":null,"abstract":"<p>The emergence of sophisticated AI tools such as ChatGPT, coupled with the transition to remote delivery of educational assessments in the COVID-19 era, has led to increasing concerns about academic integrity and test security. Using AI tools, test takers can produce high-quality texts effortlessly and use them to game assessments. It is thus critical to detect these nonauthentic texts to ensure test integrity. In this study, we leveraged keystroke logs—recordings of every keypress—to build machine learning (ML) detectors of nonauthentic texts in a large-scale writing assessment. We focused on investigating the fairness of the detectors across demographic subgroups to ensure that nongenuine writing can be predicted equally well across subgroups. Results indicated that keystroke dynamics were effective in identifying nonauthentic texts. While the ML models were slightly more likely to misclassify the original responses submitted by male test takers as consisting of nonauthentic texts than those submitted by females, the effect sizes were negligible. Furthermore, balancing demographic distributions and class labels did not consistently mitigate detector bias across predictive models. Findings of this study not only provide implications for using behavioral data to address test security issues, but also highlight the importance of evaluating the fairness of predictive models in educational contexts.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"571-594"},"PeriodicalIF":1.4,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Detecting Differential Item Functioning among Multiple Groups Using IRT Residual DIF Framework
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2024-10-17 | DOI: 10.1111/jedm.12415
Hwanggyu Lim, Danqi Zhu, Edison M. Choe, Kyung (Chris) T. Han
This study presents a generalized version of the residual differential item functioning (RDIF) detection framework in item response theory, named GRDIF, to analyze differential item functioning (DIF) in multiple groups. The GRDIF framework retains the advantages of the original RDIF framework, such as computational efficiency and ease of implementation. The performance of GRDIF was assessed through a simulation study and compared with existing DIF detection methods, including the generalized Mantel-Haenszel, Lasso-DIF, and alignment methods. Results showed that the GRDIF framework demonstrated well-controlled Type I error rates close to the nominal level of .05 and satisfactory power in detecting uniform, nonuniform, and mixed DIF across different simulated conditions. Each of the three GRDIF statistics, GRDIF_R, GRDIF_S, and GRDIF_RS, effectively detected the specific type of DIF for which it was designed, with GRDIF_RS exhibiting the most robust performance across all types of DIF. The GRDIF framework outperformed the other DIF detection methods under various conditions, indicating its potential for practical applications, particularly in large-scale assessments involving multiple groups. In addition, an empirical study using a high-stakes assessment dataset demonstrated the effectiveness and practicality of the GRDIF framework for DIF analysis.
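The residual idea underlying this family of statistics can be sketched informally; the code below is a schematic illustration, not the GRDIF statistics as defined in the article. Raw residuals (observed response minus model-implied probability) tend to shift in one direction for a group disadvantaged by uniform DIF, while squared residuals are sensitive to nonuniform effects. The item parameters, ability values, and DIF effect are simulated for illustration.

```python
import numpy as np

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def residual_dif_signals(u, theta, a, b, groups):
    """Schematic residual-based DIF signals for one studied item.

    u: 0/1 responses to the studied item; theta: ability estimates;
    a, b: the item's parameters calibrated on the combined data;
    groups: integer group labels. Returns per-group means of raw and squared residuals."""
    resid = u - p2pl(theta, a, b)
    raw = {g: resid[groups == g].mean() for g in np.unique(groups)}
    squared = {g: (resid[groups == g] ** 2).mean() for g in np.unique(groups)}
    return raw, squared

# Simulated example: the studied item is harder for group 1 (uniform DIF), illustration only.
rng = np.random.default_rng(13)
n = 3000
groups = rng.integers(0, 2, n)
theta = rng.normal(size=n)
b_true = np.where(groups == 1, 0.4, 0.0)                 # group 1 faces a shifted difficulty
u = (rng.random(n) < p2pl(theta, 1.2, b_true)).astype(int)

raw, squared = residual_dif_signals(u, theta, a=1.2, b=0.0, groups=groups)
print("mean raw residual by group:", {int(g): round(v, 3) for g, v in raw.items()})
print("mean squared residual by group:", {int(g): round(v, 3) for g, v in squared.items()})
```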
{"title":"Detecting Differential Item Functioning among Multiple Groups Using IRT Residual DIF Framework","authors":"Hwanggyu Lim,&nbsp;Danqi Zhu,&nbsp;Edison M. Choe,&nbsp;KyungT. Han,&nbsp;Chris","doi":"10.1111/jedm.12415","DOIUrl":"https://doi.org/10.1111/jedm.12415","url":null,"abstract":"&lt;p&gt;This study presents a generalized version of the residual differential item functioning (RDIF) detection framework in item response theory, named GRDIF, to analyze differential item functioning (DIF) in multiple groups. The GRDIF framework retains the advantages of the original RDIF framework, such as computational efficiency and ease of implementation. The performance of GRDIF was assessed through a simulation study and compared with existing DIF detection methods, including the generalized Mantel-Haenszel, Lasso-DIF, and alignment methods. Results showed that the GRDIF framework demonstrated well-controlled Type I error rates close to the nominal level of .05 and satisfactory power in detecting uniform, nonuniform, and mixed DIF across different simulated conditions. Each of the three GRDIF statistics, &lt;span&gt;&lt;/span&gt;&lt;math&gt;\u0000 &lt;semantics&gt;\u0000 &lt;mrow&gt;\u0000 &lt;mi&gt;G&lt;/mi&gt;\u0000 &lt;mi&gt;R&lt;/mi&gt;\u0000 &lt;mi&gt;D&lt;/mi&gt;\u0000 &lt;mi&gt;I&lt;/mi&gt;\u0000 &lt;msub&gt;\u0000 &lt;mi&gt;F&lt;/mi&gt;\u0000 &lt;mi&gt;R&lt;/mi&gt;\u0000 &lt;/msub&gt;\u0000 &lt;/mrow&gt;\u0000 &lt;annotation&gt;$GRDI{{F}_R}$&lt;/annotation&gt;\u0000 &lt;/semantics&gt;&lt;/math&gt;, &lt;span&gt;&lt;/span&gt;&lt;math&gt;\u0000 &lt;semantics&gt;\u0000 &lt;mrow&gt;\u0000 &lt;mi&gt;G&lt;/mi&gt;\u0000 &lt;mi&gt;R&lt;/mi&gt;\u0000 &lt;mi&gt;D&lt;/mi&gt;\u0000 &lt;mi&gt;I&lt;/mi&gt;\u0000 &lt;msub&gt;\u0000 &lt;mi&gt;F&lt;/mi&gt;\u0000 &lt;mi&gt;S&lt;/mi&gt;\u0000 &lt;/msub&gt;\u0000 &lt;/mrow&gt;\u0000 &lt;annotation&gt;$GRDI{{F}_S}$&lt;/annotation&gt;\u0000 &lt;/semantics&gt;&lt;/math&gt;, and &lt;span&gt;&lt;/span&gt;&lt;math&gt;\u0000 &lt;semantics&gt;\u0000 &lt;mrow&gt;\u0000 &lt;mi&gt;G&lt;/mi&gt;\u0000 &lt;mi&gt;R&lt;/mi&gt;\u0000 &lt;mi&gt;D&lt;/mi&gt;\u0000 &lt;mi&gt;I&lt;/mi&gt;\u0000 &lt;msub&gt;\u0000 &lt;mi&gt;F&lt;/mi&gt;\u0000 &lt;mrow&gt;\u0000 &lt;mi&gt;R&lt;/mi&gt;\u0000 &lt;mi&gt;S&lt;/mi&gt;\u0000 &lt;/mrow&gt;\u0000 &lt;/msub&gt;\u0000 &lt;/mrow&gt;\u0000 &lt;annotation&gt;$GRDI{{F}_{RS}}$&lt;/annotation&gt;\u0000 &lt;/semantics&gt;&lt;/math&gt;, effectively detected the specific type of DIF for which it was designed, with &lt;span&gt;&lt;/span&gt;&lt;math&gt;\u0000 &lt;semantics&gt;\u0000 &lt;mrow&gt;\u0000 &lt;mi&gt;G&lt;/mi&gt;\u0000 &lt;mi&gt;R&lt;/mi&gt;\u0000 &lt;mi&gt;D&lt;/mi&gt;\u0000 &lt;mi&gt;I&lt;/mi&gt;\u0000 &lt;msub&gt;\u0000 &lt;mi&gt;F&lt;/mi&gt;\u0000 &lt;mrow&gt;\u0000 &lt;mi&gt;R&lt;/mi&gt;\u0000 &lt;mi&gt;S&lt;/mi&gt;\u0000 &lt;/mrow&gt;\u0000 &lt;/msub&gt;\u0000 &lt;/mrow&gt;\u0000 &lt;annotation&gt;$GRDI{{F}_{RS}}$&lt;/annotation&gt;\u0000 &lt;/semantics&gt;&lt;/math&gt; exhibiting the most robust performance across all types of DIF. 
The GRDIF framework outperformed other","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"656-681"},"PeriodicalIF":1.4,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12415","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Item Response Tree Model for Items with Multiple-Choice and Constructed-Response Parts
IF 1.4 | Zone 4 (Psychology) | Q3 PSYCHOLOGY, APPLIED | Pub Date: 2024-10-07 | DOI: 10.1111/jedm.12414
Junhuan Wei, Qin Wang, Buyun Dai, Yan Cai, Dongbo Tu

Traditional IRT and IRTree models are not appropriate for analyzing items that combine a multiple-choice (MC) task and a constructed-response (CR) task within a single item. To address this issue, this study proposes an item response tree model (IRTree-MR) that accommodates items with different response types at different steps and multiple distinct cognitive processes behind each score, allowing the cognitive process to be investigated effectively and examinees to be evaluated more accurately. The proposed model employs an appropriate processing function for each task and allows multiple paths to an observed outcome. Simulation studies were conducted to evaluate the performance of IRTree-MR, and the results show that it outperforms the traditional IRT model in terms of parameter recovery and model fit. Moreover, an empirical study was carried out to verify the advantages of the proposed model.
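A generic IRTree-style mapping, not the IRTree-MR specification itself, shows how an item with an MC part followed by a CR part can be recoded into node-level pseudo-items; the node definitions below are hypothetical and assume pandas is available.

```python
import numpy as np
import pandas as pd

def to_pseudo_items(mc_correct, cr_score):
    """Recode one observed (MC outcome, CR score out of 2) pair into tree-node pseudo-items.
    Node 1 codes the MC outcome; nodes 2 and 3 code successive CR score thresholds,
    with node 3 treated as unreached (np.nan) when the first CR point was not earned."""
    node1 = int(mc_correct)                                    # MC part answered correctly?
    node2 = int(cr_score >= 1)                                 # CR part earned at least 1 point?
    node3 = int(cr_score >= 2) if cr_score >= 1 else np.nan    # second CR point, conditional on the first
    return node1, node2, node3

# Hypothetical observed responses: (MC correct, CR score out of 2).
observed = [(1, 2), (1, 0), (0, 1), (0, 0)]
pseudo = pd.DataFrame([to_pseudo_items(mc, cr) for mc, cr in observed],
                      columns=["node1_MC", "node2_CR1", "node3_CR2"])
print(pseudo)
# Each pseudo-item column can then be fitted with an ordinary IRT model,
# which is the usual way IRTree structures are estimated.
```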

{"title":"An Item Response Tree Model for Items with Multiple-Choice and Constructed-Response Parts","authors":"Junhuan Wei,&nbsp;Qin Wang,&nbsp;Buyun Dai,&nbsp;Yan Cai,&nbsp;Dongbo Tu","doi":"10.1111/jedm.12414","DOIUrl":"https://doi.org/10.1111/jedm.12414","url":null,"abstract":"<p>Traditional IRT and IRTree models are not appropriate for analyzing the item that simultaneously consists of multiple-choice (MC) task and constructed-response (CR) task in one item. To address this issue, this study proposed an item response tree model (called as IRTree-MR) to accommodate items that contain different response types at different steps and multiple different cognitive processes behind each score to effectively investigate the cognitive process and achieve a more accurate evaluation of examinees. The proposed model employs appropriate processing function for each task and allows multiple paths to an observed outcome. The simulation studies were conducted to evaluate the performance of the proposed IRTree-MR, and results show the proposed model outperforms the traditional IRT model in terms of parameters recovery and model-fit. Moreover, an empirical study was carried out to verify the advantages of the proposed model.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"634-655"},"PeriodicalIF":1.4,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143248912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0