
Journal of Educational Measurement: Latest Publications

The Vulnerability of AI-Based Scoring Systems to Gaming Strategies: A Case Study
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2025-02-20 · DOI: 10.1111/jedm.12427
Peter Baldwin, Victoria Yaneva, Kai North, Le An Ha, Yiyun Zhou, Alex J. Mechaber, Brian E. Clauser

Recent developments in the use of large language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of their performance warrant examination. In this study, we explore the potential for examinees to inflate their scores by gaming the ACTA automated scoring system. We explore a range of strategies including responding with words selected from the item stem and responding with multiple answers. These responses would be easily identified as incorrect by a human rater but may result in false-positive classifications from an automated system. Our results show that the rate at which these strategies produce responses that are scored as correct varied across items and across strategies but that several vulnerabilities exist.
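The gaming strategies the abstract describes can be made concrete with a toy example. The ACTA system itself is not public, so the word-overlap scorer below is purely hypothetical (the scoring rule, the item, and the threshold are all invented); it only illustrates why echoing stem words or listing multiple candidate answers can draw false positives from a content-based scorer.

```python
# Hypothetical sketch: a naive word-overlap scorer standing in for a
# content-based automated scoring system. All items and thresholds are
# invented for illustration; this is NOT the ACTA system from the paper.

def toy_scorer(response, key, threshold=0.5):
    """Score 1 if enough of the key's content words appear in the response."""
    key_words = set(key.lower().split())
    resp_words = set(response.lower().split())
    overlap = len(key_words & resp_words) / len(key_words)
    return int(overlap >= threshold)

key = "beta blockers reduce heart rate"
stem = "a patient with elevated heart rate asks which drug class to reduce it"

# A genuine correct answer is scored 1, as intended.
assert toy_scorer("beta blockers to reduce the heart rate", key) == 1

# Gaming strategy 1: echo words taken from the item stem.
gamed_stem = " ".join(sorted(set(stem.split())))
# Gaming strategy 2: respond with multiple candidate answers at once.
gamed_multi = "beta blockers or calcium channel blockers reduce heart rate"

print("stem-echo scored correct:", toy_scorer(gamed_stem, key))
print("multi-answer scored correct:", toy_scorer(gamed_multi, key))
```

Both gamed responses clear the overlap threshold and are scored correct, even though a human rater would reject them on sight — the kind of vulnerability the study probes item by item.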

{"title":"The Vulnerability of AI-Based Scoring Systems to Gaming Strategies: A Case Study","authors":"Peter Baldwin,&nbsp;Victoria Yaneva,&nbsp;Kai North,&nbsp;Le An Ha,&nbsp;Yiyun Zhou,&nbsp;Alex J. Mechaber,&nbsp;Brian E. Clauser","doi":"10.1111/jedm.12427","DOIUrl":"https://doi.org/10.1111/jedm.12427","url":null,"abstract":"<p>Recent developments in the use of large-language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of their performance warrant examination. In this study, we explore the potential for examinees to inflate their scores by gaming the ACTA automated scoring system. We explore a range of strategies including responding with words selected from the item stem and responding with multiple answers. These responses would be easily identified as incorrect by a human rater but may result in false-positive classifications from an automated system. Our results show that the rate at which these strategies produce responses that are scored as correct varied across items and across strategies but that several vulnerabilities exist.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"172-194"},"PeriodicalIF":1.4,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Using Multilabel Neural Network to Score High-Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2025-01-14 · DOI: 10.1111/jedm.12424
Shun-Fu Hu, Amery D. Wu, Jake Stone

Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or precision, to suit users' varying needs. These two objectives are illustrated with an example of scoring the short version of the College Majors Preference Assessment (Short CMPA) to match the results of whether the 50 college majors would be in one's top three, as determined by the Long CMPA. The results reveal that MNN significantly outperforms the simple-sum ranking method (i.e., ranking the 50 majors' subscale scores) in targeting recall (.95 vs. .68) and precision (.53 vs. .38), while gaining an additional 3% in accuracy (.94 vs. .91). These findings suggest that, when executed properly, MNN can be a flexible and practical tool for scoring numerous traits and addressing various use foci.
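The core idea of targeting different metrics with the same trained model can be reduced to threshold selection on predicted per-label probabilities. The sketch below uses invented probabilities and labels (it is not the paper's MNN or CMPA data); it only shows how a low decision threshold trades precision for recall and a high one does the reverse.

```python
# Hypothetical sketch: tuning a decision threshold on multilabel probability
# outputs to favor recall or precision. Probabilities and labels are invented.

def precision_recall(probs, labels, threshold):
    preds = [int(p >= threshold) for p in probs]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

probs  = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]  # predicted "top-three major" scores
labels = [1,   1,   0,   1,   0,   0]    # whether the major truly ranks top three

# A low threshold favors recall; a high threshold favors precision.
print("t=0.3:", precision_recall(probs, labels, 0.3))
print("t=0.7:", precision_recall(probs, labels, 0.7))
```

At threshold 0.3 every true top-three major is flagged (recall 1.0, precision 0.6); at 0.7 only sure bets are flagged (precision 1.0, recall about 0.67) — the same trade-off the paper exploits to serve different use foci.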

{"title":"Using Multilabel Neural Network to Score High-Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment","authors":"Shun-Fu Hu,&nbsp;Amery D. Wu,&nbsp;Jake Stone","doi":"10.1111/jedm.12424","DOIUrl":"https://doi.org/10.1111/jedm.12424","url":null,"abstract":"<p>Scoring high-dimensional assessments (e.g., &gt; 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or precision, to suit users' varying needs. These two objectives are illustrated with an example of scoring the short version of the College Majors Preference assessment (Short CMPA) to match the results of whether the 50 college majors would be in one's top three, as determined by the Long CMPA. The results reveal that MNN significantly outperforms the simple-sum ranking method (i.e., ranking the 50 majors' subscale scores) in targeting recall (.95 vs. .68) and precision (.53 vs. .38), while gaining an additional 3% in accuracy (.94 vs. .91). These findings suggest that, when executed properly, MNN can be a flexible and practical tool for scoring numerous traits and addressing various use foci.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"120-144"},"PeriodicalIF":1.4,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143689091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
IRT Observed-Score Equating for Rater-Mediated Assessments Using a Hierarchical Rater Model
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2025-01-13 · DOI: 10.1111/jedm.12425
Tong Wu, Stella Y. Kim, Carl Westine, Michelle Boyer

While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. Differences in standard error between the methods were minimal under most conditions.
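The paper's IRT observed-score equating with a hierarchical rater model is far beyond a snippet, but the underlying notion of equating — mapping scores from one form onto the scale of another — can be shown with the much simpler linear equating method, which matches the means and standard deviations of the two forms. The score lists below are invented.

```python
# Hypothetical sketch: linear equating as a minimal stand-in for the paper's
# IRT observed-score equating. Form X and Form Y score lists are invented.

from statistics import mean, pstdev

def linear_equate(score_x, scores_x, scores_y):
    """Map a Form X score onto the Form Y scale by matching means and SDs."""
    mx, sx = mean(scores_x), pstdev(scores_x)
    my, sy = mean(scores_y), pstdev(scores_y)
    return my + (sy / sx) * (score_x - mx)

form_x = [10, 12, 14, 16, 18]  # scores on the easier form
form_y = [8, 11, 14, 17, 20]   # scores on the harder, more spread-out form

# A 16 on Form X corresponds to this score on the Form Y scale:
print(linear_equate(16, form_x, form_y))
```

Here both forms have mean 14, so a mean score maps to itself, while scores above the mean are stretched by the ratio of standard deviations; rater-introduced error distorts exactly these moments, which is what the proposed HRM-based method corrects for.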

{"title":"IRT Observed-Score Equating for Rater-Mediated Assessments Using a Hierarchical Rater Model","authors":"Tong Wu,&nbsp;Stella Y. Kim,&nbsp;Carl Westine,&nbsp;Michelle Boyer","doi":"10.1111/jedm.12425","DOIUrl":"https://doi.org/10.1111/jedm.12425","url":null,"abstract":"<p>While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating method that accounts for rater errors by utilizing item response theory (IRT) observed-score equating with a hierarchical rater model (HRM). Its effectiveness is compared to an IRT observed-score equating method using the generalized partial credit model across 16 rater combinations with varying levels of rater bias and variability. The results indicate that equating performance depends on the interaction between rater bias and variability across forms. Both the proposed and traditional methods demonstrated robustness in terms of bias and RMSE when rater bias and variability were similar between forms, with a few exceptions. However, when rater errors varied significantly across forms, the proposed method consistently produced more stable equating results. 
Differences in standard error between the methods were minimal under most conditions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"145-171"},"PeriodicalIF":1.4,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
A Note on the Use of Categorical Subscores
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2025-01-07 · DOI: 10.1111/jedm.12423
Kylie Gorney, Sandip Sinharay

Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim by examining (a) the agreement between true and observed subscore classifications and (b) the agreement between subscore classifications across parallel forms of a test. Results show that the categorical subscores of Feinberg and von Davier are often inaccurate and/or inconsistent, pointing to a lack of justification for using them for remediation or instructional purposes.
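The two agreement checks the abstract names — agreement between classifications, within and across parallel forms — come down to computing observed agreement and a chance-corrected index such as Cohen's kappa over paired category assignments. The category labels below are invented for illustration.

```python
# Hypothetical sketch: observed agreement and Cohen's kappa between
# categorical subscore classifications on two parallel forms.
# The category assignments are invented.

from collections import Counter

def agreement_and_kappa(form_a, form_b):
    n = len(form_a)
    observed = sum(a == b for a, b in zip(form_a, form_b)) / n
    ca, cb = Counter(form_a), Counter(form_b)
    # Chance agreement: probability both forms assign the same category
    # if classifications were independent with these marginal frequencies.
    expected = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

form_a = ["weak", "ok", "ok", "strong", "weak", "ok", "strong", "ok"]
form_b = ["weak", "ok", "weak", "strong", "ok", "ok", "strong", "strong"]

obs, kappa = agreement_and_kappa(form_a, form_b)
print(f"observed agreement {obs:.3f}, kappa {kappa:.3f}")
```

In this toy case the forms agree on only 5 of 8 examinees, and kappa is well below conventional reliability standards — the kind of inconsistency across parallel forms that undercuts remedial use of the categories.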

{"title":"A Note on the Use of Categorical Subscores","authors":"Kylie Gorney,&nbsp;Sandip Sinharay","doi":"10.1111/jedm.12423","DOIUrl":"https://doi.org/10.1111/jedm.12423","url":null,"abstract":"<p>Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim by examining (a) the agreement between true and observed subscore classifications and (b) the agreement between subscore classifications across parallel forms of a test. Results show that the categorical subscores of Feinberg and von Davier are often inaccurate and/or inconsistent, pointing to a lack of justification for using them for remediation or instructional purposes.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"101-119"},"PeriodicalIF":1.4,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12423","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
An Exploratory Study Using Innovative Graphical Network Analysis to Model Eye Movements in Spatial Reasoning Problem Solving
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2024-12-20 · DOI: 10.1111/jedm.12421
Kaiwen Man, Joni M. Lakin

Eye-tracking procedures generate copious process data that could be valuable in establishing the response processes component of modern validity theory. However, there is a lack of tools for assessing and visualizing response processes using process data such as eye-tracking fixation sequences, especially those suitable for young children. This study, which explored student responses to a spatial reasoning task, employed eye tracking and social network analysis to model, examine, and visualize students' visual transition patterns while solving spatial problems to begin to elucidate these processes. Fifty students in Grades 2–8 completed a spatial reasoning task as eye movements were recorded. Areas of interest (AoIs) were defined within the task for each spatial reasoning question. Transition networks between AoIs were constructed and analyzed using selected network measures. Results revealed shared transition sequences across students as well as strategic differences between high and low performers. High performers demonstrated more integrated transitions between AoIs, while low performers considered information more in isolation. Additionally, age and the interaction of age and performance did not significantly impact these measures. The study demonstrates a novel modeling approach for investigating visual processing and provides initial evidence that high-performing students more deeply engage with visual information in solving these types of questions.
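The modeling step — turning a fixation sequence over areas of interest (AoIs) into a directed transition network with summary measures — can be sketched in a few lines. The fixation sequence, AoI names, and the density measure below are illustrative assumptions, not the study's actual data or its full set of network measures.

```python
# Hypothetical sketch: build a directed AoI transition network from an
# eye-tracking fixation sequence and compute its density. The fixation
# sequence and AoI labels are invented.

from collections import Counter

fixations = ["stem", "optA", "stem", "optB", "optA", "optB", "stem", "optC"]

# Count a directed edge (u, v) each time gaze moves from AoI u to AoI v.
edges = Counter((u, v) for u, v in zip(fixations, fixations[1:]) if u != v)

aois = sorted(set(fixations))
# Density: fraction of possible directed AoI pairs actually traversed --
# a rough index of how integrated the scan path is.
density = len(edges) / (len(aois) * (len(aois) - 1))

print("transition counts:", dict(edges))
print("network density:", round(density, 2))
```

A scan path that shuttles between many AoI pairs yields a denser network — consistent with the finding that high performers showed more integrated transitions, while low performers examined regions in isolation.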

{"title":"An Exploratory Study Using Innovative Graphical Network Analysis to Model Eye Movements in Spatial Reasoning Problem Solving","authors":"Kaiwen Man,&nbsp;Joni M. Lakin","doi":"10.1111/jedm.12421","DOIUrl":"https://doi.org/10.1111/jedm.12421","url":null,"abstract":"<p>Eye-tracking procedures generate copious process data that could be valuable in establishing the response processes component of modern validity theory. However, there is a lack of tools for assessing and visualizing response processes using process data such as eye-tracking fixation sequences, especially those suitable for young children. This study, which explored student responses to a spatial reasoning task, employed eye tracking and social network analysis to model, examine, and visualize students' visual transition patterns while solving spatial problems to begin to elucidate these processes. Fifty students in Grades 2–8 completed a spatial reasoning task as eye movements were recorded. Areas of interest (AoIs) were defined within the task for each spatial reasoning question. Transition networks between AoIs were constructed and analyzed using selected network measures. Results revealed shared transition sequences across students as well as strategic differences between high and low performers. High performers demonstrated more integrated transitions between AoIs, while low performers considered information more in isolation. Additionally, age and the interaction of age and performance did not significantly impact these measures. 
The study demonstrates a novel modeling approach for investigating visual processing and provides initial evidence that high-performing students more deeply engage with visual information in solving these types of questions.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"710-739"},"PeriodicalIF":1.4,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Modeling Directional Testlet Effects on Multiple Open-Ended Questions
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2024-12-10 · DOI: 10.1111/jedm.12422
Kuan-Yu Jin, Wai-Lok Siu

Educational tests often have a cluster of items linked by a common stimulus (testlet). In such a design, the dependencies induced between items are called testlet effects. In particular, the directional testlet effect (DTE) refers to a recursive influence whereby responses to earlier items can positively or negatively affect the scores on later items. This study aims to introduce an innovative measurement model to describe DTEs among multiple polytomously scored open-ended items. Through simulations, we found that (1) item and DTE parameters can be accurately recovered in Latent GOLD®, (2) ignoring positive (or negative) DTEs by fitting a standard item response theory model can result in the overestimation (or underestimation) of test reliability, (3) collapsing multiple items of a testlet into a super item is still effective in eliminating DTEs, (4) the popular multidimensional strategy of adding nuisance factors to describe item dependencies fails to account for DTE adequately, and (5) fitting the proposed model for DTE to testlet data involving nuisance factors will observe positive DTEs but will not have a better fit. Moreover, using the proposed model, we demonstrated the coexistence of positive and negative DTEs in a real history exam.

{"title":"Modeling Directional Testlet Effects on Multiple Open-Ended Questions","authors":"Kuan-Yu Jin,&nbsp;Wai-Lok Siu","doi":"10.1111/jedm.12422","DOIUrl":"https://doi.org/10.1111/jedm.12422","url":null,"abstract":"<p>Educational tests often have a cluster of items linked by a common stimulus (<i>testlet</i>). In such a design, the dependencies caused between items are called <i>testlet effects</i>. In particular, the directional testlet effect (DTE) refers to a recursive influence whereby responses to earlier items can positively or negatively affect the scores on later items. This study aims to introduce an innovative measurement model to describe DTEs among multiple polytomouslyscored open-ended items. Through simulations, we found that (1) item and DTE parameters can be accurately recovered in Latent GOLD<sup>®</sup>, (2) ignoring positive (or negative) DTEs by fitting a standard item response theory model can result in the overestimation (or underestimation) of test reliability, (3) collapsing multiple items of a testlet into a super item is still effective in eliminating DTEs, (4) the popular multidimensional strategy of adding nuisance factors to describe item dependencies fails to account for DTE adequately, and (5) fitting the proposed model for DTE to testlet data involving nuisance factors will observe positive DTEs but will not have a better fit. 
Moreover, using the proposed model, we demonstrated the coexistence of positive and negative DTEs in a real history exam.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"62 1","pages":"81-100"},"PeriodicalIF":1.4,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143688659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Differences in Time Usage as a Competing Hypothesis for Observed Group Differences in Accuracy with an Application to Observed Gender Differences in PISA Data
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2024-11-01 · DOI: 10.1111/jedm.12419
Radhika Kapoor, Erin Fahle, Klint Kanopka, David Klinowski, Ana Trindade Ribeiro, Benjamin W. Domingue

Group differences in test scores are a key metric in education policy. Response time offers novel opportunities for understanding these differences, especially in low-stakes settings. Here, we describe how observed group differences in test accuracy can be attributed to group differences in latent response speed or group differences in latent capacity, where capacity is defined as expected accuracy for a given response speed. This article introduces a method for decomposing observed group differences in accuracy into these differences in speed versus differences in capacity. We first illustrate in simulation studies that this approach can reliably distinguish between group speed and capacity differences. We then use this approach to probe gender differences in science and reading fluency in PISA 2018 for 71 countries. In science, score differentials largely increase when males, who respond more rapidly, are the higher performing group and decrease when females, who respond more slowly, are the higher performing group. In reading fluency, score differentials decrease where females, who respond more rapidly, are the higher performing group. This method can be used to analyze group differences especially in low-stakes assessments where there are potential group differences in speed.
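The decomposition logic — capacity as expected accuracy at a given speed, with the observed accuracy gap split into a capacity component and a speed component — can be sketched with a deliberately simplified linear version. The data are invented and the per-group accuracy-speed curves here are plain least-squares lines, not the paper's latent-variable model.

```python
# Hypothetical sketch: decompose an observed group accuracy gap into a
# capacity component (difference in expected accuracy at a common speed)
# and a residual speed component. Toy linear fits on invented data; the
# paper's actual model is a latent speed/capacity formulation.

def linear_fit(x, y):
    """Least-squares line through (x, y); returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# speed (items/minute) and accuracy for two groups
speed_a, acc_a = [2.0, 2.5, 3.0, 3.5], [0.70, 0.66, 0.62, 0.58]
speed_b, acc_b = [1.0, 1.5, 2.0, 2.5], [0.72, 0.68, 0.64, 0.60]

ia, ba = linear_fit(speed_a, acc_a)
ib, bb = linear_fit(speed_b, acc_b)

observed_gap = sum(acc_a) / 4 - sum(acc_b) / 4
pooled_speed = sum(speed_a + speed_b) / 8
# Capacity gap: difference in expected accuracy at a common (pooled) speed.
capacity_gap = (ia + ba * pooled_speed) - (ib + bb * pooled_speed)
speed_gap = observed_gap - capacity_gap

print(f"observed {observed_gap:+.3f} = capacity {capacity_gap:+.3f} "
      f"+ speed {speed_gap:+.3f}")
```

In this toy case group A scores 0.02 lower overall yet has the higher capacity (+0.06 at a common speed); its faster responding accounts for a -0.08 speed component — exactly the kind of reversal the decomposition is designed to surface.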

{"title":"Differences in Time Usage as a Competing Hypothesis for Observed Group Differences in Accuracy with an Application to Observed Gender Differences in PISA Data","authors":"Radhika Kapoor,&nbsp;Erin Fahle,&nbsp;Klint Kanopka,&nbsp;David Klinowski,&nbsp;Ana Trindade Ribeiro,&nbsp;Benjamin W. Domingue","doi":"10.1111/jedm.12419","DOIUrl":"https://doi.org/10.1111/jedm.12419","url":null,"abstract":"<p>Group differences in test scores are a key metric in education policy. Response time offers novel opportunities for understanding these differences, especially in low-stakes settings. Here, we describe how observed group differences in test accuracy can be attributed to group differences in latent response speed or group differences in latent capacity, where capacity is defined as expected accuracy for a given response speed. This article introduces a method for decomposing observed group differences in accuracy into these differences in speed versus differences in capacity. We first illustrate in simulation studies that this approach can reliably distinguish between group speed and capacity differences. We then use this approach to probe gender differences in science and reading fluency in PISA 2018 for 71 countries. In science, score differentials largely increase when males, who respond more rapidly, are the higher performing group and decrease when females, who respond more slowly, are the higher performing group. In reading fluency, score differentials decrease where females, who respond more rapidly, are the higher performing group. 
This method can be used to analyze group differences especially in low-stakes assessments where there are potential group differences in speed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"682-709"},"PeriodicalIF":1.4,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143247456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Correction to “Expanding the Lognormal Response Time Model Using Profile Similarity Metrics to Improve the Detection of Anomalous Testing Behavior”
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2024-10-23 · DOI: 10.1111/jedm.12418

Hurtz, G.M., & Mucino, R. (2024). Expanding the lognormal response time model using profile similarity metrics to improve the detection of anomalous testing behavior. Journal of Educational Measurement, 61, 458–485. https://doi.org/10.1111/jedm.12395

We apologize for this error.

{"title":"Correction to “Expanding the Lognormal Response Time Model Using Profile Similarity Metrics to Improve the Detection of Anomalous Testing Behavior”","authors":"","doi":"10.1111/jedm.12418","DOIUrl":"https://doi.org/10.1111/jedm.12418","url":null,"abstract":"<p>Hurtz, G.M., &amp; Mucino, R. (2024). Expanding the lognormal response time model using profile similarity metrics to improve the detection of anomalous testing behavior. <i>Journal of Educational Measurement, 61</i>, 458–485. https://doi.org/10.1111/jedm.12395</p><p>We apologize for this error.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"780"},"PeriodicalIF":1.4,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12418","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Subscores: A Practical Guide to Their Production and Consumption. Shelby Haberman, Sandip Sinharay, Richard Feinberg, and Howard Wainer. Cambridge: Cambridge University Press, 2024, 176 pp. (paperback)
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2024-10-18 · DOI: 10.1111/jedm.12417
Gautam Puhan
{"title":"Subscores: A Practical Guide to Their Production and Consumption. Shelby Haberman, Sandip Sinharay, Richard Feinberg, and Howard Wainer. Cambridge, Cambridge University Press 2024, 176 pp. (paperback)","authors":"Gautam Puhan","doi":"10.1111/jedm.12417","DOIUrl":"https://doi.org/10.1111/jedm.12417","url":null,"abstract":"","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"763-772"},"PeriodicalIF":1.4,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Using Keystroke Behavior Patterns to Detect Nonauthentic Texts in Writing Assessments: Evaluating the Fairness of Predictive Models
IF 1.4 · Zone 4 (Psychology) · Q3 PSYCHOLOGY, APPLIED · Pub Date: 2024-10-18 · DOI: 10.1111/jedm.12416
Yang Jiang, Mo Zhang, Jiangang Hao, Paul Deane, Chen Li

The emergence of sophisticated AI tools such as ChatGPT, coupled with the transition to remote delivery of educational assessments in the COVID-19 era, has led to increasing concerns about academic integrity and test security. Using AI tools, test takers can produce high-quality texts effortlessly and use them to game assessments. It is thus critical to detect these nonauthentic texts to ensure test integrity. In this study, we leveraged keystroke logs—recordings of every keypress—to build machine learning (ML) detectors of nonauthentic texts in a large-scale writing assessment. We focused on investigating the fairness of the detectors across demographic subgroups to ensure that nongenuine writing can be predicted equally well across subgroups. Results indicated that keystroke dynamics were effective in identifying nonauthentic texts. While the ML models were slightly more likely to misclassify the original responses submitted by male test takers as consisting of nonauthentic texts than those submitted by females, the effect sizes were negligible. Furthermore, balancing demographic distributions and class labels did not consistently mitigate detector bias across predictive models. Findings of this study not only provide implications for using behavioral data to address test security issues, but also highlight the importance of evaluating the fairness of predictive models in educational contexts.
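The fairness evaluation the abstract describes — checking that a detector's error rates are comparable across demographic subgroups — reduces to computing per-group rates such as the false-positive rate (genuine responses wrongly flagged as nonauthentic). The records below are invented; this is not the study's detector or data.

```python
# Hypothetical sketch: per-subgroup false-positive rates for a detector of
# nonauthentic text. Group labels, true labels, and predictions are invented.

def false_positive_rate(labels, preds):
    """Fraction of genuine (label 0) responses wrongly flagged (pred 1)."""
    fp = sum(p == 1 and y == 0 for y, p in zip(labels, preds))
    negatives = sum(y == 0 for y in labels)
    return fp / negatives

# (group, true label, predicted label); label 1 = nonauthentic text
records = [
    ("male", 0, 0), ("male", 0, 1), ("male", 1, 1), ("male", 0, 0),
    ("female", 0, 0), ("female", 0, 0), ("female", 1, 1), ("female", 0, 0),
]

for group in ("male", "female"):
    ys = [y for g, y, p in records if g == group]
    ps = [p for g, y, p in records if g == group]
    print(group, "FPR:", false_positive_rate(ys, ps))
```

A gap in FPR between groups, as in this toy data, is precisely the kind of disparity the study measured; the authors found such gaps for male test takers but with negligible effect sizes.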

{"title":"Using Keystroke Behavior Patterns to Detect Nonauthentic Texts in Writing Assessments: Evaluating the Fairness of Predictive Models","authors":"Yang Jiang,&nbsp;Mo Zhang,&nbsp;Jiangang Hao,&nbsp;Paul Deane,&nbsp;Chen Li","doi":"10.1111/jedm.12416","DOIUrl":"https://doi.org/10.1111/jedm.12416","url":null,"abstract":"<p>The emergence of sophisticated AI tools such as ChatGPT, coupled with the transition to remote delivery of educational assessments in the COVID-19 era, has led to increasing concerns about academic integrity and test security. Using AI tools, test takers can produce high-quality texts effortlessly and use them to game assessments. It is thus critical to detect these nonauthentic texts to ensure test integrity. In this study, we leveraged keystroke logs—recordings of every keypress—to build machine learning (ML) detectors of nonauthentic texts in a large-scale writing assessment. We focused on investigating the fairness of the detectors across demographic subgroups to ensure that nongenuine writing can be predicted equally well across subgroups. Results indicated that keystroke dynamics were effective in identifying nonauthentic texts. While the ML models were slightly more likely to misclassify the original responses submitted by male test takers as consisting of nonauthentic texts than those submitted by females, the effect sizes were negligible. Furthermore, balancing demographic distributions and class labels did not consistently mitigate detector bias across predictive models. 
Findings of this study not only provide implications for using behavioral data to address test security issues, but also highlight the importance of evaluating the fairness of predictive models in educational contexts.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":"61 4","pages":"571-594"},"PeriodicalIF":1.4,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0