Classroom Assessment Validation: Proficiency Claims and Uses
James H. McMillan
Unlike validation in standardized testing, classroom teachers need a simple and efficient way to reflect on the accuracy of claims based on student performance and then to consider whether the uses of those claims are appropriate. A two-phase reasoning process of validation, consisting of a proficiency claim/argument and a use/argument, is presented as a way for teachers to understand and apply the central tenets of validation to their classroom assessments. Because classroom assessment is contextualized and serves multiple purposes, each teacher must apply validation to their own situation. The accuracy of teachers' conclusions about proficiency claims and uses will depend on their skill in gathering supporting evidence and considering alternative explanations. Examples of the proposed classroom assessment validation process are presented.
Educational Measurement: Issues and Practice, 45(1). DOI: 10.1111/emip.70014. Published 2026-02-04.
AI-Generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity
Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang
The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including the development of new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy, suggesting that effective detection could remain manageable in practice.
{"title":"AI-Generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity","authors":"Yang Zhong, Jiangang Hao, Michael Fauss, Chen Li, Yuan Wang","doi":"10.1111/emip.70013","DOIUrl":"https://doi.org/10.1111/emip.70013","url":null,"abstract":"<p>The rapid advancement of large language models (LLMs) has enabled the generation of coherent essays, making AI-assisted writing increasingly common in educational and professional settings. Using large-scale empirical data, we examine and benchmark the characteristics and quality of essays generated by popular LLMs and discuss their implications for two key components of writing assessments: automated scoring and academic integrity. Our findings highlight limitations in existing automated scoring systems when applied to essays generated or heavily influenced by AI, and identify areas for improvement, including the development of new features to capture deeper thinking and recalibrating feature weights. Despite growing concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy, suggesting that effective detection could remain manageable in practice.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"45 1","pages":""},"PeriodicalIF":1.9,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146136621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated Scoring in Learning Progression-Based Assessment: A Comparison of Researcher and Machine Interpretations
Hui Jin, Cynthia Lima, Limin Wang
Although AI transformer models have demonstrated notable capability in automated scoring, it is difficult to examine how and why these models fall short in scoring some responses. This study investigated how transformer models' language processing and quantification processes can be leveraged to enhance the accuracy of automated scoring. Automated scoring was applied to five science items. Results indicate that prepending item descriptions to student responses provides additional contextual information to the transformer model, allowing it to generate automated scoring models with improved performance. These automated scoring models achieved scoring accuracy comparable to that of human raters. However, they struggled to evaluate responses containing complex scientific terminology and to interpret responses containing unusual symbols, atypical language errors, or logical inconsistencies. These findings underscore the importance of efforts by both researchers and teachers to advance the accuracy, fairness, and effectiveness of automated scoring.
Educational Measurement: Issues and Practice, 44(3), 25-37. DOI: 10.1111/emip.70003. Published 2025-08-13.
Exploring the Effect of Human Error When Using Expert Judgments to Train an Automated Scoring System
Stephanie Iaccarino, Brian E. Clauser, Polina Harik, Peter Baldwin, Yiyun Zhou, Michael T. Kane
Educational Measurement: Issues and Practice, 44(3), 15-24. DOI: 10.1111/emip.70002. Published 2025-08-07.