
Assessing Writing: Latest Publications

Using ChatGPT to score essays and short-form constructed responses
IF 5.5 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-10-13 | DOI: 10.1016/j.asw.2025.100988
Mark D. Shermis
This study evaluates the effectiveness of ChatGPT-4o in scoring essays and short-form constructed responses compared to human raters and traditional machine learning models. Using data from the Automated Student Assessment Prize (ASAP), ChatGPT’s performance was assessed across multiple predictive models, including linear regression, random forest, gradient boost, and XGBoost. Results indicate that while ChatGPT’s gradient boost model achieved quadratic weighted kappa (QWK) scores close to those of human raters for some datasets, overall performance remained inconsistent, particularly for short-form responses. The study highlights key challenges, including variability in scoring accuracy, potential biases, and limitations in aligning ChatGPT’s predictions with human scoring standards. While ChatGPT demonstrated efficiency and scalability, its leniency and variability suggest that it should not yet replace human raters in high-stakes assessments. Instead, a hybrid approach combining AI with empirical scoring models may improve reliability. Future research should focus on refining AI-driven scoring models through enhanced fine-tuning, bias mitigation, and validation with broader datasets. Ethical considerations, including fairness in automated scoring and data security, must also be addressed. This study concludes that ChatGPT holds promise as a supplementary tool in educational assessment but requires further development to ensure validity and fairness.
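Quadratic weighted kappa, the agreement statistic cited throughout this abstract, penalizes rater disagreements by the square of their distance on the score scale. A minimal sketch of computing it with scikit-learn; the score vectors are invented for illustration:

```python
# Quadratic weighted kappa (QWK) between two raters' integer scores.
# Illustrative sketch only; the score vectors below are invented.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 2, 1, 3, 4, 4, 2, 3]   # hypothetical human ratings
model_scores = [2, 3, 3, 2, 2, 3, 4, 3, 2, 4]   # hypothetical ChatGPT ratings

# weights="quadratic" penalizes a 2-point disagreement four times as much
# as a 1-point disagreement, which is why QWK suits ordinal essay scores.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```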
Citations: 0
Response time for English learners on large-scale writing assessments
IF 5.5 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-08-13 | DOI: 10.1016/j.asw.2025.100979
Catherine Welch, Stephen Dunbar, Jeongmin Ji, Annette Vernon, Junhee Park
The focus of this study is the relationship between response time and performance for students completing an evidence-based writing assessment as part of a state’s accountability plan. Within an untimed test administration, this study examined differences in performance between English Learners and Non-English Learners across four writing traits. The observed differences lead to recommendations for assessment administration, for the appropriate allocation of testing time, and for strategies to teach students how to approach the assessment.
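The English Learner versus Non-English Learner contrast across four traits is, in analysis terms, a per-trait group comparison. A minimal sketch of one way to run it in Python; the file name, column names, trait labels, and the choice of the Mann-Whitney U test are all assumptions for illustration, not details reported in the abstract:

```python
# Per-trait comparison of English Learners (EL) vs. non-EL performance.
# Sketch only: the data file, column names, and test choice are invented.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.read_csv("writing_scores.csv")  # hypothetical data file
traits = ["idea_development", "organization",
          "conventions", "language_use"]  # hypothetical trait names

for trait in traits:
    el = df.loc[df["el_status"] == "EL", trait]
    non_el = df.loc[df["el_status"] == "non-EL", trait]
    stat, p = mannwhitneyu(el, non_el, alternative="two-sided")
    print(f"{trait}: EL mean={el.mean():.2f}, "
          f"non-EL mean={non_el.mean():.2f}, p={p:.4f}")
```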
Citations: 0
Comparing GPT-based approaches in automated writing evaluation
IF 4.2 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-06-24 | DOI: 10.1016/j.asw.2025.100961
Yingying Liu, Xiaofei Lu, Huilei Qi
Large language models (LLMs) like OpenAI’s GPT models show significant promise in automated writing evaluation (AWE). However, recent research has mainly focused on non-fine-tuned GPT models, with limited attention to fine-tuned models as well as potential factors influencing performance, such as model type, prompting strategy, and dataset characteristics. This study compares six GPT-based approaches for evaluating TOEFL argumentative writing, namely, GPT-3.5 zero-shot, GPT-3.5 few-shot, GPT-4 zero-shot, GPT-4 few-shot, and two fine-tuning methods. We assess the impact of model type (GPT-3.5 vs. GPT-4), prompting strategy (zero-shot vs. few-shot), fine-tuning, class imbalance and dataset shift on performance. Our findings reveal that fine-tuned GPT models consistently outperform non-fine-tuned GPT-4 models, which in turn outperform GPT-3.5 models. Few-shot prompting does not show clear advantages over zero-shot prompting in this study. Additionally, class imbalance and dataset shift negatively affect model accuracy and reliability. These results offer valuable insights into the effectiveness of different GPT-based approaches and the factors that influence their performance in AWE.
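The zero-shot versus few-shot contrast comes down to whether scored example essays are included in the prompt. A hedged sketch with the OpenAI Python SDK; the rubric text, examples, and model name are illustrative stand-ins, not the study’s actual materials:

```python
# Zero-shot vs. few-shot essay-scoring prompts; illustrative only.
# The rubric and example essays are invented stand-ins.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = "Score the essay from 1 (lowest) to 5 (highest) for argument quality."
FEW_SHOT_EXAMPLES = (
    "Example essay: '...'  Score: 2\n"
    "Example essay: '...'  Score: 4\n"
)

def score_essay(essay: str, few_shot: bool = False) -> str:
    prompt = RUBRIC + "\n"
    if few_shot:
        prompt += FEW_SHOT_EXAMPLES      # few-shot: prepend scored examples
    prompt += f"Essay: {essay}\nRespond with the score only."
    resp = client.chat.completions.create(
        model="gpt-4",                   # stand-in; swap for the model under test
        messages=[{"role": "user", "content": prompt}],
        temperature=0,                   # reduce run-to-run scoring variance
    )
    return resp.choices[0].message.content
```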
Citations: 0
Exploring the scoring validity of holistic and dimension-based Comparative Judgements of young learners’ EFL writing
IF 5.5 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-10-15 | DOI: 10.1016/j.asw.2025.100986
Rebecca Sickinger, John Pill, Tineke Brunfaut
Comparative Judgement (CJ) is a pairwise comparison evaluation method, typically conducted online. Multiple judges each compare the quality of a series of paired performances and, from their decisions, a rank order is constructed and scores calculated. Research across different educational contexts supports CJ’s reliability for evaluating written performances, permitting more precise scoring of scripts and dimension-focused evaluation. However, scant insights are available about the basis of judges’ evaluations. This issue is important because argument-based approaches to validation (common in the field of language testing and adopted in this study) require evidence to support claims about how scores are appropriate for the test purpose. Therefore, we investigate the scoring validity of CJ, both when used holistically (the standard application of CJ) and when evaluating scripts by individual criteria (termed dimensions in the research context). Twenty-seven judges evaluated 300 scripts addressing two writing task types in a national English as a Foreign Language examination for young learners in Austria. Judges reported via questionnaires what they had focused on while judging. Subsequently, eight judges provided think-aloud data while evaluating 157 scripts, offering further insight into the writing features they considered and their decision-making during CJ. Findings showed that while most judges adopted a decision-making process similar to traditional rating methods, some adapted their method to accommodate the nature of CJ evaluation. Furthermore, results indicated that the judges considered construct-relevant criteria when using CJ, both holistically and by dimension, thus offering support to an argument for the appropriateness of using CJ in this context.
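The rank order and scores that CJ derives from pairwise decisions are commonly estimated with a Bradley-Terry model; the abstract does not name the estimation method, so the sketch below is one plausible implementation, with an invented wins matrix:

```python
# Bradley-Terry strength estimates from pairwise CJ decisions.
# Sketch under the assumption that CJ scores are Bradley-Terry based;
# the wins matrix below is invented.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times script i beat script j."""
    n = wins.shape[0]
    p = np.ones(n)                   # initial strengths
    games = wins + wins.T            # total comparisons per pair
    for _ in range(iters):
        for i in range(n):
            # MM update: strength = wins / expected comparisons won
            denom = (games[i] / (p[i] + p)).sum()
            p[i] = wins[i].sum() / denom
        p /= p.sum()                 # fix the scale so strengths sum to 1
    return p

wins = np.array([[0, 3, 4],
                 [1, 0, 3],
                 [0, 1, 0]])         # 3 scripts, invented judgement counts
print(bradley_terry(wins))           # higher value = better-ranked script
```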
Citations: 0
Linguistic predictors of L2 writing performance: Variations across genres
IF 5.5 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-09-30 | DOI: 10.1016/j.asw.2025.100985
Weiwei Yang, Sara T. Cushing, Guoxing Yu
This study investigated how linguistic complexity (including lexical and syntactic complexity), accuracy, and fluency (CAF) predicted second language (L2) writing scores across four essay genres: narration, exposition, expo-argumentation, and argumentation. Approximately 60 essays were collected on each of these genres on the same subject matter and were scored using a holistic rubric. Eight measures of complexity, accuracy and fluency were examined. Forward stepwise regression analysis based on the Akaike Information Criterion Corrected (AICC) was conducted for each genre. The findings revealed a large amount of score variance explained by CAF: 61 % for the argumentative task and about 70 % for the other three tasks. Fluency was found to be a highly important score predictor for the narrative and expository tasks, while lexical sophistication was equally important or more important than fluency for the expo-argumentative and argumentative tasks. The regression model for the narrative task also differed from those for the expository and argumentative task types regarding syntactic complexity predictors. Lexical diversity was generally less important in predicting scores than lexical sophistication. The implications of the findings for L2 writing scoring and automated essay scoring are discussed.
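Forward stepwise selection under AICC adds, at each step, the predictor that most reduces the corrected criterion and stops when no remaining predictor helps. A sketch of the procedure with statsmodels; the AICc formula is the standard small-sample correction, and the predictor data are assumed to arrive as a pandas DataFrame:

```python
# Forward stepwise OLS selection using AICc; a generic sketch, not the
# study's exact pipeline. Predictors are assumed to be CAF indices.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def aicc(res) -> float:
    # AICc = AIC + 2k(k+1)/(n-k-1), with k = fitted parameters (incl. intercept)
    k = res.df_model + 1
    n = res.nobs
    return res.aic + 2 * k * (k + 1) / (n - k - 1)

def forward_stepwise(y: pd.Series, X: pd.DataFrame) -> list[str]:
    selected, best = [], np.inf
    while True:
        candidates = [c for c in X.columns if c not in selected]
        scores = {}
        for c in candidates:
            res = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            scores[c] = aicc(res)
        if not scores or min(scores.values()) >= best:
            return selected            # no candidate improves AICc: stop
        best_c = min(scores, key=scores.get)
        best = scores[best_c]
        selected.append(best_c)        # greedily keep the best new predictor
```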
Citations: 0
Improving writing feedback quality and self-efficacy of pre-service teachers in Gen-AI contexts: An experimental mixed-method design
IF 4.2 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-06-19 | DOI: 10.1016/j.asw.2025.100960
Siyu Zhu, Qingyang Li, Yuan Yao, Jialin Li, Xinhua Zhu
The rapid advancement of Generative AI (Gen-AI), such as ChatGPT, presents both opportunities and challenges for teacher education. For pre-service teachers (PSTs), Gen-AI offers new tools to enhance the efficiency and quality of writing feedback. However, it also raises concerns, as many PSTs lack classroom experience, confidence in giving feedback, and knowledge of how to effectively integrate AI-generated content into instructional practice. To address these issues, this study adopted a pre-post experimental design to examine the effects of targeted training on PSTs’ provision of writing feedback, with a focus on feedback quality, self-efficacy, and their relationship in ChatGPT-supported contexts. Across a two-week training program with 30 PSTs, Wilcoxon signed-rank tests on the content-analysis results showed significant improvements in feedback quality and self-efficacy. Semi-structured interviews with eight participants identified cognitive changes and enhanced ChatGPT operational skills as key drivers of these improvements. We reaffirmed that mastery and vicarious experiences are crucial for enhancing teacher self-efficacy. Furthermore, a reciprocal relationship was observed between feedback quality and self-efficacy in providing ChatGPT-assisted feedback. This study contributes to the broader discourse on ChatGPT in education and offers specific strategies for effectively incorporating new technology into teacher training.
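A pre-post design with ordinal quality ratings is a natural fit for the Wilcoxon signed-rank test the authors report. A minimal sketch with SciPy; the rating vectors are invented for illustration:

```python
# Pre/post comparison of feedback-quality ratings with the Wilcoxon
# signed-rank test; the rating vectors below are invented.
from scipy.stats import wilcoxon

pre  = [2, 3, 2, 4, 3, 2, 3, 2, 3, 2]   # hypothetical pre-training quality scores
post = [3, 4, 3, 4, 4, 3, 4, 3, 4, 3]   # hypothetical post-training scores

# Paired, nonparametric test on the per-participant differences.
stat, p = wilcoxon(pre, post)
print(f"W = {stat}, p = {p:.4f}")       # small p suggests a reliable pre-post shift
```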
Citations: 0
Challenges and opportunities of automated essay scoring for low-proficient L2 English writers
IF 5.5 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-09-20 | DOI: 10.1016/j.asw.2025.100982
Vanessa De Wilde, Orphée De Clercq
Assessing students’ writing can be a challenging activity. To make writing assessment more feasible, researchers have investigated the possibilities of automated essay scoring (AES). Most studies investigating AES have focused on L1 writing or intermediate to advanced L2 writing. In this study we explored the possibilities of using AES with low-proficiency L2 English writers. We used a dataset comprising writing samples from 3166 young L2 English learners who were at the very start of L2 English instruction. All tasks received a score assigned by human raters.
For automated scoring we experimented with two machine learning methods: first, a feature-based approach in which the dataset was linguistically preprocessed using natural language processing tools; second, a deep learning approach that fine-tuned various large language models. Because we were particularly interested in the influence of spelling errors, we also created a corrected, spell-checked version of our dataset.
Models trained on the uncorrected samples yielded the best results. The deep learning approach in particular achieved satisfying performance, with a quadratic weighted kappa above .70. The model fine-tuned on an underlying Dutch large language model was superior, which might be linked to the low L2 English proficiency of the young L1 Dutch writers in our sample.
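The abstract does not name the Dutch model or the fine-tuning recipe; the sketch below shows one common way to fine-tune a pretrained Dutch transformer for essay scoring as a regression task, using BERTje (GroNLP/bert-base-dutch-cased) as a stand-in checkpoint and invented file and column names:

```python
# Fine-tuning a Dutch pretrained transformer for essay scoring, framed as
# regression. Sketch only: the checkpoint, data file, column names, and
# hyperparameters are assumptions, not the paper's actual setup.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "GroNLP/bert-base-dutch-cased"   # plausible stand-in model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1, problem_type="regression")  # single score output

ds = load_dataset("csv", data_files="essays.csv")["train"]  # hypothetical file
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                padding="max_length"), batched=True)
ds = ds.rename_column("score", "labels")  # regression labels must be floats

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aes-model", num_train_epochs=3),
    train_dataset=ds,
)
trainer.train()
```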
Citations: 0
Growth mindset and writing engagement: The roles of motivation regulation and engagement with teacher’s written corrective feedback
IF 5.5 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-08-20 | DOI: 10.1016/j.asw.2025.100980
Mahdieh Darvari, S. Yahya Hejazi, Majid Sadoughi
Although engagement is widely recognized as a key and desirable outcome in Foreign/Second Language (L2) education and has garnered increasing research attention, its skill-specific manifestation, particularly in areas such as writing, remains underexplored. To address this gap, this study is among the first attempts to examine L2 writing engagement and its three potential predictors. To this end, motivated by Dweck’s (2017) mindsets theory and Lou and Noels’ (2019) language mindset meaning system, the present study investigated the link between growth L2 writing mindset (i.e., beliefs about the improvability and changeability of L2 writing skills and abilities) and engagement, taking into account the parallel mediating roles of learners’ motivation regulation and engagement with teacher’s written corrective feedback (WCF). A total of 343 Iranian learners at intermediate proficiency level were selected through convenience sampling and responded to questionnaires. Structural Equation Modelling (SEM) analyses indicated that growth writing mindset could positively predict engagement, and this link was mediated by learners’ motivation regulation and engagement with teacher’s WCF. Implications and suggestions for fostering a growth mindset, promoting learners’ feedback engagement, and supporting motivation regulation are presented.
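The hypothesized structure (growth mindset predicting engagement through two parallel mediators) can be written directly in SEM syntax. A hedged sketch with the semopy package; all variable names are invented stand-ins for the study’s measured constructs:

```python
# Parallel-mediation SEM: growth mindset -> motivation regulation and
# WCF engagement -> writing engagement. Variable names are invented;
# this is a structural sketch, not the study's fitted model.
import pandas as pd
from semopy import Model

desc = """
motiv_reg ~ growth_mindset
wcf_engage ~ growth_mindset
writing_engage ~ motiv_reg + wcf_engage + growth_mindset
"""

df = pd.read_csv("survey.csv")      # hypothetical questionnaire data
model = Model(desc)
model.fit(df)
print(model.inspect())              # path estimates, SEs, p-values
```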
Citations: 0
Assessing L2 writing formality using syntactic complexity indices: A fuzzy evaluation approach
IF 4.2 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-07-12 | DOI: 10.1016/j.asw.2025.100973
Zhiyun Huang, Guangyao Chen, Zhanhao Jiang
Addressing the ambiguity in formality standards, this study introduces a cutting-edge Multi-dimensional Connection Cloud Model (MCCM) that leverages syntactic complexity indices to develop a fuzzy assessment model for formality in L2 writing. Employing Elastic Net Regression (ENR), the results revealed that four large-grained indices (mean length of sentence, mean length of T-unit, complex nominals per T-unit and complex nominals per clause), and one fine-grained index (average number of dependents per direct object) were significant in predicting the level of formality in L2 writing. To evaluate the model’s predictive power, 45 essays were used as a validation set. The MCCM achieved a prediction accuracy of 91.1 % (41 of 45 cases) in matching human ratings, with connection degrees effectively capturing classification uncertainty and boundary transitions. This pioneering framework effectively navigates the complexities and variable distributions of indicators, offering a more objective solution compared to conventional expert evaluations and introducing a novel methodological approach to assessing formality in academic writing.
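Elastic Net combines L1 and L2 penalties, which suits sets of correlated syntactic indices because it can shrink and select predictors at the same time. A sketch with scikit-learn; the feature names loosely mirror the five significant indices reported above but are invented column labels, as is the data file:

```python
# Elastic Net regression of formality on syntactic complexity indices.
# Sketch only: the data file and feature column names are invented.
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("essays_indices.csv")           # hypothetical data
features = ["mean_len_sentence", "mean_len_tunit",
            "cn_per_tunit", "cn_per_clause", "deps_per_dobj"]

model = make_pipeline(
    StandardScaler(),                            # penalties assume scaled inputs
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5),  # tunes both penalties by CV
)
model.fit(df[features], df["formality"])
print(model[-1].coef_)   # nonzero coefficients = retained indices
```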
Citations: 0
GenAI and human assessments of L2 Chinese writing: Interrater reliability and rater bias
IF 5.5 | CAS Tier 1 (Literature) | Q1 EDUCATION & EDUCATIONAL RESEARCH | Pub Date: 2025-10-01 | Epub Date: 2025-10-31 | DOI: 10.1016/j.asw.2025.100989
Yuan Lu, Xiaoying Liles, Xi Ma
This study examines generative artificial intelligence (GenAI), specifically ChatGPT and DeepSeek, and human assessments of Chinese as a second language (L2) writing, with a focus on interrater reliability, severity, consistency, and potential genre-based biases. Agreement and correlation analyses revealed substantial variability in interrater reliability among human raters, regardless of their rating experience. ChatGPT consistently demonstrated higher agreement with human raters than DeepSeek. The lowest levels of agreement were observed between DeepSeek and human raters as well as between the two GenAI raters. A Many-Facet Rasch Model analysis showed that ChatGPT tended to rate essays more leniently than DeepSeek and closely resembled experienced human raters in terms of severity, but DeepSeek’s severity aligned more closely with that of novice human raters. No significant genre-based biases were identified for GenAI and human raters. The observed differences in GenAI rating performance likely result from differences in their large language models’ training data, computing capacities, model architectures, and functionalities. These findings offer evidence-based practical implications for the integration of GenAI tools in L2 Chinese writing assessment.
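The agreement and correlation analyses reported here typically combine exact-agreement rates, rank correlations, and weighted kappa. A minimal sketch of such a pairwise comparison; the rating vectors are invented for illustration:

```python
# Pairwise interrater agreement: exact agreement, Spearman correlation,
# and quadratic weighted kappa. Rating vectors are invented.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

chatgpt  = np.array([3, 4, 2, 5, 3, 4, 2, 3])   # hypothetical ratings
deepseek = np.array([2, 4, 2, 4, 3, 3, 2, 2])
human    = np.array([3, 4, 3, 5, 3, 4, 2, 3])

for name, rater in [("ChatGPT", chatgpt), ("DeepSeek", deepseek)]:
    exact = (rater == human).mean()                   # exact-agreement rate
    rho, _ = spearmanr(rater, human)                  # rank correlation
    qwk = cohen_kappa_score(rater, human, weights="quadratic")
    print(f"{name} vs human: exact={exact:.2f}, rho={rho:.2f}, QWK={qwk:.2f}")
```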
Citations: 0