Effects of Scoring Features on the Accuracy of the Automated Scoring Model of English

Dongkwang Shin
{"title":"评分特征对英语自动评分模型准确性的影响","authors":"Dongkwang Shin","doi":"10.14333/kjte.2022.38.6.04","DOIUrl":null,"url":null,"abstract":"Purpose: This study attempted to compare the performance of an automated scoring model based on scoring features and three other automated scoring models that do not use the scoring feature. Methods: The data used in this study were 300 essays written by native English speakers in the tenth grade, which were divided into training and validating data at a ratio of 7:3. The RF model was used to predict the scores of the essays based on the analysis of 106 linguistic features extracted from Coh-Metrix. The accuracy of this model was compared to that of three deep learning models― RNN, LSTM, and GRU―which do not use those scoring features. Lastly, in the case of the RF model, scoring features mainly affecting accuracy prediction were further analyzed. Results: RNN which is a type of deep learning had an agreement with human scoring results ranging from .39 to .69 (mean=.58) in each rating domain, while LSTM showed an agreement between .59 and .72 (mean=.64), and GRU’s agreement was between .59 and .72 (mean=.64). However, when using the RF model based on the scoring features, its average accuracy was between .60 to .76 (mean=.69), which is the highest agreement among the four models. In the case of the RF model, ‘word count,’ ‘sentence count,’ and ‘vocabulary density’ were common characteristics across all scoring domains, but several scoring features reflecting the characteristics of each scoring domain were also found. Conclusion: When large-scale data is not available, the RF model showed relatively high automated scoring performance, and the RF model that provides learners with pedagogical feedback based on scoring features would be more useful compared to the other three automated scoring models in terms of the utilization of the automated scoring result.","PeriodicalId":22672,"journal":{"name":"The Journal of Korean Teacher Education","volume":"16 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Effects of Scoring Features on the Accuracy of the Automated Scoring Model of English\",\"authors\":\"Dongkwang Shin\",\"doi\":\"10.14333/kjte.2022.38.6.04\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Purpose: This study attempted to compare the performance of an automated scoring model based on scoring features and three other automated scoring models that do not use the scoring feature. Methods: The data used in this study were 300 essays written by native English speakers in the tenth grade, which were divided into training and validating data at a ratio of 7:3. The RF model was used to predict the scores of the essays based on the analysis of 106 linguistic features extracted from Coh-Metrix. The accuracy of this model was compared to that of three deep learning models― RNN, LSTM, and GRU―which do not use those scoring features. Lastly, in the case of the RF model, scoring features mainly affecting accuracy prediction were further analyzed. Results: RNN which is a type of deep learning had an agreement with human scoring results ranging from .39 to .69 (mean=.58) in each rating domain, while LSTM showed an agreement between .59 and .72 (mean=.64), and GRU’s agreement was between .59 and .72 (mean=.64). 
However, when using the RF model based on the scoring features, its average accuracy was between .60 to .76 (mean=.69), which is the highest agreement among the four models. In the case of the RF model, ‘word count,’ ‘sentence count,’ and ‘vocabulary density’ were common characteristics across all scoring domains, but several scoring features reflecting the characteristics of each scoring domain were also found. Conclusion: When large-scale data is not available, the RF model showed relatively high automated scoring performance, and the RF model that provides learners with pedagogical feedback based on scoring features would be more useful compared to the other three automated scoring models in terms of the utilization of the automated scoring result.\",\"PeriodicalId\":22672,\"journal\":{\"name\":\"The Journal of Korean Teacher Education\",\"volume\":\"16 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The Journal of Korean Teacher Education\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14333/kjte.2022.38.6.04\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Journal of Korean Teacher Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14333/kjte.2022.38.6.04","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Purpose: This study compared the performance of an automated scoring model based on scoring features with that of three automated scoring models that do not use scoring features.

Methods: The data were 300 essays written by tenth-grade native English speakers, divided into training and validation sets at a ratio of 7:3. A random forest (RF) model was used to predict essay scores from 106 linguistic features extracted with Coh-Metrix. Its accuracy was compared with that of three deep learning models (RNN, LSTM, and GRU) that do not use those scoring features. Finally, for the RF model, the scoring features that most affected prediction accuracy were further analyzed.

Results: The RNN's agreement with human scoring ranged from .39 to .69 (mean = .58) across rating domains, the LSTM's agreement ranged from .59 to .72 (mean = .64), and the GRU's agreement ranged from .59 to .72 (mean = .64). The RF model based on scoring features reached agreements between .60 and .76 (mean = .69), the highest among the four models. For the RF model, word count, sentence count, and vocabulary density were common influential characteristics across all scoring domains, but several scoring features reflecting the characteristics of individual scoring domains were also found.

Conclusion: When large-scale data are not available, the RF model showed relatively high automated scoring performance, and because it can provide learners with pedagogical feedback based on scoring features, it would be more useful than the other three automated scoring models in terms of making use of the automated scoring results.
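
The following is a minimal sketch, not the author's original code, of the RF pipeline described in the Methods: a 7:3 train/validation split, a random forest fit to Coh-Metrix-style linguistic features, an agreement check against human scores, and a feature-importance inspection of the kind that would surface word count, sentence count, and vocabulary density. The file name, column names, hyperparameters, and the use of quadratic weighted kappa as the agreement metric are assumptions; the study does not specify its implementation.

```python
# A minimal sketch (assumed, not the study's original code) of an RF-based
# automated scoring pipeline with a 7:3 split, agreement with human scores,
# and feature-importance inspection.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

# Hypothetical input: one row per essay, 106 Coh-Metrix feature columns
# plus a human-assigned score for one rating domain.
data = pd.read_csv("essays_cohmetrix_features.csv")
feature_cols = [c for c in data.columns if c not in ("essay_id", "human_score")]
X, y = data[feature_cols], data["human_score"]

# 7:3 split into training and validation data, as described in Methods.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

# Round predictions to the rating scale before computing agreement with
# the human rater (quadratic weighted kappa is one common choice).
pred = rf.predict(X_val).round().astype(int)
qwk = cohen_kappa_score(y_val, pred, weights="quadratic")
print(f"Agreement with human scores (QWK): {qwk:.2f}")

# Inspect which features drive the predictions, e.g. word count,
# sentence count, and vocabulary density in the reported results.
importances = (
    pd.Series(rf.feature_importances_, index=feature_cols)
    .sort_values(ascending=False)
    .head(10)
)
print(importances)
```

Rounding the regressor's continuous predictions onto the rating scale keeps the kappa computation on discrete labels; a classifier could be substituted if each rating domain is treated as categorical.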