{"title":"评分特征对英语自动评分模型准确性的影响","authors":"Dongkwang Shin","doi":"10.14333/kjte.2022.38.6.04","DOIUrl":null,"url":null,"abstract":"Purpose: This study attempted to compare the performance of an automated scoring model based on scoring features and three other automated scoring models that do not use the scoring feature. Methods: The data used in this study were 300 essays written by native English speakers in the tenth grade, which were divided into training and validating data at a ratio of 7:3. The RF model was used to predict the scores of the essays based on the analysis of 106 linguistic features extracted from Coh-Metrix. The accuracy of this model was compared to that of three deep learning models― RNN, LSTM, and GRU―which do not use those scoring features. Lastly, in the case of the RF model, scoring features mainly affecting accuracy prediction were further analyzed. Results: RNN which is a type of deep learning had an agreement with human scoring results ranging from .39 to .69 (mean=.58) in each rating domain, while LSTM showed an agreement between .59 and .72 (mean=.64), and GRU’s agreement was between .59 and .72 (mean=.64). However, when using the RF model based on the scoring features, its average accuracy was between .60 to .76 (mean=.69), which is the highest agreement among the four models. In the case of the RF model, ‘word count,’ ‘sentence count,’ and ‘vocabulary density’ were common characteristics across all scoring domains, but several scoring features reflecting the characteristics of each scoring domain were also found. Conclusion: When large-scale data is not available, the RF model showed relatively high automated scoring performance, and the RF model that provides learners with pedagogical feedback based on scoring features would be more useful compared to the other three automated scoring models in terms of the utilization of the automated scoring result.","PeriodicalId":22672,"journal":{"name":"The Journal of Korean Teacher Education","volume":"16 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Effects of Scoring Features on the Accuracy of the Automated Scoring Model of English\",\"authors\":\"Dongkwang Shin\",\"doi\":\"10.14333/kjte.2022.38.6.04\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Purpose: This study attempted to compare the performance of an automated scoring model based on scoring features and three other automated scoring models that do not use the scoring feature. Methods: The data used in this study were 300 essays written by native English speakers in the tenth grade, which were divided into training and validating data at a ratio of 7:3. The RF model was used to predict the scores of the essays based on the analysis of 106 linguistic features extracted from Coh-Metrix. The accuracy of this model was compared to that of three deep learning models― RNN, LSTM, and GRU―which do not use those scoring features. Lastly, in the case of the RF model, scoring features mainly affecting accuracy prediction were further analyzed. Results: RNN which is a type of deep learning had an agreement with human scoring results ranging from .39 to .69 (mean=.58) in each rating domain, while LSTM showed an agreement between .59 and .72 (mean=.64), and GRU’s agreement was between .59 and .72 (mean=.64). 
However, when using the RF model based on the scoring features, its average accuracy was between .60 to .76 (mean=.69), which is the highest agreement among the four models. In the case of the RF model, ‘word count,’ ‘sentence count,’ and ‘vocabulary density’ were common characteristics across all scoring domains, but several scoring features reflecting the characteristics of each scoring domain were also found. Conclusion: When large-scale data is not available, the RF model showed relatively high automated scoring performance, and the RF model that provides learners with pedagogical feedback based on scoring features would be more useful compared to the other three automated scoring models in terms of the utilization of the automated scoring result.\",\"PeriodicalId\":22672,\"journal\":{\"name\":\"The Journal of Korean Teacher Education\",\"volume\":\"16 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The Journal of Korean Teacher Education\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14333/kjte.2022.38.6.04\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Journal of Korean Teacher Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14333/kjte.2022.38.6.04","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Effects of Scoring Features on the Accuracy of the Automated Scoring Model of English
Purpose: This study compared the performance of an automated scoring model based on scoring features with that of three automated scoring models that do not use such features.

Methods: The data were 300 essays written by tenth-grade native English speakers, divided into training and validation sets at a ratio of 7:3. A random forest (RF) model was used to predict essay scores from 106 linguistic features extracted with Coh-Metrix. Its accuracy was compared with that of three deep learning models (RNN, LSTM, and GRU) that do not use those scoring features. Finally, for the RF model, the scoring features that most affected prediction accuracy were further analyzed.

Results: Agreement with human scoring across the rating domains ranged from .39 to .69 (mean = .58) for the RNN, from .59 to .72 (mean = .64) for the LSTM, and from .59 to .72 (mean = .64) for the GRU. The RF model based on scoring features, however, achieved agreement between .60 and .76 (mean = .69), the highest among the four models. For the RF model, 'word count,' 'sentence count,' and 'vocabulary density' were influential across all scoring domains, but several scoring features reflecting the characteristics of individual scoring domains were also found.

Conclusion: When large-scale data are unavailable, the RF model shows relatively high automated scoring performance, and because it can provide learners with pedagogical feedback based on its scoring features, it would be more useful than the other three automated scoring models in terms of utilizing the automated scoring results.
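As a concrete illustration of the feature-based pipeline described in the Methods, the sketch below trains a random forest on per-essay Coh-Metrix features with a 7:3 train/validation split and ranks the features by importance. This is not the authors' code: the file name, column names, hyperparameters, and the use of Pearson's r as the agreement index are all assumptions, since the abstract does not specify them.

```python
# A minimal sketch (not the authors' code) of an RF scoring pipeline:
# one row per essay, 106 Coh-Metrix feature columns, a 7:3 split,
# and feature importances to surface the influential scoring features.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr

# Hypothetical input file and column names.
df = pd.read_csv("essays_cohmetrix.csv")
feature_cols = [c for c in df.columns if c != "human_score"]
X, y = df[feature_cols].values, df["human_score"].values

# 7:3 split, matching the ratio reported in the Methods section.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

# Agreement with human scores, operationalized here as a Pearson
# correlation (the abstract does not name the agreement index used).
r, _ = pearsonr(rf.predict(X_val), y_val)
print(f"Agreement with human scoring: {r:.2f}")

# Rank scoring features by importance; in the study, word count,
# sentence count, and vocabulary density ranked highly in all domains.
importances = pd.Series(rf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(10))
```

A pipeline like this also suggests why the feature-based model supports pedagogical feedback: the ranked importances point to concrete, interpretable properties of an essay (e.g., length or lexical density), whereas the RNN, LSTM, and GRU models yield a score without such interpretable features.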