{"title":"利用异构语料库自动采样进行语法纠错","authors":"Shichang Zhu, Jianjian Liu, Ying Li, Zhengtao Yu","doi":"10.1007/s40747-024-01653-3","DOIUrl":null,"url":null,"abstract":"<p>Thanks to the strong representation capability of the pre-trained language models, supervised grammatical error correction has achieved promising performance. However, traditional model training depends significantly on the large scale of similar distributed samples. The model performance decreases sharply once the distributions of training and testing data are inconsistent. To address this issue, we propose an automatic sampling approach to effectively select high-quality samples from different corpora and filter out irrelevant or harmful ones. Concretely, we first provide a detailed analysis of error type and sentence length distributions on all datasets. Second, our corpus weighting approach is exploited to yield different weights for each sample automatically based on analysis results, thus emphasizing beneficial samples and ignoring the noisy ones. Finally, we enhance typical Seq2Seq and Seq2Edit grammatical error correction models with pre-trained language models and design a model ensemble algorithm for integrating the advantages of heterogeneous models and weighted samples. Experiments on the benchmark datasets demonstrate that the proper utilization of different corpora is extremely helpful in enhancing the accuracy of grammatical error correction. The detailed analysis gains more insights into the effect of different corpus weighting strategies.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":"63 1","pages":""},"PeriodicalIF":5.0000,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Automatical sampling with heterogeneous corpora for grammatical error correction\",\"authors\":\"Shichang Zhu, Jianjian Liu, Ying Li, Zhengtao Yu\",\"doi\":\"10.1007/s40747-024-01653-3\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Thanks to the strong representation capability of the pre-trained language models, supervised grammatical error correction has achieved promising performance. However, traditional model training depends significantly on the large scale of similar distributed samples. The model performance decreases sharply once the distributions of training and testing data are inconsistent. To address this issue, we propose an automatic sampling approach to effectively select high-quality samples from different corpora and filter out irrelevant or harmful ones. Concretely, we first provide a detailed analysis of error type and sentence length distributions on all datasets. Second, our corpus weighting approach is exploited to yield different weights for each sample automatically based on analysis results, thus emphasizing beneficial samples and ignoring the noisy ones. Finally, we enhance typical Seq2Seq and Seq2Edit grammatical error correction models with pre-trained language models and design a model ensemble algorithm for integrating the advantages of heterogeneous models and weighted samples. Experiments on the benchmark datasets demonstrate that the proper utilization of different corpora is extremely helpful in enhancing the accuracy of grammatical error correction. The detailed analysis gains more insights into the effect of different corpus weighting strategies.</p>\",\"PeriodicalId\":10524,\"journal\":{\"name\":\"Complex & Intelligent Systems\",\"volume\":\"63 1\",\"pages\":\"\"},\"PeriodicalIF\":5.0000,\"publicationDate\":\"2024-11-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Complex & Intelligent Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s40747-024-01653-3\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex & Intelligent Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s40747-024-01653-3","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Automatical sampling with heterogeneous corpora for grammatical error correction
Thanks to the strong representation capability of the pre-trained language models, supervised grammatical error correction has achieved promising performance. However, traditional model training depends significantly on the large scale of similar distributed samples. The model performance decreases sharply once the distributions of training and testing data are inconsistent. To address this issue, we propose an automatic sampling approach to effectively select high-quality samples from different corpora and filter out irrelevant or harmful ones. Concretely, we first provide a detailed analysis of error type and sentence length distributions on all datasets. Second, our corpus weighting approach is exploited to yield different weights for each sample automatically based on analysis results, thus emphasizing beneficial samples and ignoring the noisy ones. Finally, we enhance typical Seq2Seq and Seq2Edit grammatical error correction models with pre-trained language models and design a model ensemble algorithm for integrating the advantages of heterogeneous models and weighted samples. Experiments on the benchmark datasets demonstrate that the proper utilization of different corpora is extremely helpful in enhancing the accuracy of grammatical error correction. The detailed analysis gains more insights into the effect of different corpus weighting strategies.
期刊介绍:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.