Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction

Nankai Lin, Hongbin Zhang, Menglan Shen, Yu Wang, Shengyi Jiang, Aimin Yang

Computer Speech and Language, Volume 91, Article 101750 (published 2024-11-14). DOI: 10.1016/j.csl.2024.101750
Abstract
Grammatical error correction (GEC) is a challenging task for natural language processing. Many efforts to address GEC have been made for high-resource languages such as English and Chinese, but little work exists for low-resource languages because large annotated corpora are lacking. For low-resource languages, unsupervised GEC based on language-model scoring currently performs well, yet the use of pre-trained language models in this setting remains underexplored. This study proposes a BERT-based unsupervised GEC framework that primarily addresses word-level errors and treats GEC as a multi-class classification task. The framework contains three modules: a data flow construction module, a sentence perplexity scoring module, and an error detection and correction module. We propose a novel pseudo-perplexity scoring method to evaluate a sentence's probable correctness, and we construct a Tagalog corpus for Tagalog GEC research. The framework obtains competitive performance on the self-constructed Tagalog corpus and on an open-source Indonesian corpus, demonstrating that it is complementary to baseline methods for low-resource GEC tasks. Our corpus is available at https://github.com/GKLMIP/TagalogGEC.
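The abstract does not spell out the pseudo-perplexity formula, so the sketch below illustrates only the conventional BERT masked-LM scoring such methods build on: mask each token in turn, take the masked position's negative log-likelihood under the MLM head, and exponentiate the mean. This is a minimal sketch, not the paper's method; the checkpoint name and the example sentence are placeholder assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical checkpoint: the paper does not say which
# Tagalog-capable BERT variant it uses.
MODEL_NAME = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask one token at a time and average the masked token's
    negative log-likelihood; exponentiate to get a perplexity-like score."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total_nll = 0.0
    n_tokens = 0
    # Skip [CLS] at position 0 and [SEP] at the last position.
    for pos in range(1, ids.size(0) - 1):
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        total_nll += -log_probs[ids[pos]].item()
        n_tokens += 1
    return float(torch.exp(torch.tensor(total_nll / n_tokens)))

# Lower scores suggest a more fluent sentence, so a correction
# candidate can be accepted when it scores below the original.
print(pseudo_perplexity("Kumain ako ng mansanas."))
```

In an unsupervised GEC loop of the kind the abstract describes, a score like this would be computed for the original sentence and for each candidate correction, keeping the candidate only if it lowers the score.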
Journal introduction
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.