Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction

Nankai Lin, Hongbin Zhang, Menglan Shen, Yu Wang, Shengyi Jiang, Aimin Yang

Computer Speech and Language, Volume 91, Article 101750 (published 2024-11-14). DOI: 10.1016/j.csl.2024.101750
Abstract
Grammatical error correction (GEC) is a challenging task for natural language processing. Many efforts to address GEC have been made for high-resource languages such as English and Chinese, but little work exists for low-resource languages because large annotated corpora are lacking. For low-resource languages, unsupervised GEC based on language-model scoring currently performs well, yet the use of pre-trained language models in this setting remains underexplored. This study proposes a BERT-based unsupervised GEC framework that primarily addresses word-level errors and treats GEC as a multi-class classification task. The framework contains three modules: a data flow construction module, a sentence perplexity scoring module, and an error detection and correction module. We propose a novel pseudo-perplexity scoring method to evaluate a sentence's probable correctness, and we construct a Tagalog corpus for Tagalog GEC research. The framework obtains competitive performance on the self-constructed Tagalog corpus and on an open-source Indonesian corpus, demonstrating that it is complementary to baseline methods for low-resource GEC tasks. Our corpus is available at https://github.com/GKLMIP/TagalogGEC.
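The abstract does not spell out the pseudo-perplexity formula, so the sketch below illustrates only the conventional BERT masked-LM scoring such methods build on: mask each token in turn, take the masked position's negative log-likelihood under the MLM head, and exponentiate the mean. This is a minimal sketch, not the paper's method; the checkpoint name and the example sentence are placeholder assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical checkpoint: the paper does not say which
# Tagalog-capable BERT variant it uses.
MODEL_NAME = "bert-base-multilingual-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask one token at a time and average the masked token's
    negative log-likelihood; exponentiate to get a perplexity-like score."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total_nll = 0.0
    n_tokens = 0
    # Skip [CLS] at position 0 and [SEP] at the last position.
    for pos in range(1, ids.size(0) - 1):
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        log_probs = torch.log_softmax(logits, dim=-1)
        total_nll += -log_probs[ids[pos]].item()
        n_tokens += 1
    return float(torch.exp(torch.tensor(total_nll / n_tokens)))

# Lower scores suggest a more fluent sentence, so a correction
# candidate can be accepted when it scores below the original.
print(pseudo_perplexity("Kumain ako ng mansanas."))
```

In an unsupervised GEC loop of the kind the abstract describes, a score like this would be computed for the original sentence and for each candidate correction, keeping the candidate only if it lowers the score.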
Journal introduction
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.