Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction

Impact Factor: 3.1 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Computer Speech and Language · Pub Date: 2024-11-14 · DOI: 10.1016/j.csl.2024.101750
Nankai Lin, Hongbin Zhang, Menglan Shen, Yu Wang, Shengyi Jiang, Aimin Yang
{"title":"Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction","authors":"Nankai Lin ,&nbsp;Hongbin Zhang ,&nbsp;Menglan Shen ,&nbsp;Yu Wang ,&nbsp;Shengyi Jiang ,&nbsp;Aimin Yang","doi":"10.1016/j.csl.2024.101750","DOIUrl":null,"url":null,"abstract":"<div><div>Grammatical error correction (GEC) is a challenging task for natural language processing techniques. Many efforts to address GEC have been made for high-resource languages such as English or Chinese. However, limited work has been done for low-resource languages because of the lack of large annotated corpora. In low-resource languages, the current unsupervised GEC based on language model scoring performs well. However, the pre-trained language model is still to be explored in this context. This study proposes a BERT-based unsupervised GEC framework that primarily addresses word-level errors, where GEC is viewed as a multi-class classification task. The framework contains three modules: a data flow construction module, a sentence perplexity scoring module, and an error detecting and correcting module. We propose a novel scoring method for pseudo-perplexity to evaluate a sentence’s probable correctness and construct a Tagalog corpus for Tagalog GEC research. It obtains competitive performance on the self-constructed Tagalog corpus and the open-source Indonesian corpus, and it demonstrates that our framework is complementary to the baseline methods for low-resource GEC tasks. Our corpus can be obtained from <span><span>https://github.com/GKLMIP/TagalogGEC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"91 ","pages":"Article 101750"},"PeriodicalIF":3.1000,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824001335","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Grammatical error correction (GEC) is a challenging task for natural language processing. Most efforts to address GEC have targeted high-resource languages such as English and Chinese; far less work exists for low-resource languages, which lack large annotated corpora. For low-resource languages, unsupervised GEC based on language-model scoring performs well, but pre-trained language models remain underexplored in this setting. This study proposes a BERT-based unsupervised GEC framework that primarily addresses word-level errors, treating GEC as a multi-class classification task. The framework contains three modules: a data-flow construction module, a sentence perplexity scoring module, and an error detection and correction module. We propose a novel pseudo-perplexity scoring method to evaluate how likely a sentence is to be correct, and we construct a Tagalog corpus for Tagalog GEC research. The framework obtains competitive performance on the self-constructed Tagalog corpus and an open-source Indonesian corpus, demonstrating that it complements baseline methods on low-resource GEC tasks. Our corpus can be obtained from https://github.com/GKLMIP/TagalogGEC.
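
The scoring module hinges on pseudo-perplexity: each token is masked in turn and a masked language model (MLM) scores how well it predicts the hidden token, so lower scores indicate more plausible sentences. Below is a minimal sketch of the standard masked-LM pseudo-perplexity computation; the paper proposes its own scoring variant, so this illustrates the general idea rather than the authors' exact method. The model name and the Tagalog example sentence are assumptions for illustration.

```python
# Minimal pseudo-perplexity sketch (standard baseline formulation, not the
# paper's exact scoring method). Requires: pip install torch transformers.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumption: any BERT-style MLM with Tagalog coverage
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

def pseudo_perplexity(sentence: str) -> float:
    """Mask each token in turn, average the negative log-likelihood the
    MLM assigns to the hidden token, and exponentiate. Lower is better."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
            nlls.append(-torch.log_softmax(logits, dim=-1)[input_ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical usage: a candidate correction is kept if it lowers the score.
print(pseudo_perplexity("Kumain ako ng mansanas."))  # illustrative Tagalog sentence
```

A detection-and-correction step could, for instance, mask each position, take the MLM's top candidates as the closed label set (matching the multi-class classification view) and accept a substitution only when it reduces the sentence's pseudo-perplexity; the paper's actual module design may differ.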
Source journal

Computer Speech and Language (Engineering & Technology · Computer Science: Artificial Intelligence)
CiteScore: 11.30 · Self-citation rate: 4.70% · Publication volume: 80 · Review time: 22.9 weeks

Journal introduction: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.