ViDeBERTa: A powerful pre-trained language model for Vietnamese

Findings (Sydney (N.S.W.)) · Pub Date: 2023-01-25 · DOI: 10.48550/arXiv.2301.10439

Cong Dao Tran, Nhut Huy Pham, Anh-Viêt Nguyên, T. Hy, Tu Vu

Pages: 1041-1048 · Publication type: Journal Article

Citations: 1

Abstract

This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese. It comes in three versions (ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large), all pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using the DeBERTa architecture. Although many successful Transformer-based pre-trained language models have been proposed for English, there are still few pre-trained models for Vietnamese, a low-resource language, that achieve good results on downstream tasks, especially question answering. We fine-tune and evaluate our models on three important downstream tasks: part-of-speech tagging, named-entity recognition, and question answering. The empirical results demonstrate that ViDeBERTa, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base has 86M parameters, only about 23% of the 370M parameters of PhoBERT_large, yet it matches or outperforms the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.
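The "about 23%" figure in the abstract follows directly from the two parameter counts it quotes. The short sketch below reproduces that arithmetic; the function and constant names are illustrative choices made here, not identifiers from the ViDeBERTa repository, and the counts are the rounded values stated in the abstract.

```python
def parameter_ratio(small_params: float, large_params: float) -> float:
    """Return small_params expressed as a fraction of large_params."""
    if large_params <= 0:
        raise ValueError("large_params must be positive")
    return small_params / large_params


# Rounded parameter counts as quoted in the abstract (hypothetical names).
VIDEBERTA_BASE_PARAMS = 86e6
PHOBERT_LARGE_PARAMS = 370e6

ratio = parameter_ratio(VIDEBERTA_BASE_PARAMS, PHOBERT_LARGE_PARAMS)
print(f"ViDeBERTa_base / PhoBERT_large = {ratio:.1%}")  # prints 23.2%
```

Because both inputs are rounded to the nearest million, the result is only meaningful to about two significant figures, which is consistent with the abstract's "about 23%".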



