Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT

Mendel, Pub Date: 2023-06-30, DOI: 10.13164/mendel.2023.1.055

Otman Moussaoui, Yacine El Younnoussi


Citations: 0

Abstract

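This article presents a comprehensive study on pre-training two language models for the Moroccan Dialect, MorRoBERTa and MorrBERT, using the Masked Language Modeling (MLM) approach. It details the data collection and pre-processing steps involved in building a large corpus of over six million sentences and 71 billion tokens, sourced from social media platforms such as Facebook, Twitter, and YouTube. Pre-training was carried out with the HuggingFace Transformers API, and the paper describes the configurations and training methodologies of the models. The study concludes by demonstrating that both MorRoBERTa and MorrBERT achieve high accuracy on multiple downstream tasks, indicating their potential effectiveness in natural language processing applications specific to the Moroccan Dialect.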
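To make the MLM objective concrete, here is a minimal sketch of the token-masking step in plain Python. It is an illustration only, not the authors' pipeline: the abstract does not state the masking rate or replacement rule, so the 15% rate and the 80/10/10 split below are assumptions borrowed from the original BERT recipe, and `mask_tokens`, `MASK_TOKEN`, and the toy vocabulary are hypothetical names. In the HuggingFace Transformers library this step is normally handled by `DataCollatorForLanguageModeling` (via its `mlm_probability` argument) rather than written by hand.

```python
import random

# Assumed placeholder; the actual mask token depends on the tokenizer in use.
MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking to a token list.

    Returns (masked_tokens, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    The 15% rate and 80/10/10 split are assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model is trained to recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)  # position is not part of the training loss
            masked.append(tok)
    return masked, labels


if __name__ == "__main__":
    # Toy placeholder tokens, not real corpus data.
    toy_tokens = ["t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"]
    toy_vocab = toy_tokens + ["t9", "t10"]
    masked, labels = mask_tokens(toy_tokens, toy_vocab, seed=0)
    print(masked)
    print(labels)
```

The loss is computed only at positions where the label is not `None`, which is why unselected positions carry no target. Passing a fixed `seed` makes the masking reproducible for debugging; during actual training the masking would typically be resampled on every pass.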




Source journal

Mendel, Decision Sciences: Decision Sciences (miscellaneous)

CiteScore

2.20

Self-citation rate

0.00%

Number of articles published