Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT
Otman Moussaoui, Yacine El Younnoussi
Mendel, published 2023-06-30. Journal article.
DOI: 10.13164/mendel.2023.1.055 (https://doi.org/10.13164/mendel.2023.1.055)
Abstract
This research article presents a comprehensive study on the pre-training of two language models, MorRoBERTa and MorrBERT, for the Moroccan Dialect, using the Masked Language Modeling (MLM) pre-training approach. The study details the various data collection and pre-processing steps involved in building a large corpus of over six million sentences and 71 billion tokens, sourced from social media platforms such as Facebook, Twitter, and YouTube. The pre-training process was carried out using the HuggingFace Transformers API, and the paper elaborates on the configurations and training methodologies of the models. The study concludes by demonstrating the high accuracy rates achieved by both MorRoBERTa and MorrBERT in multiple downstream tasks, indicating their potential effectiveness in natural language processing applications specific to the Moroccan Dialect.
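To make the Masked Language Modeling objective named in the abstract concrete, here is a minimal, library-free sketch of how a training example can be corrupted. It assumes the standard BERT recipe (select about 15% of tokens; of those, replace 80% with a mask token, 10% with a random vocabulary token, and leave 10% unchanged). The abstract does not state which rates or masking strategy the authors used, so these values, the `mask_tokens` helper, the `MASK` symbol, and the toy vocabulary are all illustrative assumptions, not the paper's configuration.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; RoBERTa-style tokenizers use "<mask>" instead


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list for MLM training (BERT-style 80/10/10 rule).

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict and None at all other positions.
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: left unchanged
        else:
            labels.append(None)  # position ignored by the loss
            corrupted.append(tok)
    return corrupted, labels


# Placeholder tokens only; these are not drawn from the paper's corpus.
toy_vocab = ["tok%d" % i for i in range(20)]
toy_sentence = toy_vocab[:10]
corrupted, labels = mask_tokens(toy_sentence, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

The model is trained to recover `labels` from `corrupted` at the selected positions only. In practice the HuggingFace Transformers library provides this corruption step through its data collators, so a hand-written version like this one serves only to show the idea.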