Deep Bidirectional Transformers for Arabic Dialect Identification

Proceedings of the 6th International Conference on Future Networks & Distributed Systems
Pub Date: 2022-12-15
DOI: 10.1145/3584202.3584243

Amal Alghamdi, Areej Alshutayri, Basma Alharbi



Abstract

The rising adoption of social media has led to the widespread dissemination of online textual data. Arabic is among the five most widely spoken languages worldwide, with about 360.2 million native speakers. Arabic text on social media is written in a range of dialects, such as Gulf, Iraqi, Egyptian, Levantine, and North African. Identifying the dialect used in a text is valuable for several natural language processing tasks, such as machine translation, text generation, word correction, and information retrieval. Arabic-dialect identification is a multiclass classification problem in which each class represents a different Arabic dialect. In this study, we investigated the performance of two bidirectional deep learning models for Arabic-dialect classification: MARBERT and ARBERT. We evaluated the models on two publicly available datasets: the Arabic Online Commentary dataset and the Social Media Arabic Dialect Corpus. Extensive experiments covered binary, three-way, and multi-way dialect classification. The results indicate that MARBERT consistently achieved higher F1-scores than ARBERT, which can be attributed to the significant differences between the two models, including their architectures, training mechanisms, and data sources.
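As an illustration of the metric named in the abstract, the sketch below computes per-class and macro-averaged F1 for a multiclass dialect labelling task. It is not the authors' evaluation code: the abstract does not say which F1 averaging was used, so macro averaging is an assumption here, and the function names (`per_class_f1`, `macro_f1`) and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1      # correct prediction for this class
        else:
            fp[pred] += 1      # predicted class gets a false positive
            fn[gold] += 1      # gold class gets a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical toy labels, not taken from either dataset in the paper.
gold = ["Gulf", "Egyptian", "Levantine", "Gulf", "Egyptian", "Iraqi"]
pred = ["Gulf", "Egyptian", "Gulf", "Gulf", "Levantine", "Iraqi"]

if __name__ == "__main__":
    for label, score in per_class_f1(gold, pred).items():
        print(f"{label}: {score:.3f}")
    print(f"macro F1: {macro_f1(gold, pred):.3f}")
```

Macro averaging weights every dialect equally regardless of how many examples it has, which matters when class sizes are unbalanced; a support-weighted average would be the other common choice.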
