Deep Bidirectional Transformers for Arabic Dialect Identification
Amal Alghamdi, Areej Alshutayri, Basma Alharbi
Proceedings of the 6th International Conference on Future Networks & Distributed Systems, 2022-12-15
DOI: 10.1145/3584202.3584243 (https://doi.org/10.1145/3584202.3584243)
Abstract
The rising adoption of social media has led to the widespread dissemination of online textual data. Arabic is among the five most widely spoken languages worldwide, with about 360.2 million native speakers. Arabic text on social media is written in a range of dialects, such as Gulf, Iraqi, Egyptian, Levantine, and North African. Identifying the dialect used in a text is of significant value for several natural language processing tasks, such as machine translation, text generation, word correction, and information retrieval. Arabic-dialect identification is a multiclass classification problem in which each class represents a different Arabic dialect. In this study, we investigated the performance of two bidirectional deep learning models for Arabic-dialect classification: MARBERT and ARBERT. We evaluated the models on two publicly available datasets: the Arabic Online Commentary dataset and the Social Media Arabic Dialect Corpus. Extensive experiments covered binary, three-way, and multi-way dialect classification. The results indicate that MARBERT consistently achieved higher F1-scores than ARBERT, which can be attributed to the significant differences between the two models, including their architectures, training mechanisms, and data sources.
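
The abstract compares the models by F1-score on a multiclass task but does not say how per-class scores are aggregated. As an illustration only, the sketch below computes per-class F1 and their unweighted (macro) average in plain Python. The macro averaging, the function names `f1_per_class` and `macro_f1`, the dialect codes, and the toy labels are all assumptions made for this example, not details or data from the paper.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1      # correct prediction for this class
        else:
            fp[pred] += 1      # predicted class gains a false positive
            fn[gold] += 1      # gold class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical three-way example; labels are invented for illustration.
gold = ["GLF", "EGY", "LEV", "EGY", "GLF", "LEV"]
pred = ["GLF", "EGY", "EGY", "EGY", "LEV", "LEV"]

print(f1_per_class(gold, pred))
print(round(macro_f1(gold, pred), 4))
```

Macro averaging weights every dialect equally regardless of how many examples it has, which matters when class sizes are imbalanced; a support-weighted average would be the alternative if the larger classes should dominate the score.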