Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers

IF 1.8 3区计算机科学 Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Language Resources and Evaluation Pub Date : 2024-08-18 DOI:10.1007/s10579-024-09750-y

Ibtissam Touahri, Azzeddine Mazroui

{"title":"Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers","authors":"Ibtissam Touahri, Azzeddine Mazroui","doi":"10.1007/s10579-024-09750-y","DOIUrl":null,"url":null,"abstract":"<p>Sentiment analysis is a task in natural language processing aiming to identify the overall polarity of reviews for subsequent analysis. This study used the Arabic speech-act and sentiment analysis, Arabic sentiment tweets dataset, and SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, the modern standard and dialectal Arabic corpus dataset has been created and annotated based on the three language types: modern standard Arabic, Moroccan Arabic Dialect, and Mixed Language. Additionally, the annotation has been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The sizes of the datasets range from 2000 to 21,000 reviews. The essential dialectal characteristics to enhance a sentiment classification system have been outlined. The proposed approach has involved deploying several models employing the supervised approach, including occurrence vectors, Recurrent Neural Network-Long Short Term Memory, and the pre-trained transformer model Arabic bidirectional encoder representations from transformers (AraBERT), complemented by the integration of Generative Adversarial Networks (GANs). The uniqueness of the proposed approach lies in constructing and annotating manually a dialectal sentiment corpus and studying carefully its main characteristics, which are used then to feed the classical supervised model. Moreover, GANs that widen the gap between the studied classes have been used to enhance the obtained results with AraBERT. The classification test results have been promising, enabling a comparison with other systems. The proposed system has been evaluated against Mazajak and CAMelTools state-of-the-art systems, designed for most Arabic dialects, using the mentioned datasets. A significant improvement of 30 points in F<sup>NN</sup> has been observed. These results have affirmed the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal, multi-domain datasets, as well as balanced and unbalanced ones.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"1 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09750-y","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}

引用次数: 0

Abstract

Sentiment analysis is a task in natural language processing aiming to identify the overall polarity of reviews for subsequent analysis. This study used the Arabic speech-act and sentiment analysis, Arabic sentiment tweets dataset, and SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, the modern standard and dialectal Arabic corpus dataset has been created and annotated based on the three language types: modern standard Arabic, Moroccan Arabic Dialect, and Mixed Language. Additionally, the annotation has been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The sizes of the datasets range from 2000 to 21,000 reviews. The essential dialectal characteristics to enhance a sentiment classification system have been outlined. The proposed approach has involved deploying several models employing the supervised approach, including occurrence vectors, Recurrent Neural Network-Long Short Term Memory, and the pre-trained transformer model Arabic bidirectional encoder representations from transformers (AraBERT), complemented by the integration of Generative Adversarial Networks (GANs). The uniqueness of the proposed approach lies in constructing and annotating manually a dialectal sentiment corpus and studying carefully its main characteristics, which are used then to feed the classical supervised model. Moreover, GANs that widen the gap between the studied classes have been used to enhance the obtained results with AraBERT. The classification test results have been promising, enabling a comparison with other systems. The proposed system has been evaluated against Mazajak and CAMelTools state-of-the-art systems, designed for most Arabic dialects, using the mentioned datasets. A significant improvement of 30 points in F^NN has been observed. These results have affirmed the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal, multi-domain datasets, as well as balanced and unbalanced ones.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

使用转换器对照基准数据集对阿拉伯语方言情感语料库进行注释和评估

情感分析是自然语言处理中的一项任务，旨在识别评论的整体极性，以便进行后续分析。本研究使用了阿拉伯语语音行为和情感分析、阿拉伯语情感推文数据集和 SemEval 基准数据集，以及侧重于摩洛哥方言的摩洛哥情感分析语料库。此外，还创建了现代标准和方言阿拉伯语语料库数据集，并根据现代标准阿拉伯语、摩洛哥阿拉伯方言和混合语言这三种语言类型进行了注释。此外，还在情感层面进行了注释，将情感分为正面、负面和混合情感。数据集的规模从 2000 到 21000 条评论不等。概述了增强情感分类系统的基本方言特征。所提出的方法涉及部署多个采用监督方法的模型，包括发生向量、循环神经网络-长短期记忆和预先训练的变压器模型阿拉伯变压器双向编码器表示法（AraBERT），并辅以生成对抗网络（GAN）的集成。所提议方法的独特之处在于通过手动方式构建和注释方言情感语料库，并仔细研究其主要特征，然后将其用于为经典监督模型提供信息。此外，还使用了 GANs 来拉大所研究类别之间的差距，以增强 AraBERT 所获得的结果。分类测试结果很不错，可以与其他系统进行比较。我们使用上述数据集，与 Mazajak 和 CAMelTools 这两个针对大多数阿拉伯语方言设计的最先进系统进行了评估。结果表明，FNN 明显提高了 30 个百分点。这些结果证实了所提系统的多功能性，证明了它在多方言、多领域数据集以及平衡和不平衡数据集上的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Language Resources and Evaluation 工程技术-计算机：跨学科应用

CiteScore

6.50

自引率

3.70%

发文量

审稿时长

>12 weeks

期刊介绍： Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.