Mohamed Lichouri, Khaled Lounnas, Aicha Zitouni, H. Latrache, R. Djeradi
USTHB at NADI 2023 shared task: Exploring Preprocessing and Feature Engineering Strategies for Arabic Dialect Identification
DOI: 10.18653/v1/2023.arabicnlp-1.69 (https://doi.org/10.18653/v1/2023.arabicnlp-1.69)
ARABICNLP, pp. 647-651. Published 2023-12-16.
Abstract
In this paper, we analyze several key factors that influence the performance of Arabic dialect identification in the NADI 2023 shared task, focusing on the first subtask, country-level dialect identification. We examine the effects of surface preprocessing, morphological preprocessing, FastText vector models, and the weighted concatenation of TF-IDF features. For classification, we use a Linear Support Vector Classification (LSVC) model. In the evaluation phase, our system achieves an F_1 score of 62.51%. For comparison, the average F_1 score of the systems submitted for the first subtask is 72.91%, about 10 points higher.
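The weighted concatenation of TF-IDF features feeding a linear SVM can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' configuration: the choice of word-level and character-level views, the n-gram ranges, the weight values, the helper name `build_classifier`, and the toy sentences with their labels are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


def build_classifier(word_weight=1.0, char_weight=1.0):
    """Hypothetical helper: weighted TF-IDF views followed by a linear SVM."""
    features = FeatureUnion(
        transformer_list=[
            # Word-level TF-IDF (unigrams and bigrams; assumed range).
            ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
            # Character-level TF-IDF within word boundaries (assumed range).
            ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4))),
        ],
        # Each view is multiplied by its weight before the views are
        # concatenated column-wise into one sparse feature matrix.
        transformer_weights={"word": word_weight, "char": char_weight},
    )
    return Pipeline([("features", features), ("svm", LinearSVC())])


# Toy, hand-written examples for illustration only (not NADI data).
texts = [
    "ازيك عامل ايه",
    "انا عايز اروح دلوقتي",
    "مش عارف ليه كده",
    "واش راك لاباس",
    "راني رايح للدار ضرك",
    "ما علاباليش علاه",
]
labels = ["Egypt", "Egypt", "Egypt", "Algeria", "Algeria", "Algeria"]

model = build_classifier(word_weight=0.5, char_weight=1.0)
model.fit(texts, labels)
preds = model.predict(texts)
```

In practice the weights would be tuned on the development set, and preprocessing (surface or morphological) would be applied to `texts` before vectorization.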