ArbDialectID at MADAR Shared Task 1: Language Modelling and Ensemble Learning for Fine Grained Arabic Dialect Identification

WANLP@ACL 2019 Pub Date : 2019-08-01 DOI:10.18653/v1/W19-4632

K. Kwaik, Motaz Saad

Citations: 10

Abstract

In this paper, we present a dialect identification system (ArbDialectID) that competed in Task 1 of the MADAR shared task, MADAR Travel Domain Dialect Identification. We build a coarse-grained and a fine-grained identification model to predict the label (corresponding to a dialect of Arabic) of a given text. We build two language models by extracting features at two levels (words and characters). We first build a coarse identification model that classifies each sentence into one of six dialects, then use this label as a feature for the fine-grained model, which classifies the sentence among 26 dialects from different Arab cities. Finally, we apply an ensemble voting classifier to both sub-systems. Our system ranked 1st, achieving an F-score of 67.32%. Both the models and our feature engineering tools are made available to the research community.
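The pipeline described in the abstract (word- and character-level features, a coarse prediction fed as a feature into a fine-grained classifier, then voting) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the add-one-smoothed naive Bayes classifier, the choice of a word-level and a character-level fine-grained model as the two voters, soft voting by summed posteriors, the helper names, and the synthetic placeholder sentences and labels (`R1`, `C1`, and so on stand in for the six coarse dialects and 26 city dialects).

```python
import math
from collections import Counter, defaultdict

def word_feats(text):
    """Word-level features: one token per whitespace-separated word."""
    return ["w:" + w for w in text.split()]

def char_feats(text, max_n=3):
    """Character-level features: all 1..max_n grams of the space-padded text."""
    padded = " " + text + " "
    return ["c:" + padded[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(padded) - n + 1)]

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an illustrative stand-in)."""
    def fit(self, feature_lists, labels):
        self.priors = Counter(labels)
        self.counts = defaultdict(Counter)
        for feats, y in zip(feature_lists, labels):
            self.counts[y].update(feats)
        self.vocab = set().union(*(c.keys() for c in self.counts.values()))
        return self

    def log_probs(self, feats):
        n = sum(self.priors.values())
        scores = {}
        for y, prior in self.priors.items():
            denom = sum(self.counts[y].values()) + len(self.vocab)
            # Features never seen in training are skipped.
            scores[y] = math.log(prior / n) + sum(
                math.log((self.counts[y][f] + 1) / denom)
                for f in feats if f in self.vocab)
        return scores

    def predict(self, feats):
        scores = self.log_probs(feats)
        return max(scores, key=scores.get)

def softmax(scores):
    """Turn unnormalized log-scores into probabilities that sum to 1."""
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def train(data):
    """data: list of (sentence, coarse_label, fine_label) triples."""
    texts = [t for t, _, _ in data]
    coarse = NaiveBayes().fit([word_feats(t) + char_feats(t) for t in texts],
                              [c for _, c, _ in data])
    # The coarse model's own prediction becomes an extra feature for the fine models.
    tags = ["coarse:" + coarse.predict(word_feats(t) + char_feats(t)) for t in texts]
    fine_labels = [f for _, _, f in data]
    fine_word = NaiveBayes().fit(
        [word_feats(t) + [g] for t, g in zip(texts, tags)], fine_labels)
    fine_char = NaiveBayes().fit(
        [char_feats(t) + [g] for t, g in zip(texts, tags)], fine_labels)
    return coarse, fine_word, fine_char

def predict(models, text):
    """Return (coarse_label, fine_label); the fine label is a soft vote of two models."""
    coarse, fine_word, fine_char = models
    region = coarse.predict(word_feats(text) + char_feats(text))
    tag = "coarse:" + region
    votes = Counter()
    for model, feats in ((fine_word, word_feats(text)), (fine_char, char_feats(text))):
        for label, p in softmax(model.log_probs(feats + [tag])).items():
            votes[label] += p  # soft voting: sum the posteriors of both models
    return region, max(votes, key=votes.get)

# Synthetic placeholder data, not real dialect text.
TOY = [("aaa ccc", "R1", "C1"), ("aaa ddd", "R1", "C2"),
       ("bbb ccc", "R2", "C3"), ("bbb ddd", "R2", "C4")]
MODELS = train(TOY)
print(predict(MODELS, "aaa ddd"))  # ('R1', 'C2')
```

In this toy setup the first token determines the coarse label and the second token disambiguates the fine label, so the coarse tag carries information that the fine-grained models could not get from the second token alone. A real system would train on the MADAR corpus and would likely use a stronger classifier; this sketch only shows how the stages connect.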


