Multi-Lingual Language Variety Identification using Conventional Deep Learning and Transfer Learning Approaches

Int. Arab J. Inf. Technol. · Pub Date: 2022-01-01 · DOI: 10.34028/iajit/19/5/1

Sameeah Noreen Hameed, M. Ashraf, Yanan Qiao

Pages: 705-712 · Publication type: Journal Article

Citations: 1

Abstract

Language variety identification aims to identify lexical and semantic variations among the varieties of a single language. It helps build the linguistic profile of an author from written text, which can be used for cyber forensics and marketing purposes. Reviewing previous work on language variety identification, we found hardly any study that experiments with transfer learning approaches and/or performs a thorough comparison of different deep learning approaches on a range of benchmark datasets. To bridge this gap, we propose transfer learning approaches for language variety identification and compare them extensively with deep learning approaches on multiple varieties of four widely spoken languages: Arabic, English, Portuguese, and Spanish. We treat the task as a binary classification problem for Portuguese and as a multi-class classification problem for Arabic, English, and Spanish. We applied two transfer learning approaches, Bidirectional Encoder Representations from Transformers (BERT) and Universal Language Model Fine-tuning (ULMFiT); three deep learning approaches, Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and Gated Recurrent Units (GRU); and an ensemble approach for identifying different varieties. The comparison suggests that the transfer learning based ULMFiT model outperforms all other approaches and produces the best accuracy for both the binary and the multi-class language variety identification tasks.
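
The abstract does not say how the ensemble combines its member models. As a minimal sketch, assuming simple majority voting over the label predictions of three classifiers (one common option, not necessarily the authors' method), the combination step could look like the following. The function names, the variety labels `pt-BR` and `pt-PT`, and the prediction lists are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label predictions by majority vote.

    predictions[m][i] is the label that model m assigns to text i.
    Ties go to the label voted for by the earliest model in the list.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_texts = len(predictions[0])
    if any(len(p) != n_texts for p in predictions):
        raise ValueError("every model must label the same number of texts")

    combined = []
    for votes in zip(*predictions):  # votes = one label per model for a single text
        counts = Counter(votes)
        top = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(label for label in votes if counts[label] == top))
    return combined


def accuracy(predicted, gold):
    """Fraction of texts whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical predictions for four texts in a binary (two-variety) setting.
cnn_preds = ["pt-BR", "pt-PT", "pt-BR", "pt-PT"]
bilstm_preds = ["pt-BR", "pt-BR", "pt-BR", "pt-PT"]
gru_preds = ["pt-PT", "pt-PT", "pt-BR", "pt-BR"]
gold_labels = ["pt-BR", "pt-PT", "pt-BR", "pt-PT"]

ensemble_preds = majority_vote([cnn_preds, bilstm_preds, gru_preds])
print(ensemble_preds, accuracy(ensemble_preds, gold_labels))
```

The same function works unchanged for the multi-class setting, since it only counts label occurrences; with an even number of models or many classes, the tie-breaking rule starts to matter and would need to be chosen deliberately.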


