Multi-Lingual Language Variety Identification using Conventional Deep Learning and Transfer Learning Approaches

Sameeah Noreen Hameed, M. Ashraf, Yanan Qiao
Journal: Int. Arab J. Inf. Technol., volume 13 1, pages 705-712
Publication date: 2022-01-01
Publication type: Journal Article
DOI: 10.34028/iajit/19/5/1 (https://doi.org/10.34028/iajit/19/5/1)
Citation count: 1

Abstract

Language variety identification aims to detect lexical and semantic variations across different varieties of a single language. It helps build the linguistic profile of an author from written text, which can be used for cyber forensics and marketing purposes. Reviewing previous work on language variety identification, we find hardly any study that experiments with transfer learning approaches and/or performs a thorough comparison of different deep learning approaches on a range of benchmark datasets. To bridge this gap, we propose transfer learning approaches for language variety identification and compare them extensively with deep learning approaches on multiple varieties of four widely spoken languages: Arabic, English, Portuguese, and Spanish. We treat the task as a binary classification problem for Portuguese and as a multi-class classification problem for Arabic, English, and Spanish. We apply two transfer learning approaches, Bidirectional Encoder Representations from Transformers (BERT) and Universal Language Model Fine-tuning (ULMFiT); three deep learning approaches, Convolutional Neural Networks (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and Gated Recurrent Units (GRU); and an ensemble approach to identify the different varieties. The comparison suggests that the transfer learning based ULMFiT model outperforms all other approaches and produces the best accuracy for both binary and multi-class language variety identification tasks.
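
The abstract mentions an ensemble approach but does not say how the individual models' outputs are combined. As a rough illustration only, the sketch below assumes hard (majority) voting over per-model label predictions. The function names `majority_vote` and `accuracy`, the placeholder variety labels, and the toy predictions are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model label predictions by hard (majority) voting.

    predictions[m][i] is the label model m assigns to text i.
    Assumption: ties go to the label predicted by the earliest-listed model.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_texts = len(predictions[0])
    if any(len(p) != n_texts for p in predictions):
        raise ValueError("all models must label the same number of texts")

    combined = []
    for i in range(n_texts):
        votes = [model_preds[i] for model_preds in predictions]
        counts = Counter(votes)
        top = max(counts.values())
        # The first label (in model order) that reaches the top count wins.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of texts whose predicted variety matches the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy binary example with placeholder variety labels (hypothetical data).
cnn_preds = ["variety_A", "variety_B", "variety_A", "variety_B"]
bilstm_preds = ["variety_A", "variety_A", "variety_A", "variety_B"]
gru_preds = ["variety_B", "variety_B", "variety_A", "variety_A"]
gold_labels = ["variety_A", "variety_B", "variety_A", "variety_A"]

ensemble_preds = majority_vote([cnn_preds, bilstm_preds, gru_preds])
ensemble_accuracy = accuracy(ensemble_preds, gold_labels)
```

The same function works unchanged for the multi-class setting, since it only counts label strings. Soft voting over predicted probabilities would be an equally plausible reading of "ensemble" and would need each model's class scores instead of its labels.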