A Comprehensive Understanding of Code-Mixed Language Semantics Using Hierarchical Transformer

IF 4.5 · CAS Zone 2 (Computer Science) · Q1 COMPUTER SCIENCE, CYBERNETICS · IEEE Transactions on Computational Social Systems · Pub Date: 2024-03-25 · DOI: 10.1109/TCSS.2024.3360378
Tharun Suresh;Ayan Sengupta;Md Shad Akhtar;Tanmoy Chakraborty
{"title":"A Comprehensive Understanding of Code-Mixed Language Semantics Using Hierarchical Transformer","authors":"Tharun Suresh;Ayan Sengupta;Md Shad Akhtar;Tanmoy Chakraborty","doi":"10.1109/TCSS.2024.3360378","DOIUrl":null,"url":null,"abstract":"Being a popular mode of text-based communication in multilingual communities, code mixing in online social media has become an important subject to study. Learning the semantics and morphology of code-mixed language remains a key challenge due to the scarcity of data, the unavailability of robust, and language-invariant representation learning techniques. Any morphologically rich language can benefit from character, subword, and word-level embeddings, aiding in learning meaningful correlations. In this article, we explore a hierarchical transformer (HIT)-based architecture to learn the semantics of code-mixed languages. HIT consists of multiheaded self-attention (MSA) and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across six Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu, and Malayalam) and Spanish for nine tasks on 17 datasets. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models on 13 datasets across eight tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling (MLM)-based pretraining, zero-shot learning (ZSL), and transfer learning approaches. Our empirical results show that the pretraining objectives significantly improve the performance of downstream tasks.","PeriodicalId":13044,"journal":{"name":"IEEE Transactions on Computational Social Systems","volume":null,"pages":null},"PeriodicalIF":4.5000,"publicationDate":"2024-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computational Social Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10477442/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, CYBERNETICS","Score":null,"Total":0}
Citations: 0

Abstract

Being a popular mode of text-based communication in multilingual communities, code mixing in online social media has become an important subject of study. Learning the semantics and morphology of code-mixed language remains a key challenge due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character-, subword-, and word-level embeddings, which aid in learning meaningful correlations. In this article, we explore a hierarchical transformer (HIT)-based architecture to learn the semantics of code-mixed languages. HIT consists of multiheaded self-attention (MSA) and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across six Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu, and Malayalam) and Spanish on nine tasks over 17 datasets. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models on 13 datasets across eight tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling (MLM)-based pretraining, zero-shot learning (ZSL), and transfer learning approaches. Our empirical results show that the pretraining objectives significantly improve the performance on downstream tasks.
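The abstract describes HIT's encoder as a combination of multiheaded self-attention and outer product attention operating over character-, subword-, and word-level representations. As an illustration only, the following is a minimal PyTorch sketch of how such a block could be wired together; the class names, the element-wise form of the outer-product scoring, and the gated fusion step are assumptions made for this sketch, not the paper's actual formulation.

```python
# Hypothetical sketch of a HIT-style encoder block combining standard
# multi-headed self-attention (MSA) with an outer-product attention (OPA)
# component. Layer names, dimensions, and the fusion step are assumptions
# for illustration; see the paper for the actual HIT formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OuterProductAttention(nn.Module):
    """Scores token pairs via an element-wise (outer-product-style)
    interaction instead of a dot product (an assumption about OPA)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.score = nn.Linear(d_model, 1)  # reduce each pairwise interaction to a scalar

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Pairwise element-wise interactions: (batch, seq, seq, d_model)
        pair = q.unsqueeze(2) * k.unsqueeze(1)
        attn = F.softmax(self.score(pair).squeeze(-1), dim=-1)  # (batch, seq, seq)
        return attn @ v  # (batch, seq, d_model)


class HITBlock(nn.Module):
    """One encoder block fusing MSA and OPA outputs; the learned gated
    fusion is an assumption for this sketch."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.opa = OuterProductAttention(d_model)
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        msa_out, _ = self.msa(x, x, x)
        opa_out = self.opa(x)
        fused = self.gate(torch.cat([msa_out, opa_out], dim=-1))
        return self.norm(x + fused)


# A hierarchical stack would apply such blocks at the character, subword,
# and word levels, pooling each level's output to feed the next.
x = torch.randn(2, 16, 128)   # (batch, tokens, d_model)
print(HITBlock()(x).shape)    # torch.Size([2, 16, 128])
```

In a full hierarchical stack, a block like this would run once per granularity level, with each level's pooled output forming the token sequence for the next level.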
Source Journal
IEEE Transactions on Computational Social Systems (Social Sciences, miscellaneous)
CiteScore: 10.00
Self-citation rate: 20.00%
Publication volume: 316
About the Journal: IEEE Transactions on Computational Social Systems focuses on such topics as modeling, simulation, analysis, and understanding of social systems from the quantitative and/or computational perspective. "Systems" include man-man, man-machine, and machine-machine organizations and adversarial situations, as well as social media structures and their dynamics. More specifically, the transactions publishes articles on modeling the dynamics of social systems, methodologies for incorporating and representing socio-cultural and behavioral aspects in computational modeling, analysis of social system behavior and structure, and paradigms for social systems modeling and simulation. The journal also features articles on social network dynamics, social intelligence and cognition, social systems design and architectures, socio-cultural modeling and representation, computational behavior modeling, and their applications.
Latest Articles in This Journal
Table of Contents
Guest Editorial: Special Issue on Dark Side of the Socio-Cyber World: Media Manipulation, Fake News, and Misinformation
IEEE Transactions on Computational Social Systems Publication Information
IEEE Transactions on Computational Social Systems Information for Authors
IEEE Systems, Man, and Cybernetics Society Information