A Comprehensive Understanding of Code-Mixed Language Semantics Using Hierarchical Transformer

IEEE Transactions on Computational Social Systems · IF 4.5 · Zone 2 (Computer Science) · JCR Q1 (Computer Science, Cybernetics) · Pub Date: 2024-03-25 · DOI: 10.1109/TCSS.2024.3360378

Tharun Suresh;Ayan Sengupta;Md Shad Akhtar;Tanmoy Chakraborty

IEEE Transactions on Computational Social Systems, vol. 11, no. 3, pp. 4139-4148. Publication type: Journal Article. https://ieeexplore.ieee.org/document/10477442/

Citations: 0

Abstract

A Comprehensive Understanding of Code-Mixed Language Semantics Using Hierarchical Transformer

Being a popular mode of text-based communication in multilingual communities, code mixing in online social media has become an important subject to study. Learning the semantics and morphology of code-mixed language remains a key challenge due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character, subword, and word-level embeddings, aiding in learning meaningful correlations. In this article, we explore a hierarchical transformer (HIT)-based architecture to learn the semantics of code-mixed languages. HIT consists of multiheaded self-attention (MSA) and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across six Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu, and Malayalam) and Spanish for nine tasks on 17 datasets. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models on 13 datasets across eight tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling (MLM)-based pretraining, zero-shot learning (ZSL), and transfer learning approaches. Our empirical results show that the pretraining objectives significantly improve the performance of downstream tasks.
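The MLM-based pretraining mentioned in the abstract works by hiding some input tokens and training the model to recover them. The sketch below illustrates only that masking step in plain Python. It is not the authors' implementation: the function name `mask_tokens`, the `[MASK]` symbol, the masking probability, and the example sentence are all assumptions made for illustration, since the abstract specifies none of them.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token wherever position i was masked,
    and None wherever the token was left untouched.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model is trained to predict this
        else:
            masked.append(tok)
            labels.append(None)   # no loss is computed at this position
    return masked, labels


# An illustrative romanized Hindi-English code-mixed sentence (made up).
sentence = "yaar this movie was bahut accha".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

A model pretrained this way would compute its loss only at positions where `labels` is not `None`. Common variants also replace some selected tokens with random tokens or leave them unchanged; whether HIT does so is not stated in the abstract.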


Source journal

IEEE Transactions on Computational Social Systems (Social Sciences: Social Sciences (miscellaneous))

CiteScore: 10.00

Self-citation rate: 20.00%

Articles published: 316

About the journal: IEEE Transactions on Computational Social Systems focuses on such topics as modeling, simulation, analysis and understanding of social systems from the quantitative and/or computational perspective. "Systems" include man-man, man-machine and machine-machine organizations and adversarial situations as well as social media structures and their dynamics. More specifically, the transactions publishes articles on modeling the dynamics of social systems, methodologies for incorporating and representing socio-cultural and behavioral aspects in computational modeling, analysis of social system behavior and structure, and paradigms for social systems modeling and simulation. The journal also features articles on social network dynamics, social intelligence and cognition, social systems design and architectures, socio-cultural modeling and representation, and computational behavior modeling, and their applications.