A Comprehensive Understanding of Code-Mixed Language Semantics Using Hierarchical Transformer
Authors: Tharun Suresh; Ayan Sengupta; Md Shad Akhtar; Tanmoy Chakraborty
Journal: IEEE Transactions on Computational Social Systems (JCR Q1, Computer Science, Cybernetics; impact factor 4.5)
Published: 2024-03-25
DOI: 10.1109/TCSS.2024.3360378
URL: https://ieeexplore.ieee.org/document/10477442/
Being a popular mode of text-based communication in multilingual communities, code mixing in online social media has become an important subject to study. Learning the semantics and morphology of code-mixed language remains a key challenge due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character, subword, and word-level embeddings, which aid in learning meaningful correlations. In this article, we explore a hierarchical transformer (HIT)-based architecture to learn the semantics of code-mixed languages. HIT consists of multiheaded self-attention (MSA) and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across six Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu, and Malayalam) and Spanish for nine tasks on 17 datasets. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models on 13 datasets across eight tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling (MLM)-based pretraining, zero-shot learning (ZSL), and transfer learning approaches. Our empirical results show that the pretraining objectives significantly improve the performance of downstream tasks.
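To make the masked language modeling (MLM) pretraining objective mentioned in the abstract more concrete, the following is a minimal sketch of how a training example might be prepared. It is not the authors' implementation: the `[MASK]` placeholder, the 15% mask rate, the whitespace tokenization, the function name `mask_tokens`, and the sample sentence are all assumptions chosen for illustration.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder token, not taken from the paper


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, labels) for one MLM training example.

    labels[i] holds the original token at masked positions and None elsewhere.
    At least one position is masked so every non-empty example has a target.
    The mask rate is an illustrative assumption.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)  # local generator keeps the example reproducible
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Hypothetical Hindi-English code-mixed sentence, whitespace-tokenized.
sentence = "yeh movie bahut achhi thi but ending was sad".split()
masked, labels = mask_tokens(sentence, mask_rate=0.15, seed=0)
print(masked)
print(labels)
```

During pretraining, a model would be asked to predict the tokens stored in `labels` from the surrounding context in `masked`; only the masked positions would contribute to the loss.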
About the journal:
IEEE Transactions on Computational Social Systems focuses on topics such as the modeling, simulation, analysis, and understanding of social systems from a quantitative and/or computational perspective. "Systems" include man-man, man-machine, and machine-machine organizations and adversarial situations, as well as social media structures and their dynamics. More specifically, the journal publishes articles on modeling the dynamics of social systems, methodologies for incorporating and representing socio-cultural and behavioral aspects in computational modeling, analysis of social system behavior and structure, and paradigms for social systems modeling and simulation. The journal also features articles on social network dynamics, social intelligence and cognition, social systems design and architectures, socio-cultural modeling and representation, and computational behavior modeling, and their applications.