Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data

IF 2 4区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE ACM Transactions on Asian and Low-Resource Language Information Processing Pub Date : 2024-04-27 DOI:10.1145/3661307

Niraj Pahari, Kazutaka Shimada

{"title":"Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data","authors":"Niraj Pahari, Kazutaka Shimada","doi":"10.1145/3661307","DOIUrl":null,"url":null,"abstract":"<p>Code-switching entails mixing multiple languages. It is an increasingly occurring phenomenon in social media texts. Usually, code-mixed texts are written in a single script, even though the languages involved have different scripts. Pre-trained multilingual models primarily utilize the data in the native script of the language. In existing studies, the code-switched texts are utilized as they are. However, using the native script for each language can generate better representations of the text owing to the pre-trained knowledge. Therefore, a cross-language-script knowledge sharing architecture utilizing the cross attention and alignment of the representations of text in individual language scripts was proposed in this study. Experimental results on two different datasets containing Nepali-English and Hindi-English code-switched texts, demonstrate the effectiveness of the proposed method. The interpretation of the model using model explainability technique illustrates the sharing of language-specific knowledge between language-specific representations.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"3 1","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3661307","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Code-switching entails mixing multiple languages. It is an increasingly occurring phenomenon in social media texts. Usually, code-mixed texts are written in a single script, even though the languages involved have different scripts. Pre-trained multilingual models primarily utilize the data in the native script of the language. In existing studies, the code-switched texts are utilized as they are. However, using the native script for each language can generate better representations of the text owing to the pre-trained knowledge. Therefore, a cross-language-script knowledge sharing architecture utilizing the cross attention and alignment of the representations of text in individual language scripts was proposed in this study. Experimental results on two different datasets containing Nepali-English and Hindi-English code-switched texts, demonstrate the effectiveness of the proposed method. The interpretation of the model using model explainability technique illustrates the sharing of language-specific knowledge between language-specific representations.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

分享您已掌握的知识：在混合代码数据中进行跨语言脚本转移和对齐以实现情感检测

代码转换是指混合使用多种语言。这种现象在社交媒体文本中越来越常见。通常情况下，代码混合文本是用单一脚本书写的，尽管涉及的语言有不同的脚本。预训练的多语言模型主要利用语言的母语脚本数据。在现有的研究中，代码混合文本是按原样使用的。然而，由于预先训练的知识，使用每种语言的母语脚本可以生成更好的文本表述。因此，本研究提出了一种跨语言脚本知识共享架构，该架构利用了各语言脚本中文本表征的交叉关注和对齐。在包含尼泊尔语-英语和印地语-英语代码转换文本的两个不同数据集上的实验结果证明了所提方法的有效性。使用模型可解释性技术对模型的解释说明了特定语言表征之间特定语言知识的共享。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

ACM Transactions on Asian and Low-Resource Language Information Processing Computer Science-General Computer Science

CiteScore

3.60

自引率

15.00%

发文量

241

期刊介绍： The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to: -Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc. -Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc. -Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition. -Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc. -Machine Translation involving Asian or low-resource languages. -Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc. -Information Extraction and Filtering: including automatic abstraction, user profiling, etc. -Speech processing: including text-to-speech synthesis and automatic speech recognition. -Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc. -Cross-lingual information processing involving Asian or low-resource languages. -Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.