ACM Transactions on Asian and Low-Resource Language Information Processing: Latest Articles, Page 2

Towards Better Quantity Representations for Solving Math Word Problems

IF 2 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-18 DOI: 10.1145/3665644

Runxin Sun, Shizhu He, Jun Zhao, Kang Liu

Solving a math word problem requires selecting quantities in it and performing appropriate arithmetic operations to obtain the answer. For deep learning-based methods, it is vital to obtain good quantity representations, i.e., to selectively and emphatically aggregate information in the context of quantities. However, existing works have not paid much attention to this aspect. Many works simply encode quantities as ordinary tokens, or use some implicit or rule-based methods to select information in their context. This leads to poor results when dealing with linguistic variations and confounding quantities. This paper proposes a novel method to identify question-related distinguishing features of quantities by contrasting their context with the question and the context of other quantities, thereby enhancing the representation of quantities. Our method not only considers the contrastive relationship between quantities, but also considers multiple relationships jointly. In addition, we propose two auxiliary tasks to further guide the representation learning of quantities: 1) predicting whether a quantity is used in the question; 2) predicting the relations (operators) between quantities given the question. Experimental results show that our method outperforms previous methods on SVAMP and ASDiv-A under similar settings, including some newly released strong baselines. Supplementary experiments further confirm that our method indeed improves the performance of quantity selection by improving the representation of both quantities and questions.

Citations: 0

Abusive Language Detection in Khasi Social Media Comments

IF 2 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-14 DOI: 10.1145/3664285

Arup Baruah, Lakhamti Wahlang, Firstbornson Jyrwa, Floriginia Shadap, Ferdous Barbhuiya, Kuntal Dey

This paper describes work on automated abusive language detection in Khasi, a low-resource language spoken primarily in the state of Meghalaya, India. A dataset named the Khasi Abusive Language Dataset (KALD) was created, consisting of 4,573 human-annotated Khasi YouTube and Facebook comments. A corpus of Khasi text was built and used to create Khasi word2vec and fastText word embeddings. Deep learning, traditional machine learning, and ensemble models were used in the study. Experiments were performed using word2vec, fastText, and topic vectors obtained using LDA. Experiments were also performed to check whether the zero-shot cross-lingual nature of language models such as LaBSE and LASER can be utilized for abusive language detection in Khasi. The best F1 score of 0.90725 was obtained by an XGBoost classifier. After feature selection and rebalancing of the dataset, F1 scores of 0.91828 and 0.91945 were obtained by SVM-based classifiers.
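
For readers less familiar with the metric reported above, the F1 score is the harmonic mean of precision and recall. The sketch below is a minimal illustration for the binary case only. The function name and the confusion counts are made up for this example and are not taken from the KALD experiments, and the abstract does not state which averaging scheme (binary, macro, or weighted) the authors used.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Binary F1 from true positives, false positives, and false negatives."""
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = f1_score(tp=90, fp=10, fn=10)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier cannot reach a high F1 by maximizing precision or recall alone, which is one reason the metric is commonly preferred over accuracy on imbalanced data.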

Citations: 0

Marathi to Indian Sign Language Machine Translation

IF 2 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-13 DOI: 10.1145/3664609

Suvarna Rajesh. Bhagwat, R. P. Bhavsar, B. V. Pawar

Machine translation has been a prominent field of research, contributing significantly to human life enhancement. Sign language machine translation, a subfield, focuses on translating spoken language content into sign language and vice versa, thereby facilitating communication between the normal hearing and hard-of-hearing communities, promoting inclusivity.

This study presents the development of a ‘sign language machine translation system’ converting simple Marathi sentences into Indian Sign Language (ISL) glosses and animation. Given the low-resource nature of both languages, a phrase-level rule-based approach was employed for the translation. Initial encoding of translation rules relied on basic linguistic knowledge of Marathi and ISL, with subsequent incorporation of rules to address ‘simultaneous morphological’ features in ISL. These rules were applied during the ‘generation phase’ of translation to dynamically adjust phonological sign parameters, resulting in improved target sentence fluency.

The paper provides a detailed description of the system architecture, translation rules, and comprehensive experimentation. Rigorous evaluation efforts were undertaken, encompassing various linguistic features, and the findings are discussed herein.

The web-based version of the system serves as an interpreter for brief communications and can support the teaching and learning of sign language and its grammar in schools for hard-of-hearing students.

Citations: 0

Multi Task Learning Based Shallow Parsing for Indian Languages

IF 2 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-11 DOI: 10.1145/3664620

Pruthwik Mishra, Vandan Mujadia

Shallow parsing is an important step for many Natural Language Processing tasks. Although shallow parsing has a rich history for resource-rich languages, this is not the case for most Indian languages. Shallow parsing consists of POS tagging and chunking. Our study focuses on developing shallow parsers for Indian languages. As part of shallow parsing, we included morph analysis as well.

For the study, we first consolidated available shallow parsing corpora for 7 Indian languages (Hindi, Kannada, Bangla, Malayalam, Marathi, Urdu, Telugu) for which treebanks are publicly available. We then trained models to achieve state-of-the-art performance for shallow parsing in these languages for multiple domains. Since analyzing the performance of model predictions at the sentence level is more realistic, we report the performance of these shallow parsers not only at the token level, but also at the sentence level. We also present machine learning techniques for multitask shallow parsing. Our experiments show that fine-tuned contextual embeddings with multi-task learning improve the performance of multiple as well as individual shallow parsing tasks across different domains. We show the transfer learning capability of these models by creating shallow parsers (only with POS and Chunk) for Gujarati, Odia, and Punjabi, for which no treebanks are available.

As a part of this work, we will be releasing the Indian Languages Shallow Linguistic (ILSL) benchmarks for 10 Indian languages, covering both major language families, Indo-Aryan and Dravidian. These benchmarks are intended as common building blocks for evaluating and understanding various linguistic phenomena found in Indian languages and how well newer approaches can tackle them.
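
The distinction between token-level and sentence-level evaluation mentioned above can be illustrated with a short sketch. It assumes the common convention that a sentence counts as correct only if every one of its tags is correct. The abstract does not spell out the authors' exact definition, and the function name and the tag sequences below are invented for illustration.

```python
from typing import List, Tuple


def token_and_sentence_accuracy(
    gold: List[List[str]], pred: List[List[str]]
) -> Tuple[float, float]:
    """Return (token-level accuracy, sentence-level accuracy).

    A sentence is counted as correct only when all of its tags match
    (an assumed convention, not confirmed by the source).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    total_tokens = correct_tokens = correct_sentences = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("each sentence must have equally many gold and predicted tags")
        matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
        total_tokens += len(gold_tags)
        correct_tokens += matches
        if matches == len(gold_tags):
            correct_sentences += 1
    if not gold or total_tokens == 0:
        return 0.0, 0.0
    return correct_tokens / total_tokens, correct_sentences / len(gold)


# Hypothetical POS tags: one wrong tag in the second sentence.
gold = [["DT", "NN", "VB"], ["PRP", "VB"]]
pred = [["DT", "NN", "VB"], ["PRP", "NN"]]
token_acc, sentence_acc = token_and_sentence_accuracy(gold, pred)
```

In this toy case a single tagging error leaves token accuracy at 4/5 but halves sentence accuracy, which shows why the sentence-level figure is the stricter of the two.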

Citations: 0

Enhancing Chinese Event Extraction with Event Trigger Structures

IF 2 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-07 DOI: 10.1145/3663567

Fei Li, Kaifang Deng, Yiwen Mo, Yuanze Ji, Chong Teng, Donghong Ji

The dependency syntactic structure is widely used in event extraction. However, the dependency structure, which reflects syntactic features, is essentially different from the event structure, which reflects semantic features, and this mismatch degrades performance. In this paper, we propose Event Trigger Structure for Event Extraction (ETSEE), which can compensate for the inconsistency between the two structures. First, we use the ACE2005 dataset as a case study and annotate 3 kinds of Event Trigger Structures (ETSs), i.e., “light verb + trigger”, “preposition structures” and “tense + trigger”. Then we design a graph-based event extraction model that jointly identifies triggers and arguments, where the graph consists of both the dependency structure and ETSs. Experiments show that our model significantly outperforms the state-of-the-art methods. Through empirical analysis and manual observation, we find that the ETSs can bring the following benefits: (1) enriching trigger identification features by introducing structural event information; (2) enriching dependency structures with event semantic information; (3) enhancing the interactions between triggers and candidate arguments by shortening their distances in the dependency graph.
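
Benefit (3) above, shorter trigger-to-argument distances, can be illustrated with a breadth-first search over a small undirected graph. This is only a sketch: the node names, the toy dependency chain, and the extra edge standing in for an ETS are all invented here, and the abstract does not describe how the authors actually construct their graph.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Set, Tuple


def shortest_distance(
    edges: Iterable[Tuple[str, str]], source: str, target: str
) -> Optional[int]:
    """Number of edges on the shortest undirected path, or None if unreachable."""
    adjacency: Dict[str, Set[str]] = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical dependency chain: trigger - w1 - w2 - argument.
dependency_edges = [("trigger", "w1"), ("w1", "w2"), ("w2", "argument")]
# Hypothetical extra edge standing in for an event trigger structure.
ets_edges = [("trigger", "w2")]

before = shortest_distance(dependency_edges, "trigger", "argument")
after = shortest_distance(dependency_edges + ets_edges, "trigger", "argument")
```

In a graph neural network, information typically travels one hop per layer, so a shorter path means the trigger and a candidate argument can interact in fewer layers.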

Citations: 0

A Hybrid Deep BiLSTM-CNN for Hate Speech Detection in Multi-social media

IF 2 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-06 DOI: 10.1145/3657635

Ashwini Kumar, Santosh Kumar, Kalpdrum Passi, Aniket Mahanti

Nowadays, ways of communication among people have changed due to advancements in information technology and the rise of online multi-social media. Many people express their feelings, ideas, and emotions on social media sites such as Instagram, Twitter, Gab, Reddit, Facebook, and YouTube. However, people have misused social media to send hateful messages to specific individuals or groups to create chaos. For governance authorities trying to prevent such chaos, manually identifying hate speech across social media platforms is a difficult task. This study proposes a hybrid deep-learning model in which bidirectional long short-term memory (BiLSTM) and a convolutional neural network (CNN) are used to classify hate speech in textual data. The model incorporates a GloVe-based word embedding approach, dropout, L2 regularization, and global max pooling. The proposed BiLSTM-CNN model was evaluated on various datasets and achieved state-of-the-art performance, outperforming traditional and existing machine learning methods in terms of accuracy, precision, recall, and F1-score.

Citations: 0

UrduAspectNet: Fusing Transformers and Dual GCN for Urdu Aspect-Based Sentiment Detection

IF 2 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-04 DOI: 10.1145/3663367

Kamran Aziz, Aizihaierjiang Yusufu, Jun Zhou, Donghong Ji, Muhammad Shahid Iqbal, Shijie Wang, Hassan Jalil Hadi, Zhengming Yuan

Urdu, characterized by its intricate morphological structure and linguistic nuances, presents distinct challenges in computational sentiment analysis. Addressing these, we introduce “UrduAspectNet”, a dedicated model tailored for Aspect-Based Sentiment Analysis (ABSA) in Urdu. Central to our approach is a rigorous preprocessing phase. Leveraging the Stanza library, we extract Part-of-Speech (POS) tags and lemmas, ensuring Urdu’s linguistic intricacies are aptly represented. To probe the effectiveness of different embeddings, we trained our model using both mBERT and XLM-R embeddings, comparing their performances to identify the most effective representation for Urdu ABSA. Recognizing the nuanced inter-relationships between words, especially in Urdu’s flexible syntactic constructs, our model incorporates a dual Graph Convolutional Network (GCN) layer. Addressing the challenge of the absence of a dedicated Urdu ABSA dataset, we curated our own, collecting over 4,603 news headlines from various domains, such as politics, entertainment, business, and sports. These headlines, sourced from diverse news platforms, not only identify prevalent aspects but also pinpoint their sentiment polarities, categorized as positive, negative, or neutral. Despite the inherent complexities of Urdu, such as its colloquial expressions and idioms, “UrduAspectNet” showcases remarkable efficacy. Initial comparisons between mBERT and XLM-R embeddings integrated with dual GCN provide valuable insights into their respective strengths in the context of Urdu ABSA. With broad applications spanning media analytics, business insights, and socio-cultural analysis, “UrduAspectNet” is positioned as a pivotal benchmark in Urdu ABSA research.

Citations: 0

Integrated End-to-End automatic speech recognition for languages for agglutinative languages

IF 2 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-03 DOI: 10.1145/3663568

Akbayan Bekarystankyzy, Orken Mamyrbayev, Tolganay Anarbekova

The relevance of the problem of automatic speech recognition lies in the lack of research for low-resource languages, stemming from limited training data and the necessity for new technologies to enhance efficiency and performance. The purpose of this work was to study the main aspects of integrated end-to-end speech recognition and the use of modern technologies in the natural language processing of agglutinative languages, including Kazakh. In this article, the study of language models was carried out using a combination of comparative, graphic, statistical and analytical-synthetic methods. This paper addresses automatic speech recognition (ASR) in agglutinative languages, particularly Kazakh, through a unified neural network model that integrates both acoustic and language modeling. Employing advanced techniques like connectionist temporal classification and attention mechanisms, the study focuses on effective speech-to-text transcription for languages with complex morphologies. Transfer learning from high-resource languages helps mitigate data scarcity in languages such as Kazakh, Kyrgyz, Uzbek, Turkish, and Azerbaijani. The research assesses model performance, underscores ASR challenges, and proposes advancements for these languages. It includes a comparative analysis of phonetic and word-formation features in agglutinative Turkic languages, using statistical data. The findings aid further research in linguistics and technology for enhancing speech recognition and synthesis, contributing to voice identification and automation processes.
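
As background on connectionist temporal classification (CTC), mentioned above: a CTC model emits one label per audio frame, including a special blank, and the transcript is recovered by merging consecutive repeats and then dropping blanks. The sketch below shows only that collapsing rule as used in greedy decoding. It is not the authors' decoder, which also involves attention, and the function name, the blank symbol, and the frame sequence are chosen here for illustration.

```python
from typing import Iterable


def ctc_collapse(frame_labels: Iterable[str], blank: str = "_") -> str:
    """Apply the CTC collapsing rule: merge consecutive repeats, then drop blanks."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return "".join(output)


# Hypothetical per-frame predictions for a short utterance.
frames = list("__ss_aa_l_ee_mm__")
transcript = ctc_collapse(frames)
```

The blank is what lets the model output genuinely doubled characters: the frames `a a _ a` collapse to `aa`, whereas `a a a` would collapse to a single `a`.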

Citations: 0

Emotion Detection System for Malayalam Text using Deep Learning and Transformers

IF 2 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-05-01 DOI: 10.1145/3663475

Anuja K, P. C. Reghu Raj, Remesh Babu K R

Recent advances in Natural Language Processing (NLP) have improved systems for tasks such as Emotion Detection (ED), Information Retrieval, and Translation in resource-rich languages like English and Chinese. Similar progress has not been made in Malayalam because annotated datasets are scarce. Its rich morphology, free word order, and agglutinative character also make data preparation in Malayalam highly challenging. In this paper, we train an emotion detection system using traditional Machine Learning (ML) techniques such as support vector machines (SVM) and multilayer perceptrons (MLP), recent deep learning methods such as Recurrent Neural Networks (RNN), and advanced transformer-based methods. This work stands out because all previous attempts to extract emotions from Malayalam text have relied on lexicons, which are ill-suited to handling large amounts of data. By tuning the hyperparameters of the transformer-based model MuRIL, we obtained an accuracy of 79%, which we then compared with the only state-of-the-art (SOTA) model. We found that the proposed techniques surpass the SOTA system reported so far for detecting emotions in Malayalam.



Citations: 0

Learning Domain Specific Sub-layer Latent Variable for Multi-Domain Adaptation Neural Machine Translation

IF 2 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-04-29 DOI: 10.1145/3661305

Shuanghong Huang, Chong Feng, Ge Shi, Zhengjun Li, Xuan Zhao, Xinyan Li, Xiaomei Wang

Domain adaptation is an effective remedy for inadequate translation performance within specific domains. However, the straightforward approach of mixing data from multiple domains to train a multi-domain neural machine translation (NMT) model can cause parameter interference between domains, which degrades overall performance. To address this, we introduce a multi-domain adaptive NMT method that learns domain-specific sub-layer latent variables and uses the Gumbel-Softmax reparameterization technique to train the model parameters and these latent variables concurrently. This approach facilitates learning private domain-specific knowledge while sharing common domain-invariant knowledge, effectively mitigating the parameter interference problem. Experimental results show that the proposed method improves significantly over the baseline model, by up to 7.68 and 3.71 BLEU on public English-German and Chinese-English multi-domain datasets, respectively.
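
The abstract names the Gumbel-Softmax reparameterization but does not describe how the authors apply it. As a rough illustration of the general technique only, the sketch below draws a relaxed (soft) one-hot sample from a set of logits using the Python standard library. The function name `gumbel_softmax`, the two-way "shared vs. domain-specific" choice, and the logit values are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample from a categorical distribution.

    logits: unnormalized log-probabilities, one per choice.
    temperature: must be > 0; smaller values push the sample toward one-hot.
    rng: any object with a random() method returning a float in [0, 1).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    noisy = []
    for logit in logits:
        u = rng.random()
        # Clamp away from 0 and 1 so both log() calls stay finite.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise: g = -log(-log(u)), with u ~ Uniform(0, 1).
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical use: a soft choice between two options for one sub-layer,
# e.g. index 0 = "shared parameters", index 1 = "domain-specific parameters".
sublayer_logits = [0.3, 1.2]  # illustrative values, not from the paper
sample = gumbel_softmax(sublayer_logits, temperature=0.5, rng=random.Random(0))
```

Because the sample is a smooth function of the logits for any fixed noise draw, an automatic-differentiation framework can pass gradients through it, which is what allows discrete-looking choices to be trained jointly with the other model parameters. This pure-Python version only shows the sampling step and computes no gradients.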



Citations: 0