
Latest Publications: ACM Transactions on Asian and Low-Resource Language Information Processing

Neural Machine Translation for Low-Resource Languages from a Chinese-centric Perspective: A Survey
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-16 | DOI: 10.1145/3665244
Jinyi Zhang, Ke Su, Haowei Li, Jiannan Mao, Ye Tian, Feng Wen, Chong Guo, Tadahiro Matsumoto
Machine translation—the automatic transformation of one natural language (source language) into another (target language) through computational means—occupies a central role in computational linguistics and stands as a cornerstone of research within the field of Natural Language Processing (NLP). In recent years, the prominence of Neural Machine Translation (NMT) has grown exponentially, offering an advanced framework for machine translation research. It is noted for its superior translation performance, especially when tackling the challenges posed by low-resource language pairs that suffer from a limited corpus of data resources. This article offers an exhaustive exploration of the historical trajectory and advancements in NMT, accompanied by an analysis of the underlying foundational concepts. It subsequently provides a concise demarcation of the unique characteristics associated with low-resource languages and presents a succinct review of pertinent translation models and their applications, specifically within the context of low-resource languages. Moreover, this article delves deeply into machine translation techniques, highlighting approaches tailored for Chinese-centric low-resource languages. Ultimately, it anticipates upcoming research directions in the realm of low-resource language translation.
Citations: 0
Scoring Multi-hop Question Decomposition Using Masked Language Models
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-15 | DOI: 10.1145/3665140
Abdellah Hamouda Sidhoum, M'hamed Mataoui, Faouzi Sebbak, Adil Imad Eddine Hosni, Kamel Smaili
Question answering (QA) is a sub-field of Natural Language Processing (NLP) that focuses on developing systems capable of answering natural language queries. Within this domain, multi-hop question answering represents an advanced QA task that requires gathering and reasoning over multiple pieces of information from diverse sources or passages. To handle the complexity of multi-hop questions, question decomposition has been proven to be a valuable approach. This technique involves breaking down complex questions into simpler sub-questions, reducing the complexity of the problem. However, existing question decomposition methods often rely on training data, which may not always be readily available for low-resource languages or specialized domains. To address this issue, we propose a novel approach that utilizes pre-trained masked language models to score decomposition candidates in a zero-shot manner. The method involves generating decomposition candidates, scoring them using pseudo-log-likelihood estimation, and ranking them based on their scores. To evaluate the efficacy of the decomposition process, we conducted experiments on two datasets annotated for decomposition in two different languages, Arabic and English. Subsequently, we integrated our approach into a complete QA system and conducted a reading comprehension performance evaluation on the HotpotQA dataset. The obtained results emphasize that while the system exhibited a small drop in performance, it still maintained a significant advance compared to the baseline model. The proposed approach highlights the efficiency of the language model scoring technique in complex reasoning tasks such as multi-hop question decomposition.
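The candidate-scoring step described in the abstract (mask each token in turn, sum the log-probabilities, rank decompositions) can be sketched in miniature. The toy unigram "masked LM" below is only a stand-in for a real pretrained model such as BERT, and all class and function names are invented for illustration:

```python
import math
from collections import Counter

class ToyMaskedLM:
    """Laplace-smoothed unigram model standing in for a real masked LM.

    A real system would query a pretrained model for
    P(token | context with this position masked); this stub ignores the
    context and only illustrates the scoring procedure."""
    def __init__(self, corpus):
        tokens = [t for sent in corpus for t in sent.split()]
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab = len(self.counts) + 1  # +1 smoothing slot for unseen tokens

    def token_prob(self, token, masked_context):
        return (self.counts[token] + 1) / (self.total + self.vocab)

def pseudo_log_likelihood(lm, tokens):
    """Mask each position in turn and sum log P(token | rest)."""
    pll = 0.0
    for i, tok in enumerate(tokens):
        context = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        pll += math.log(lm.token_prob(tok, context))
    return pll

def rank_decompositions(lm, candidates):
    """Rank candidate decompositions (lists of sub-questions) by the mean
    length-normalised PLL of their sub-questions, best first."""
    def score(cand):
        per_q = [pseudo_log_likelihood(lm, q.split()) / len(q.split())
                 for q in cand if q.split()]
        return sum(per_q) / len(per_q)
    return sorted(candidates, key=score, reverse=True)
```

With a real masked LM plugged in, fluent, well-formed sub-questions score a higher pseudo-log-likelihood than ill-formed candidates, which is what makes zero-shot ranking possible.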
Citations: 0
Abusive Language Detection in Khasi Social Media Comments
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-14 | DOI: 10.1145/3664285
Arup Baruah, Lakhamti Wahlang, Firstbornson Jyrwa, Floriginia Shadap, Ferdous Barbhuiya, Kuntal Dey

This paper describes the work performed for automated abusive language detection in Khasi, a low-resource language spoken primarily in the state of Meghalaya, India. A dataset named the Khasi Abusive Language Dataset (KALD) was created, consisting of 4,573 human-annotated Khasi YouTube and Facebook comments. A corpus of Khasi text was built and used to create Khasi word2vec and fastText word embeddings. Deep learning, traditional machine learning, and ensemble models were used in the study. Experiments were performed using word2vec, fastText, and topic vectors obtained using LDA. Experiments were also performed to check whether the zero-shot cross-lingual nature of language models such as LaBSE and LASER can be utilized for abusive language detection in Khasi. The best F1 score of 0.90725 was obtained by an XGBoost classifier. After feature selection and rebalancing of the dataset, F1 scores of 0.91828 and 0.91945 were obtained by SVM-based classifiers.
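The embedding-averaging pipeline can be illustrated with a toy stand-in: hand-crafted 2-d vectors replace the trained word2vec/fastText embeddings, and a nearest-centroid rule replaces the XGBoost/SVM classifiers. All tokens and labels below are invented placeholders, not actual KALD data:

```python
from statistics import fmean

# Invented 2-d vectors standing in for word2vec/fastText embeddings that the
# paper trains on a real Khasi corpus.
EMB = {
    "bad1": (0.9, 0.1), "bad2": (0.8, 0.2),    # placeholder abusive tokens
    "nice1": (0.1, 0.9), "nice2": (0.2, 0.8),  # placeholder benign tokens
}
UNK = (0.5, 0.5)  # vector for out-of-vocabulary tokens

def sentence_vector(tokens):
    """Average the word vectors of a comment (the usual fastText baseline)."""
    vecs = [EMB.get(t, UNK) for t in tokens]
    return (fmean(v[0] for v in vecs), fmean(v[1] for v in vecs))

def train_centroids(labelled):
    """Per-class mean vector; a simple stand-in for the XGBoost/SVM step."""
    byclass = {}
    for tokens, label in labelled:
        byclass.setdefault(label, []).append(sentence_vector(tokens))
    return {lab: (fmean(v[0] for v in vs), fmean(v[1] for v in vs))
            for lab, vs in byclass.items()}

def predict(centroids, tokens):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    v = sentence_vector(tokens)
    return min(centroids,
               key=lambda lab: (v[0] - centroids[lab][0]) ** 2
                             + (v[1] - centroids[lab][1]) ** 2)
```

The real system replaces each piece with its trained counterpart, but the data flow (tokens, vector averaging, classifier) is the same.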

Citations: 0
Marathi to Indian Sign Language Machine Translation
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-13 | DOI: 10.1145/3664609
Suvarna Rajesh. Bhagwat, R. P. Bhavsar, B. V. Pawar

Machine translation has been a prominent field of research, contributing significantly to the enhancement of human life. Sign language machine translation, a subfield, focuses on translating spoken-language content into sign language and vice versa, thereby facilitating communication between the hearing and hard-of-hearing communities and promoting inclusivity.

This study presents the development of a ‘sign language machine translation system’ converting simple Marathi sentences into Indian Sign Language (ISL) glosses and animation. Given the low-resource nature of both languages, a phrase-level rule-based approach was employed for the translation. Initial encoding of translation rules relied on basic linguistic knowledge of Marathi and ISL, with subsequent incorporation of rules to address 'simultaneous morphological' features in ISL. These rules were applied during the ‘generation phase’ of translation to dynamically adjust phonological sign parameters, resulting in improved target sentence fluency.

The paper provides a detailed description of the system architecture, translation rules, and comprehensive experimentation. Rigorous evaluation efforts were undertaken, encompassing various linguistic features, and the findings are discussed herein.

The web-based version of the system serves as an interpreter for brief communications and can support the teaching and learning of sign language and its grammar in schools for hard-of-hearing students.
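As a purely hypothetical illustration of what a phrase-level generation rule can look like (the paper's actual rules encode real Marathi and ISL grammar; the tags, tokens, and rules below are invented for the sketch):

```python
# ISL glosses typically omit copulas and case markers, so auxiliaries and
# adpositions are dropped; these two rules are toy examples, not the
# paper's rule set.
DROP_TAGS = {"AUX", "ADP"}

def to_isl_gloss(tagged_tokens):
    """tagged_tokens: list of (lemma, POS) pairs for one simple sentence.

    Applies two toy rules:
      (1) drop auxiliaries and adpositions,
      (2) move the WH word to the end (a common sign-language pattern).
    Glosses are conventionally written in uppercase."""
    kept = [(lem, tag) for lem, tag in tagged_tokens if tag not in DROP_TAGS]
    wh = [lem for lem, tag in kept if tag == "WH"]
    rest = [lem for lem, tag in kept if tag != "WH"]
    return [lem.upper() for lem in rest + wh]
```

A full system would chain many such rules and then drive the animation stage from the resulting gloss sequence, adjusting sign parameters during generation as the paper describes.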

Citations: 0
Multi Task Learning Based Shallow Parsing for Indian Languages
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-11 | DOI: 10.1145/3664620
Pruthwik Mishra, Vandan Mujadia

Shallow Parsing is an important step for many Natural Language Processing tasks. Although shallow parsing has a rich history for resource-rich languages, this is not the case for most Indian languages. Shallow Parsing consists of POS Tagging and Chunking. Our study focuses on developing shallow parsers for Indian languages. As part of shallow parsing, we include morph analysis as well.

For the study, we first consolidated available shallow parsing corpora for 7 Indian languages (Hindi, Kannada, Bangla, Malayalam, Marathi, Urdu, Telugu) for which treebanks are publicly available. We then trained models to achieve state-of-the-art performance for shallow parsing in these languages across multiple domains. Since analyzing the performance of model predictions at the sentence level is more realistic, we report the performance of these shallow parsers not only at the token level but also at the sentence level. We also present machine learning techniques for multi-task shallow parsing. Our experiments show that fine-tuned contextual embeddings with multi-task learning improve the performance of multiple as well as individual shallow parsing tasks across different domains. We show the transfer learning capability of these models by creating shallow parsers (only with POS and Chunk) for Gujarati, Odia, and Punjabi, for which no treebanks are available.

As part of this work, we will release the Indian Languages Shallow Linguistic (ILSL) benchmarks for 10 Indian languages, covering both major language families, Indo-Aryan and Dravidian. These benchmarks serve as common building blocks that can be used to evaluate and understand various linguistic phenomena found in Indian languages and how well newer approaches can tackle them.
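The token-level versus sentence-level evaluation distinction the study reports can be made concrete with two small helpers (hypothetical names, shown only to clarify the distinction):

```python
def token_accuracy(gold, pred):
    """gold, pred: lists of per-sentence tag sequences.
    Fraction of individual tokens tagged correctly."""
    correct = total = 0
    for g, p in zip(gold, pred):
        correct += sum(1 for gt, pt in zip(g, p) if gt == pt)
        total += len(g)
    return correct / total

def sentence_accuracy(gold, pred):
    """A sentence counts as correct only if every token in it is tagged
    correctly: the stricter, more realistic view reported alongside
    token-level scores."""
    ok = sum(1 for g, p in zip(gold, pred) if g == p)
    return ok / len(gold)
```

One wrong tag leaves token accuracy nearly unchanged but zeroes out that whole sentence at the sentence level, which is why the two numbers can diverge sharply.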

Citations: 0
Multi-Lingual Representation of Natural Language Processing for Low Resource Asian Language Processing Systems
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-10 | DOI: 10.1145/3603169
Elena Verdú, Y. Nieto, N. Saleem
Citations: 0
Enhancing Chinese Event Extraction with Event Trigger Structures
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-07 | DOI: 10.1145/3663567
Fei Li, Kaifang Deng, Yiwen Mo, Yuanze Ji, Chong Teng, Donghong Ji

The dependency syntactic structure is widely used in event extraction. However, the dependency structure, which reflects syntactic features, is essentially different from the event structure, which reflects semantic features, leading to performance degradation. In this paper, we propose to use Event Trigger Structures for Event Extraction (ETSEE), which can compensate for the inconsistency between the two structures. First, we leverage the ACE2005 dataset as a case study and annotate 3 kinds of ETSs, i.e., "light verb + trigger", "preposition structures", and "tense + trigger". Then we design a graph-based event extraction model that jointly identifies triggers and arguments, where the graph consists of both the dependency structure and ETSs. Experiments show that our model significantly outperforms state-of-the-art methods. Through empirical analysis and manual observation, we find that the ETSs bring the following benefits: (1) enriching trigger identification features by introducing structural event information; (2) enriching dependency structures with event semantic information; (3) enhancing the interactions between triggers and candidate arguments by shortening their distances in the dependency graph.
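A minimal sketch of the graph construction and one GCN propagation step, written in pure Python for clarity. This is an assumption-laden illustration of the general technique (merging dependency arcs and ETS arcs into one adjacency matrix, then aggregating neighbour features), not the authors' implementation, which would normally use a deep-learning framework:

```python
def build_adjacency(n, dep_edges, ets_edges):
    """Union of (undirected) dependency edges and ETS edges over n tokens."""
    adj = [[0] * n for _ in range(n)]
    for u, v in list(dep_edges) + list(ets_edges):
        adj[u][v] = adj[v][u] = 1
    return adj

def gcn_layer(adj, feats, weight):
    """One GCN step: H' = ReLU(D^-1 (A + I) H W).

    Adds self-loops, mean-aggregates neighbour features (row-normalised),
    applies a linear transform, then ReLU."""
    n, in_dim = len(adj), len(feats[0])
    out_dim = len(weight[0])
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    agg = []
    for i in range(n):
        deg = sum(a[i])
        agg.append([sum(a[i][k] * feats[k][d] for k in range(n)) / deg
                    for d in range(in_dim)])
    return [[max(0.0, sum(agg[i][d] * weight[d][o] for d in range(in_dim)))
             for o in range(out_dim)] for i in range(n)]
```

Adding ETS edges alongside dependency edges shortens the graph distance between triggers and candidate arguments, which is exactly the benefit (3) identified in the abstract.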

Citations: 0
A Hybrid Deep BiLSTM-CNN for Hate Speech Detection in Multi-social media
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-06 | DOI: 10.1145/3657635
Ashwini Kumar, Santosh Kumar, Kalpdrum Passi, Aniket Mahanti

Nowadays, ways of communication among people have changed due to advancements in information technology and the rise of online multi-social media. Many people express their feelings, ideas, and emotions on social media sites such as Instagram, Twitter, Gab, Reddit, Facebook, YouTube, etc. However, people have misused social media to send hateful messages to specific individuals or groups, creating chaos. Manually identifying hate speech across social media platforms to prevent such chaos is a difficult task for governance authorities. In this study, a hybrid deep-learning model is proposed in which a bidirectional long short-term memory (BiLSTM) network and a convolutional neural network (CNN) are used to classify hate speech in textual data. This model incorporates a GloVe-based word embedding approach, dropout, L2 regularization, and global max pooling to achieve impressive results. Further, the proposed BiLSTM-CNN model has been evaluated on various datasets, achieving state-of-the-art performance superior to traditional and existing machine learning methods in terms of accuracy, precision, recall, and F1-score.
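The convolution-plus-global-max-pooling stage of such a model can be sketched in pure Python with toy dimensions; a real model would apply these operations to BiLSTM output states inside a deep-learning framework:

```python
def conv1d(seq, kernel, bias=0.0):
    """Valid 1-d convolution of a T x D sequence with a w x D kernel,
    followed by ReLU; returns one feature map of length T - w + 1."""
    w, d_in = len(kernel), len(seq[0])
    out = []
    for t in range(len(seq) - w + 1):
        s = bias + sum(kernel[i][d] * seq[t + i][d]
                       for i in range(w) for d in range(d_in))
        out.append(max(0.0, s))  # ReLU
    return out

def global_max_pool(channels):
    """Reduce each feature map to its single strongest activation, turning a
    variable-length sequence into a fixed-size vector for the classifier."""
    return [max(ch) for ch in channels]
```

Global max pooling is what lets the classifier head accept comments of any length: however long the input, each convolution filter contributes exactly one number.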

Citations: 0
UrduAspectNet: Fusing Transformers and Dual GCN for Urdu Aspect-Based Sentiment Detection
IF 2.0 | CAS Zone 4, Computer Science | Q2 Computer Science | Pub Date: 2024-05-04 | DOI: 10.1145/3663367
Kamran Aziz, Aizihaierjiang Yusufu, Jun Zhou, Donghong Ji, Muhammad Shahid Iqbal, Shijie Wang, Hassan Jalil Hadi, Zhengming Yuan

Urdu, characterized by its intricate morphological structure and linguistic nuances, presents distinct challenges in computational sentiment analysis. Addressing these, we introduce "UrduAspectNet", a dedicated model tailored for Aspect-Based Sentiment Analysis (ABSA) in Urdu. Central to our approach is a rigorous preprocessing phase. Leveraging the Stanza library, we extract Part-of-Speech (POS) tags and lemmas, ensuring Urdu's linguistic intricacies are aptly represented. To probe the effectiveness of different embeddings, we trained our model using both mBERT and XLM-R embeddings, comparing their performances to identify the most effective representation for Urdu ABSA. Recognizing the nuanced inter-relationships between words, especially in Urdu's flexible syntactic constructs, our model incorporates a dual Graph Convolutional Network (GCN) layer. Addressing the challenge of the absence of a dedicated Urdu ABSA dataset, we curated our own, collecting over 4,603 news headlines from various domains, such as politics, entertainment, business, and sports. These headlines, sourced from diverse news platforms, not only identify prevalent aspects but also pinpoint their sentiment polarities, categorized as positive, negative, or neutral. Despite the inherent complexities of Urdu, such as its colloquial expressions and idioms, "UrduAspectNet" showcases remarkable efficacy. Initial comparisons between mBERT and XLM-R embeddings integrated with dual GCN provide valuable insights into their respective strengths in the context of Urdu ABSA.

Integrated End-to-End automatic speech recognition for languages for agglutinative languages
IF 2 4区 计算机科学 Q2 Computer Science Pub Date : 2024-05-03 DOI: 10.1145/3663568
Akbayan Bekarystankyzy, Orken Mamyrbayev, Tolganay Anarbekova

Automatic speech recognition (ASR) remains under-researched for low-resource languages, owing to limited training data and the need for new technologies that improve efficiency and performance. The purpose of this work was to study the main aspects of integrated end-to-end speech recognition and the use of modern technologies in the natural language processing of agglutinative languages, including Kazakh. The study of language models was carried out using comparative, graphic, statistical, and analytical-synthetic methods in combination. This paper addresses ASR in agglutinative languages, particularly Kazakh, through a unified neural network model that integrates both acoustic and language modeling. Employing techniques such as connectionist temporal classification (CTC) and attention mechanisms, the study focuses on effective speech-to-text transcription for languages with complex morphologies. Transfer learning from high-resource languages helps mitigate data scarcity in languages such as Kazakh, Kyrgyz, Uzbek, Turkish, and Azerbaijani. The research assesses model performance, underscores ASR challenges, and proposes advancements for these languages. It includes a comparative analysis, based on statistical data, of phonetic and word-formation features in agglutinative Turkic languages. The findings support further research in linguistics and technology for enhancing speech recognition and synthesis, contributing to voice identification and automation processes.
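The abstract relies on connectionist temporal classification (CTC), whose greedy decoding rule, collapse consecutive repeats, then drop blanks, is compact enough to sketch. The snippet below is illustrative only (the blank symbol `_` and the string labels are assumed notation, not the paper's), and a full CTC system would of course take per-frame argmax labels from an acoustic model rather than a hand-written string.

```python
BLANK = "_"  # assumed CTC blank symbol for this sketch

def ctc_collapse(frame_labels):
    """Greedy CTC decoding of a per-frame label sequence:
    merge runs of the same label, then remove blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:  # new non-blank label
            out.append(lab)
        prev = lab
    return "".join(out)
```

The blank symbol is what lets CTC emit genuinely doubled letters: `"l_l"` decodes to `"ll"`, while `"ll"` collapses to a single `"l"` — a detail that matters for morphologically rich, agglutinative words.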
