ACM Transactions on Asian and Low-Resource Language Information Processing: Latest Articles, Page 5

Cross-Domain Aspect-based Sentiment Classification with Pre-Training and Fine-Tuning Strategy for Low-Resource Domains

IF 2 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-21 DOI: 10.1145/3653299

Chunjun Zhao, Meiling Wu, Xinyi Yang, Xuzhuang Sun, Suge Wang, Deyu Li

Aspect-based sentiment classification (ABSC) is a crucial subtask of fine-grained sentiment analysis (SA), which aims to predict the sentiment polarity of the given aspects in a sentence as positive, negative, or neutral. Most existing ABSC methods are based on supervised learning. However, these methods rely heavily on fine-grained labeled training data, which can be scarce in low-resource domains, limiting their effectiveness. To overcome this challenge, we propose a low-resource cross-domain aspect-based sentiment classification (CDABSC) approach based on a pre-training and fine-tuning strategy. This approach applies the pre-training and fine-tuning strategy to an advanced deep learning method designed for ABSC, namely the attention-based encoding graph convolutional network (AEGCN) model. Specifically, a high-resource domain is selected as the source domain, and the AEGCN model is pre-trained using a large amount of fine-grained annotated data from the source domain. The optimal parameters of the model are preserved. Subsequently, a low-resource domain is used as the target domain, and the pre-trained model parameters are used as the initial parameters of the target domain model. The target domain model is fine-tuned using a small amount of annotated data to adapt the parameters to the target domain, improving the accuracy of sentiment classification in the low-resource domain. Finally, experimental validation on two domain benchmark datasets, restaurant and laptop, demonstrates that our approach significantly outperforms the baselines in CDABSC Micro-F1.
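The abstract reports results in Micro-F1 over the three polarity labels. As a minimal sketch of how that metric is usually computed (the function name `micro_f1` and the toy labels are illustrative assumptions, not taken from the paper), true positives, false positives, and false negatives are pooled over all classes before a single F1 is computed:

```python
from collections import Counter


def micro_f1(gold, pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # predicted class p wrongly
            fn[g] += 1          # missed the gold class g
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    if tp_sum == 0:
        return 0.0
    precision = tp_sum / (tp_sum + fp_sum)
    recall = tp_sum / (tp_sum + fn_sum)
    return 2 * precision * recall / (precision + recall)


# Hypothetical aspect-level labels, for illustration only.
gold = ["positive", "negative", "neutral", "positive", "negative"]
pred = ["positive", "negative", "positive", "positive", "neutral"]
score = micro_f1(gold, pred)
```

When every aspect carries exactly one gold label and one predicted label, as here, Micro-F1 coincides with plain accuracy, because each error counts once as a false positive and once as a false negative.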

Citations: 0

Supervised Contrast Learning Text Classification Model Based on Data Quality Augmentation

IF 2 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-19 DOI: 10.1145/3653300

Liang Wu, Fangfang Zhang, Chao Cheng, Shinan Song

Token-level data augmentation generates text samples by modifying the words of the sentences. However, data that are not easily classified can negatively affect the model. In particular, not considering the role of keywords when performing random augmentation operations on samples may lead to the generation of low-quality supplementary samples. Therefore, we propose a supervised contrast learning text classification model based on data quality augmentation (DQA). First, dynamic training is used to screen high-quality datasets containing beneficial information for model training. The selected data are then augmented based on important words with tag information. To obtain a better text representation to serve the downstream classification task, we employ a standard supervised contrast loss to train the model. Finally, we conduct experiments on five text classification datasets to validate the effectiveness of our model. In addition, ablation experiments are conducted to verify the impact of each module on classification.
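The abstract mentions a standard supervised contrast loss without giving its form. The sketch below assumes the commonly used supervised contrastive formulation, in which each anchor is pulled toward all other samples sharing its label and pushed away from the rest; the function names, the temperature of 0.1, and the toy embeddings are illustrative assumptions rather than details from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without same-label partners contributes nothing
        # Scaled similarities between the anchor and every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy 2-d "sentence embeddings": two tight clusters.
points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
aligned = supcon_loss(points, [0, 0, 1, 1])   # labels agree with the clusters
mixed = supcon_loss(points, [0, 1, 0, 1])     # labels cut across the clusters
```

The loss is small when same-label samples sit close together and large when labels cut across clusters, which is the property that makes the learned representation useful for a downstream classifier.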

Citations: 0

NPEL: Neural Paired Entity Linking in Web Tables

IF 2 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-19 DOI: 10.1145/3652511

Tianxing Wu, Lin Li, Huan Gao, Guilin Qi, Yuxiang Wang, Yuehua Li

This paper studies entity linking (EL) in Web tables, which aims to link the string mentions in table cells to their referent entities in a knowledge base. Two main problems exist in previous studies: 1) contextual information is not well utilized in mention-entity similarity computation; 2) the assumption on entity coherence that all entities in the same row or column are highly related to each other is not always correct. In this paper, we propose NPEL, a new Neural Paired Entity Linking framework, to overcome the above problems. In NPEL, we design a deep learning model with different neural networks and an attention mechanism to model different kinds of contextual information of mentions and entities for mention-entity similarity computation in Web tables. NPEL also relaxes the above assumption on entity coherence through a new paired entity linking algorithm, which iteratively selects two mentions with the highest confidence for EL. Experiments on real-world datasets show that NPEL achieves the best performance compared with state-of-the-art baselines on different evaluation metrics.

Citations: 0

THAR- Targeted Hate Speech Against Religion: A high-quality Hindi-English code-mixed Dataset with the Application of Deep Learning Models for Automatic Detection

IF 2 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-18 DOI: 10.1145/3653017

Deepawali Sharma, Aakash Singh, Vivek Kumar Singh

During the last decade, social media has gained significant popularity as a medium for individuals to express their views on various topics. However, some individuals also exploit social media platforms to spread hatred through their comments and posts, some of which target individuals, communities or religions. Given the deep emotional connections people have to their religious beliefs, this form of hate speech can be divisive and harmful, and may result in mental health issues and social disorder. Therefore, there is a need for algorithmic approaches for the automatic detection of instances of hate speech. Most of the existing studies in this area focus on social media content in English, and as a result several low-resource languages lack computational resources for the task. This study attempts to address this research gap by providing a high-quality annotated dataset designed specifically for identifying hate speech against religions in the Hindi-English code-mixed language. This dataset, “Targeted Hate Speech Against Religion” (THAR), consists of 11,549 comments and has been annotated by five independent annotators. It comprises two subtasks: (i) Subtask-1 (binary classification), (ii) Subtask-2 (multi-class classification). To ensure the quality of annotation, the Fleiss Kappa measure has been employed. The suitability of the dataset is then further explored by applying different standard deep learning and transformer-based models. The transformer-based model, namely Multilingual Representations for Indian Languages (MuRIL), is found to outperform the other implemented models in both subtasks, achieving macro average and weighted average F1 scores of 0.78 and 0.78 for Subtask-1, and 0.65 and 0.72 for Subtask-2, respectively. The experimental results obtained not only confirm the suitability of the dataset but also advance the research towards automatic detection of hate speech, particularly in the low-resource Hindi-English code-mixed language.
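The abstract states that annotation quality was checked with the Fleiss Kappa measure across five annotators. A minimal sketch of that statistic follows; the count matrix is invented for illustration and does not come from the THAR dataset, and the function name `fleiss_kappa` is an assumption. Each row is one comment, each column a category, and each cell the number of annotators who chose that category.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j (same rater count per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)   # agreement expected by chance
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical binary annotation by five annotators on five comments:
# columns are [hate, non-hate].
toy_counts = [[5, 0], [4, 1], [0, 5], [1, 4], [3, 2]]
kappa = fleiss_kappa(toy_counts)
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance. The sketch assumes every item was rated by the same number of annotators and that more than one category is actually used, since otherwise the denominator is zero.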

Citations: 0

Neurocomputer System of Semantic Analysis of the Text in the Kazakh Language

IF 2 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-13 DOI: 10.1145/3652159

Akerke Akanova, Aisulu Ismailova, Zhanar Oralbekova, Zhanat Kenzhebayeva, Galiya Anarbekova

The purpose of the study is to solve an extreme mathematical problem: semantic analysis of natural language, which can be used in various fields, including marketing research, online translators, and search engines. When training the neural network, data training methods based on the LDA model and vector representation of words were used. This study presents the development of a neurocomputer system for semantic analysis of text in the Kazakh language, based on machine learning and the LDA model. In the course of the study, the stages of system development were considered with respect to the text recognition algorithm. The system was implemented in Python, using libraries that greatly simplify the process of creating neural networks, including the Keras library. An experiment was conducted with the involvement of experts to test the effectiveness of the system, the results of which confirmed the reliability of the data provided by the system. The papers of modern computer linguists dealing with the problems of natural language processing using various technologies and methods are also considered.

Citations: 0

Multilingual Neural Machine Translation for Indic to Indic Languages

IF 2 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-12 DOI: 10.1145/3652026

Sudhansu Bala Das, Divyajyoti Panda, Tapas Kumar Mishra, Bidyut Kr. Patra, Asif Ekbal

The method of translation from one language to another without human intervention is known as Machine Translation (MT). Multilingual neural machine translation (MNMT) is a technique for MT that builds a single model for multiple languages. It is preferred over other approaches since it decreases training time and improves translation in low-resource contexts, i.e., for languages that have insufficient corpora. However, good-quality MT models are yet to be built for many scenarios, such as Indic-to-Indic Languages (IL-IL). Hence, this paper is an attempt to develop baseline models for low-resource languages, i.e., IL-IL (for 11 Indic Languages (ILs)), in a multilingual environment. The models are built on the Samanantar corpus and analyzed on the Flores-200 corpus. All the models are evaluated using a standard evaluation metric, i.e., the Bilingual Evaluation Understudy (BLEU) score (with a range of 0 to 100). This paper examines the effect of the grouping of related languages, namely East Indo-Aryan (EI), Dravidian (DR), and West Indo-Aryan (WI), on the MNMT model. The experimental results reveal that related language grouping is beneficial for the WI group only, while it is detrimental for the EI group and shows an inconclusive effect on the DR group. The role of pivot-based MNMT models in enhancing translation quality is also investigated in this paper. Owing to the presence of large good-quality corpora from English (EN) to ILs, MNMT IL-IL models using EN as a pivot are built and examined. To achieve this, English-Indic Language (EN-IL) models are developed with and without the usage of related languages. Results show that the use of related language grouping is advantageous specifically for EN to ILs. Thus, related language groups are used for the development of pivot MNMT models. It is also observed that the usage of pivot models greatly improves MNMT baselines. Furthermore, the effect of transliteration on ILs is also analyzed in this paper. To explore transliteration, the best MNMT models from the previous approaches (in most cases, the pivot model using related groups) are determined and built on corpora transliterated from the corresponding scripts to a modified Indian language Transliteration script (ITRANS). The outcome of the experiments indicates that transliteration helps the models built for lexically rich languages, with the best increments in BLEU score observed in Malayalam (ML) and Tamil (TA), i.e., 6.74 and 4.72, respectively. The BLEU score using transliteration models ranges from 7.03 to 24.29. The best model obtained is the Punjabi (PA)-Hindi (HI) language pair trained on the PA-WI transliterated corpus.
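The pivot-based setup described in the abstract can be pictured as chaining two translators through English: source language to EN, then EN to target language. The sketch below shows only that composition. The lookup tables are placeholder tokens standing in for trained models, and every name in it (`make_pivot_translator`, `lookup_translator`, the toy vocabularies) is a hypothetical illustration rather than the authors' implementation.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def lookup_translator(table: Dict[str, str]) -> Translator:
    """Toy word-by-word translator; a real system would call a trained model."""
    def translate(sentence: str) -> str:
        # Unknown tokens are passed through unchanged.
        return " ".join(table.get(tok, tok) for tok in sentence.split())
    return translate


def make_pivot_translator(src_to_pivot: Translator,
                          pivot_to_tgt: Translator) -> Translator:
    """Compose two translators so the pivot-language output feeds the second."""
    def translate(sentence: str) -> str:
        return pivot_to_tgt(src_to_pivot(sentence))
    return translate


# Placeholder vocabularies (not real Punjabi or Hindi words).
pa_to_en = lookup_translator({"pa_hello": "hello", "pa_world": "world"})
en_to_hi = lookup_translator({"hello": "hi_hello", "world": "hi_world"})

pa_to_hi = make_pivot_translator(pa_to_en, en_to_hi)
result = pa_to_hi("pa_hello pa_world")
```

One consequence of this design is that errors made in the first step propagate into the second, so pivoting helps mainly when both EN-side models are markedly stronger than a direct model trained on scarce IL-IL data.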

Citations: 0

Medical Question Summarization with Entity-driven Contrastive Learning

IF 2 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-11 DOI: 10.1145/3652160

Wenpeng Lu, Sibo Wei, Xueping Peng, Yi-Fei Wang, Usman Naseem, Shoujin Wang

By summarizing longer consumer health questions into shorter and essential ones, medical question-answering systems can more accurately understand consumer intentions and retrieve suitable answers. However, medical question summarization is very challenging due to obvious distinctions in how patients and doctors describe health problems. Although deep learning has been applied to successfully address the medical question summarization (MQS) task, two challenges remain: how to correctly capture question focus to model its semantic intention, and how to obtain reliable datasets to fairly evaluate performance. To address these challenges, this paper proposes a novel medical question summarization framework based on entity-driven contrastive learning (ECL). ECL employs medical entities present in frequently asked questions (FAQs) as focuses and devises an effective mechanism to generate hard negative samples. This approach compels models to focus on essential information and consequently generate more accurate question summaries. Furthermore, we have discovered that some MQS datasets, such as the iCliniq dataset with a 33% duplicate rate, have significant data leakage issues. To ensure an impartial evaluation of the related methods, this paper carefully examines leaked samples to reorganize more reasonable datasets. Extensive experiments demonstrate that our ECL method outperforms the existing methods and achieves new state-of-the-art performance, i.e., 52.85, 43.16, 41.31, 43.52 in terms of the ROUGE-1 metric on the MeQSum, CHQ-Summ, iCliniq, and HealthCareMagic datasets, respectively. The code and datasets are available at https://github.com/yrbobo/MQS-ECL.
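The results above are given in ROUGE-1, which measures unigram overlap between a generated summary and a reference. A minimal sketch is shown below, assuming whitespace tokenization and lowercasing only; published scores are typically produced with a full ROUGE toolkit (with its own tokenization and optional stemming) and reported on a 0 to 100 scale, so this simplified version will not reproduce them exactly. The function name and example questions are illustrative assumptions.

```python
from collections import Counter


def rouge1(candidate: str, reference: str):
    """Return (precision, recall, F1) of clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical summarized questions, for illustration only.
reference = "what are the treatments for migraine"
candidate = "what are treatments for chronic migraine"
precision, recall, f1 = rouge1(candidate, reference)
```

Because the metric only counts shared words, a summary can score well while missing the question focus, which is one reason the choice of evaluation data matters as much as the metric itself.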

Citations: 0

Unsupervised Multimodal Machine Translation for Low-Resource Distant Language Pairs

IF 2 | Zone 4 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-09 DOI: 10.1145/3652161

Turghun Tayir, Lin Li

Unsupervised machine translation (UMT), which enables models to translate between languages that lack parallel corpora, has recently attracted growing attention from researchers. However, current work mainly considers close language pairs (e.g., English-German and English-French), and the effectiveness of visual content for distant language pairs has yet to be investigated. This paper proposes an unsupervised multimodal machine translation (UMMT) model for low-resource distant language pairs. Specifically, we first employ measures such as transliteration and re-ordering to bring distant language pairs closer together. We then use visual content to extend masked language modeling (MLM) into visual masked language modeling (VMLM) for UMT. Finally, empirical experiments are conducted on our distant language pair dataset and the public Multi30k dataset. Experimental results demonstrate the superior performance of our model, with BLEU score improvements of 2.5 and 2.6 on the distant language pairs English-Uyghur and Chinese-Uyghur, respectively. Moreover, our model also performs well on close language pairs, improving by 2.3 BLEU over existing models on English-German.
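As background for the MLM objective mentioned in this abstract, the text-side masking step can be sketched as follows. This is a simplified illustration, not the paper's VMLM: the 15% masking rate is the common BERT default and is assumed here, the name `mask_tokens` is hypothetical, and the visual conditioning (predicting masked tokens with the help of image features) is not shown.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens for a masked-language-modeling objective.

    Returns (masked, targets): `masked` is the corrupted input and
    `targets` holds the original token at masked positions, else None.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(token)  # the model is trained to recover this token
        else:
            masked.append(token)
            targets.append(None)  # position does not contribute to the loss
    return masked, targets


if __name__ == "__main__":
    sentence = "a man rides a horse on the beach".split()
    corrupted, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
    print(corrupted)
    print(labels)
```

A visual variant would feed image features alongside `corrupted`, so that a masked word such as "horse" could be recovered from the picture even when the textual context is ambiguous.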

Citations: 0

DeepMedFeature: An Accurate Feature Extraction and Drug-Drug Interaction Model for Clinical Text in Medical Informatics

IF 2 | Zone 4 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-09 DOI: 10.1145/3651159

M. Shoaib Malik, Sara Jawad, Syed Atif Moqurrab, Gautam Srivastava

Drug-drug interactions (DDIs) are an important biological phenomenon that can lead to medical errors by practitioners. Drug interactions can change the molecular structure of the interacting agents, which may prove fatal in the worst case. Finding drug interactions early in diagnosis can be pivotal in preventing side effects. The growth of big data provides a rich source of information for clinical studies investigating DDIs. We propose a hierarchical, double-pass classification model. The first pass predicts whether an interaction occurs, and the second pass then predicts the type of interaction, such as effect, advice, mechanism, or int. We applied several deep learning algorithms, of which Convolutional Bi-LSTM (ConvBLSTM) proved to be the best. The results show that pre-trained vector embeddings are the most appropriate features. The ConvBLSTM algorithm achieved F1-scores of 96.39% and 98.37% on Russian and English, respectively, exceeding state-of-the-art systems. From these results, it can be concluded that adding a convolution layer before the bi-directional pass improves model performance in the automatic classification and extraction of drug interactions when using pre-trained vector embeddings such as Fasttext and Bio-Bert.
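The double-pass control flow described in this abstract can be sketched in a few lines. Only the two-stage structure and the four type labels come from the abstract; the function `classify_two_pass` and the keyword-based stand-ins `toy_detector` and `toy_typer` are hypothetical placeholders for the trained ConvBLSTM models, which are not reproduced here.

```python
from typing import Callable, Optional

# Interaction types as named in the abstract.
INTERACTION_TYPES = ("effect", "advice", "mechanism", "int")


def classify_two_pass(
    sentence: str,
    detects_interaction: Callable[[str], bool],
    predicts_type: Callable[[str], str],
) -> Optional[str]:
    """Pass 1 decides whether any interaction is present; pass 2 names its type."""
    if not detects_interaction(sentence):
        return None  # the second pass is skipped for non-interacting sentences
    label = predicts_type(sentence)
    if label not in INTERACTION_TYPES:
        raise ValueError(f"unexpected interaction type: {label!r}")
    return label


# Toy stand-ins for the two trained models (assumptions for illustration only).
def toy_detector(sentence: str) -> bool:
    cues = ("increase", "avoid", "inhibit", "interact")
    return any(cue in sentence.lower() for cue in cues)


def toy_typer(sentence: str) -> str:
    text = sentence.lower()
    if "avoid" in text:
        return "advice"
    if "inhibit" in text:
        return "mechanism"
    if "increase" in text:
        return "effect"
    return "int"


if __name__ == "__main__":
    print(classify_two_pass("DrugA may increase the effect of DrugB.", toy_detector, toy_typer))
    print(classify_two_pass("The patient was given DrugA.", toy_detector, toy_typer))
```

Splitting detection from typing lets the second model train only on sentences that contain an interaction, which is one plausible motivation for a hierarchical design.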

Citations: 0

Consensus-Based Machine Translation for Code-Mixed Texts

IF 2 | Zone 4 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-09 DOI: 10.1145/3628427

Sainik Kumar Mahata, Dipankar Das, Sivaji Bandyopadhyay

Multilingualism is widespread in India owing to its long history of foreign contact. As a result, much of the population is comfortable conversing in more than one language. The social media boom has made communication in multiple languages even more common. Hence, a translation system that can serve novice and monolingual users is urgently needed. Such systems can be built with methods such as statistical machine translation and neural machine translation, each of which has its own advantages and disadvantages. In addition, the parallel corpus with code-mixed data needed to build such a translation system is not readily available. In the present work, we present two translation frameworks that leverage the individual advantages of these existing approaches by building an ensemble model that takes a consensus of their final outputs and generates the target output. The developed models were used to translate English-Bengali code-mixed data (written in Roman script) into equivalent monolingual Bengali instances. A code-mixed to monolingual parallel corpus was also developed to train these systems. Empirical results show improved BLEU/TER scores of 17.23/53.18 and 19.12/51.29, respectively, for the two developed frameworks.
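One simple way to take a consensus over the outputs of several translation systems is to keep the candidate that agrees most with the others. The abstract does not say how the consensus is computed, so the token-level Jaccard similarity, the function names `jaccard` and `consensus_pick`, and the toy candidates below are assumptions made purely for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two sentences, in [0, 1]."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def consensus_pick(candidates):
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate translation")
    if len(candidates) == 1:
        return candidates[0]

    def mean_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(jaccard(candidates[i], c) for c in others) / len(others)

    best = max(range(len(candidates)), key=mean_agreement)
    return candidates[best]


if __name__ == "__main__":
    # Toy outputs standing in for, e.g., an SMT system and two NMT systems.
    outputs = ["the cat sat", "the cat sat down", "the cat sat there"]
    print(consensus_pick(outputs))
```

This selection-style consensus never produces a sentence that no system generated; an ensemble that merges outputs at the word level would need a different design.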

Citations: 0