ACM Transactions on Asian and Low-Resource Language Information Processing: Latest Articles, Page 8

Neurocomputer System of Semantic Analysis of the Text in the Kazakh Language

IF 2, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-13 DOI: 10.1145/3652159

Akerke Akanova, Aisulu Ismailova, Zhanar Oralbekova, Zhanat Kenzhebayeva, Galiya Anarbekova

The purpose of the study is to solve an extreme mathematical problem: semantic analysis of natural language, which can be applied in fields such as marketing research, online translation, and search engines. The neural network was trained with methods based on the LDA model and vector representations of words. The study presents the development of a neurocomputer system for semantic analysis of Kazakh-language text, based on machine learning and the LDA model. The stages of system development are described with respect to the text recognition algorithm. The system was implemented in Python, using libraries that greatly simplify building neural networks, including Keras. An experiment involving experts was conducted to test the effectiveness of the system, and its results confirmed the reliability of the data the system provides. Work by contemporary computational linguists on natural language processing with various technologies and methods is also reviewed.

Citations: 0

Multilingual Neural Machine Translation for Indic to Indic Languages

IF 2, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-12 DOI: 10.1145/3652026

Sudhansu Bala Das, Divyajyoti Panda, Tapas Kumar Mishra, Bidyut Kr. Patra, Asif Ekbal

Machine Translation (MT) is the translation of text from one language to another without human intervention. Multilingual neural machine translation (MNMT) is an MT technique that builds a single model for multiple languages. It is preferred over other approaches because it decreases training time and improves translation in low-resource contexts, i.e., for languages with insufficient corpora. However, good-quality MT models are yet to be built for many scenarios, such as Indic-to-Indic Languages (IL-IL). Hence, this paper attempts to develop baseline models for low-resource languages, i.e., IL-IL (for 11 Indic Languages (ILs)), in a multilingual environment. The models are built on the Samanantar corpus and analyzed on the Flores-200 corpus. All the models are evaluated using a standard evaluation metric, the Bilingual Evaluation Understudy (BLEU) score (with a range of 0 to 100). This paper examines the effect of grouping related languages, namely East Indo-Aryan (EI), Dravidian (DR), and West Indo-Aryan (WI), on the MNMT model. The experiments reveal that related-language grouping is beneficial only for the WI group, detrimental for the EI group, and inconclusive for the DR group. The role of pivot-based MNMT models in enhancing translation quality is also investigated. Owing to the presence of large good-quality corpora from English (EN) to ILs, MNMT IL-IL models using EN as a pivot are built and examined. To achieve this, English-Indic Language (EN-IL) models are developed with and without the use of related languages. Results show that related-language grouping is advantageous specifically for EN to ILs. Thus, related language groups are used for the development of pivot MNMT models. It is also observed that the use of pivot models greatly improves on the MNMT baselines.
Furthermore, the effect of transliteration on ILs is analyzed. To explore transliteration, the best MNMT models from the previous approaches (in most cases, the pivot model using related groups) are identified and built on a corpus transliterated from the corresponding scripts to a modified Indian language Transliteration script (ITRANS). The experiments indicate that transliteration helps the models built for lexically rich languages, with the largest BLEU increases observed for Malayalam (ML) and Tamil (TA), i.e., 6.74 and 4.72, respectively. BLEU scores for the transliteration models range from 7.03 to 24.29. The best model obtained is for the Punjabi (PA)-Hindi (HI) language pair, trained on the PA-WI transliterated corpus.
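Since the abstract reports every result as a BLEU score on a 0 to 100 scale, a small sketch of how such a score is computed may help when reading the numbers. This is a simplified, unsmoothed, single-reference, sentence-level version written for illustration; the function names are hypothetical, whitespace tokenization is an assumption, and the paper's actual evaluation tooling is not stated in the abstract.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed single-reference BLEU on a 0-100 scale (illustrative only)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty lowers the score of hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)
```

Because this sketch applies no smoothing, any hypothesis shorter than `max_n` tokens scores 0; corpus-level BLEU as normally reported aggregates n-gram counts over all sentences before taking the precisions, which avoids much of that brittleness.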

Citations: 0

Medical Question Summarization with Entity-driven Contrastive Learning

IF 2, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-11 DOI: 10.1145/3652160

Wenpeng Lu, Sibo Wei, Xueping Peng, Yi-Fei Wang, Usman Naseem, Shoujin Wang

By summarizing longer consumer health questions into shorter, essential ones, medical question-answering systems can more accurately understand consumer intentions and retrieve suitable answers. However, medical question summarization is very challenging because patients and doctors describe health problems in markedly different ways. Although deep learning has been applied successfully to the medical question summarization (MQS) task, two challenges remain: how to correctly capture the question focus to model its semantic intention, and how to obtain reliable datasets to fairly evaluate performance. To address these challenges, this paper proposes a novel medical question summarization framework based on entity-driven contrastive learning (ECL). ECL uses the medical entities present in frequently asked questions (FAQs) as focal points and devises an effective mechanism to generate hard negative samples. This approach compels models to focus on essential information and consequently generate more accurate question summaries. Furthermore, we have discovered that some MQS datasets, such as the iCliniq dataset with a 33% duplicate rate, have significant data leakage issues. To ensure an impartial evaluation of the related methods, this paper carefully examines the leaked samples and reorganizes the datasets more reasonably. Extensive experiments demonstrate that our ECL method outperforms existing methods and achieves new state-of-the-art performance, with ROUGE-1 scores of 52.85, 43.16, 41.31, and 43.52 on the MeQSum, CHQ-Summ, iCliniq, and HealthCareMagic datasets, respectively. The code and datasets are available at https://github.com/yrbobo/MQS-ECL.
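The data leakage point above can be illustrated with a small check for test questions that also occur in the training split. This is only a sketch under stated assumptions: the function names are hypothetical, matching is exact after a simple normalization, and the abstract does not say how the authors define or measure the duplicate rate.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so near-identical questions compare equal."""
    return " ".join(re.findall(r"\w+", text.lower()))


def split_leaked(train, test):
    """Partition test samples into (clean, leaked) by normalized exact match."""
    seen = {normalize(q) for q in train}
    clean, leaked = [], []
    for q in test:
        (leaked if normalize(q) in seen else clean).append(q)
    return clean, leaked


def duplicate_rate(samples):
    """Fraction of samples that repeat an earlier sample after normalization."""
    seen, dupes = set(), 0
    for q in samples:
        key = normalize(q)
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes / len(samples) if samples else 0.0
```

Evaluating only on the `clean` partition keeps a model from being rewarded for memorizing training questions; a real pipeline would likely also need fuzzy matching for paraphrased duplicates, which this sketch does not attempt.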

Citations: 0

Unsupervised Multimodal Machine Translation for Low-Resource Distant Language Pairs

IF 2, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-09 DOI: 10.1145/3652161

Turghun Tayir, Lin Li

Unsupervised machine translation (UMT) has recently attracted more attention from researchers, as it enables models to translate between languages that lack parallel corpora. However, current work mainly considers close language pairs (e.g., English-German and English-French), and the effectiveness of visual content for distant language pairs has yet to be investigated. This paper proposes an unsupervised multimodal machine translation (UMMT) model for low-resource distant language pairs. Specifically, we first employ measures such as transliteration and re-ordering to bring distant language pairs closer together. We then use visual content to extend masked language modeling (MLM) into visual masked language modeling (VMLM) for UMT. Finally, empirical experiments are conducted on our distant language pair dataset and the public Multi30k dataset. Experimental results demonstrate the superior performance of our model, with BLEU score improvements of 2.5 and 2.6 on the distant language pairs English-Uyghur and Chinese-Uyghur. Moreover, our model also performs well on close language pairs, improving by 2.3 BLEU over existing models on English-German.
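For readers unfamiliar with masked language modeling, the text-only masking step that VMLM builds on can be sketched as follows. This is a deliberately simplified illustration: the `[MASK]` symbol, the 15% default rate, and the function name are assumptions, and the visual conditioning that the paper adds is not modeled here.

```python
import random

MASK = "[MASK]"  # placeholder symbol; the paper's actual vocabulary is not stated


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Hide a random subset of tokens and return (masked_input, targets).

    targets[i] holds the original token where position i was masked and
    None elsewhere, so a model is only trained to predict hidden positions.
    """
    rng = rng or random.Random()
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets
```

Passing a seeded `random.Random` instance as `rng` makes the masking reproducible. In a multimodal variant, the model would receive image features alongside `masked` when predicting the hidden tokens; how the paper combines the two is not described in the abstract.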

Citations: 0

DeepMedFeature: An Accurate Feature Extraction and Drug-Drug Interaction Model for Clinical Text in Medical Informatics

IF 2, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-09 DOI: 10.1145/3651159

M. Shoaib Malik, Sara Jawad, Syed Atif Moqurrab, Gautam Srivastava

Drug-drug interactions (DDIs) are an important biological phenomenon that can lead to medical errors by practitioners. Drug interactions can change the molecular structure of the interacting agents, which may prove fatal in the worst case. Finding drug interactions early in diagnosis can be pivotal in preventing side effects. The growth of big data provides a rich source of information for clinical studies investigating DDIs. We propose a hierarchical, double-pass classification model. The first pass predicts whether an interaction occurs, and the second pass then predicts the type of interaction, such as effect, advice, mechanism, and int. We applied several deep learning algorithms, of which Convolutional Bi-LSTM (ConvBLSTM) proved to be the best. The results show that pre-trained vector embeddings are the most appropriate features. The ConvBLSTM algorithm achieved F1-scores of 96.39% and 98.37% for Russian and English, respectively, exceeding state-of-the-art systems. From these results, it can be concluded that adding a convolution layer before the bi-directional pass improves model performance in the automatic classification and extraction of drug interactions when using pre-trained vector embeddings such as fastText and BioBERT.
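The double-pass structure described above can be sketched as a small control-flow wrapper. Everything here is hypothetical scaffolding: `detect` and `label` stand in for the trained models (ConvBLSTM in the paper), and the toy keyword rules exist only so the example runs; they say nothing about how the real classifiers decide.

```python
from typing import Callable, Optional

# Interaction types named in the abstract.
INTERACTION_TYPES = ("effect", "advice", "mechanism", "int")


def classify_two_pass(
    sentence: str,
    detect: Callable[[str], bool],
    label: Callable[[str], str],
) -> Optional[str]:
    """Run pass 1 (is there an interaction?) and, only if so, pass 2 (which type?)."""
    if not detect(sentence):
        return None  # pass 1 found no interaction, so pass 2 is skipped
    predicted = label(sentence)
    if predicted not in INTERACTION_TYPES:
        raise ValueError(f"unexpected interaction type: {predicted!r}")
    return predicted


# Toy keyword rules standing in for trained models (illustration only).
def toy_detect(sentence: str) -> bool:
    return "with" in sentence.lower().split()


def toy_label(sentence: str) -> str:
    text = sentence.lower()
    if "avoid" in text or "should" in text:
        return "advice"
    if "increase" in text or "decrease" in text:
        return "effect"
    return "int"
```

One motivation for this kind of split is that the second-pass classifier only ever sees sentences already judged to contain an interaction, so it does not have to model the typically much larger negative class.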

Citations: 0

Consensus-Based Machine Translation for Code-Mixed Texts

IF 2, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-09 DOI: 10.1145/3628427

Sainik Kumar Mahata, Dipankar Das, Sivaji Bandyopadhyay

Multilingualism is widespread in India owing to its long history of foreign contact. As a result, much of the population is used to conversing in more than one language. Additionally, with the social media boom, the use of multiple languages to communicate has become extensive. Hence, a translation system that can serve novice and monolingual users is urgently needed. Such systems can be developed with methods such as statistical machine translation and neural machine translation, each of which has its advantages and disadvantages. In addition, the parallel corpus of code-mixed data needed to build a translation system is not readily available. In the present work, we present two translation frameworks that leverage the individual advantages of these pre-existing approaches by building an ensemble model that takes a consensus of their final outputs and generates the target output. The developed models were used to translate English-Bengali code-mixed data (written in Roman script) into equivalent monolingual Bengali instances. A code-mixed to monolingual parallel corpus was also developed to train the preceding systems. Empirical results show improved BLEU and TER scores of 17.23 and 53.18, and 19.12 and 51.29, respectively, for the two developed frameworks.
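One simple way to take a consensus over the outputs of several translation systems is to pick the candidate that agrees most with the others. The sketch below does this with token-level similarity from Python's standard `difflib`; it is an assumed stand-in for illustration, not the ensemble mechanism the authors built, which the abstract does not detail.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Token-level similarity in [0, 1] between two sentences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def consensus(candidates):
    """Return the candidate that is most similar, on average, to all the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def agreement(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(similarity(candidates[i], o) for o in others) / len(others)

    # On ties, max() keeps the earliest candidate.
    best = max(range(len(candidates)), key=agreement)
    return candidates[best]
```

With outputs from, say, a statistical and a neural system plus a third candidate, `consensus` returns whichever translation the others most resemble. Selection of this kind can only choose among existing outputs; it cannot combine fragments of different candidates into a new sentence.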

Citations: 0

Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment

IF 2, Zone 4 (Computer Science), Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-09 DOI: 10.1145/3651983

Huan Rong, Zhongfeng Chen, Zhenyu Lu, Fan Xu, Victor S. Sheng

This paper focuses on the task of Multi-Modal Summarization with Multi-Modal Output for product descriptions from the Chinese e-commerce platform JD.COM, which contain both source text and source images. In context learning over multi-modal (text and image) input, there is a semantic gap between text and image, especially in their cross-modal semantics. As a result, capturing shared cross-modal semantics early becomes crucial for multi-modal summarization. Moreover, because input text and images contribute differently when the multi-modal summary is generated, both the relevance and the irrelevance of multi-modal contexts to the target summary should be considered, so as to optimize cross-modal context learning, guide summary generation, and emphasize the significant semantics within each modality. To address these challenges, Multization is proposed to enhance multi-modal semantic information through multi-contextually relevant and irrelevant attention alignment. Specifically, a Semantic Alignment Enhancement mechanism captures shared semantics between the modalities (text and image), raising the importance of crucial multi-modal information in the encoding stage. Additionally, the IR-Relevant Multi-Context Learning mechanism observes the summary generation process from both relevant and irrelevant perspectives, forming a multi-modal context that incorporates both text and image semantic information. Experimental results on the China JD.COM e-commerce dataset demonstrate that Multization effectively captures the shared semantics between the input source text and source images and highlights essential semantics. It also successfully generates a multi-modal summary (including image and text) that comprehensively considers the semantic information of both text and image.

Citations: 0

Am I hurt?: Evaluating Psychological Pain Detection in Hindi Text using Transformer-based Models

IF 2 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-05 DOI: 10.1145/3650206

Ravleen Kaur, M. P. S. Bhatia, Akshi Kumar

The automated evaluation of pain is critical for developing effective pain management approaches that seek to alleviate pain while preserving patients’ functioning. Transformer-based models can aid in detecting pain in Hindi text gathered from social media by leveraging their ability to capture complex language patterns and contextual information. By understanding the nuances and context of Hindi text, transformer models can identify linguistic cues, sentiment, and expressions associated with pain, enabling the detection and analysis of pain-related content in social media posts. The purpose of this research is to analyse the feasibility of using NLP techniques to automatically identify pain in Hindi textual data, providing a valuable tool for pain assessment in Hindi-speaking populations. The research presents the HindiPainNet model, a deep neural network built on the IndicBERT model that classifies Hindi text into two class labels, {pain, no_pain}. The model is trained and tested on a novel dataset, दर्द-ए-शायरी (pronounced Dard-e-Shayari), curated from posts on social media platforms. The model achieves an accuracy of 70.5%. This pioneering research highlights the potential of using textual data from diverse sources to identify and understand pain experiences based on psychosocial factors. It could pave the way for automated pain assessment tools that help medical professionals comprehend and treat pain in Hindi-speaking populations. Additionally, it opens avenues for further NLP-based multilingual pain detection research that addresses the needs of diverse language communities.
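An accuracy figure on a two-class task such as {pain, no_pain} is easiest to read next to a majority-class baseline and the confusion counts. The sketch below shows how those quantities are computed; the gold and predicted labels are hypothetical toy data, not drawn from the Dard-e-Shayari dataset, and the abstract does not state the dataset's class balance.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent gold label."""
    return Counter(gold).most_common(1)[0][1] / len(gold)

def confusion_counts(gold, predicted):
    """Count how often each (gold, predicted) label pair occurs."""
    return Counter(zip(gold, predicted))

# Hypothetical toy labels for eight posts.
gold = ["pain", "no_pain", "pain", "pain", "no_pain", "no_pain", "pain", "no_pain"]
pred = ["pain", "no_pain", "no_pain", "pain", "pain", "no_pain", "pain", "no_pain"]

acc = accuracy(gold, pred)               # 6 of 8 correct: 0.75
baseline = majority_baseline(gold)       # balanced classes: 0.5
confusion = confusion_counts(gold, pred)
```

With balanced classes the baseline is 0.5, so the gap between a model's accuracy and the baseline, rather than the raw accuracy alone, indicates how much the classifier has learned.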

Citations: 0

TransVAE-PAM: A Combined Transformer and DAG-based Approach for Enhanced Fake News Detection in Indian Context

IF 2 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-02 DOI: 10.1145/3651160

Shivani Tufchi, Tanveer Ahmed, Ashima Yadav, Krishna Kant Agrawal, Ankit Vidyarthi

In this study, we introduce a novel method, “TransVAE-PAM”, for classifying fake news articles, tailored specifically to the Indian context. The approach uses state-of-the-art contextual and sentence transformer-based embedding models to generate article embeddings. We also address the need for a compact model: we employ a Variational Autoencoder (VAE) and a β-VAE to reduce the dimensionality of the embeddings, yielding compact latent representations. To capture the important topics in the news articles, we use the Pachinko Allocation Model (PAM), a Directed Acyclic Graph (DAG) based approach, to generate meaningful topics. These two facets of representation (the reduced-dimension embeddings from the VAE and the topics extracted by PAM) are fused to create a feature set, which is then passed to five different methods for fake news classification. We use eight distinct transformer-based architectures to test the embedding generation. To validate the feasibility of the proposed approach, we conducted extensive experiments on a proprietary dataset sourced from “Times of India” and other online media. Given the size of the dataset, the large-scale experiments were run on an NVIDIA supercomputer. With the DistilBERT transformer architecture, we achieved an accuracy of 96.2% and an F1 score of 96%. Complementing the method with topic modeling improves performance, with accuracy and F1 score both at 97%. These results indicate that integrating advanced topic models into existing classification schemes is a promising direction for research on fake news detection.
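The two ingredients described above, a β-weighted VAE objective and the fusion of latent embeddings with topic features, can be sketched as follows. The KL term is the standard closed form for a diagonal Gaussian posterior against a standard normal prior. Fusion by simple concatenation is an assumption, since the abstract only says the representations are "fused", and all function names and numbers here are hypothetical.

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))

def beta_vae_loss(reconstruction_error, mu, logvar, beta=1.0):
    """Reconstruction error plus beta-weighted KL term.

    beta = 1 gives the plain VAE objective; beta > 1 weights the KL term
    more heavily, as in a beta-VAE.
    """
    return reconstruction_error + beta * gaussian_kl(mu, logvar)

def fuse(latent, topic_probs):
    """Assumed fusion: concatenate the latent vector with the topic probabilities."""
    return list(latent) + list(topic_probs)

# Hypothetical values: a 2-dimensional latent code and 3 topic probabilities.
loss = beta_vae_loss(reconstruction_error=2.0, mu=[1.0], logvar=[0.0], beta=4.0)
features = fuse([0.1, -0.3], [0.7, 0.2, 0.1])
```

The fused vector would then be the input to whichever downstream classifier is being evaluated.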

Citations: 0

Opinion Mining on Social Media Text Using Optimized Deep Belief Networks

IF 2 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Asian and Low-Resource Language Information Processing

Pub Date : 2024-03-02 DOI: 10.1145/3649502

S. Vinayaga Vadivu, P. Nagaraj, B. S. Murugan

In the digital world, most people spend much of their leisure time on social media networks such as Facebook, Twitter, and Instagram. Users post their views on products, services, and political parties on these sites, where many other users and brands see them. From these posts and tweets, users' emotions and polarities can be extracted to obtain opinions about products or services, and sentiment analysis (opinion mining) techniques are applied for this purpose. The enormous amount of data available on social media networks has rapidly attracted many researchers to this field. The same techniques can also classify the sentiment of a text as moderate, neutral, low extreme, or high extreme. Extracting sentiment from social media datasets is arduous, however, because they mix formal and informal text, emojis, and symbols. To extract feature vectors from social media datasets and to group texts accurately by sentiment, we propose a novel method, the Deep Belief Network-based Dynamic Grouping-based Cooperative Optimization method (DBN-based DGCO). The data are first preprocessed into the required text format, and the feature vectors are then extracted by the ICS algorithm. The extracted data are classified into the moderate, neutral, low extreme, and high extreme groups with the DBN-based DGCO method. For experimental analysis, we took two social media datasets and compared the proposed method with the state-of-the-art HEMOS, WOA-SITO, PDCNN, and NB-LSVC methods in terms of accuracy/precision, recall, F1 score, and ROC. The accuracy/precision, recall, and F1 score obtained by the proposed ICS-DBN-DGCO method are 89%, 80%, and 98.2%, respectively.
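For reference, the per-class definitions of the metrics named above can be sketched as follows. For a single class, F1 is the harmonic mean of precision and recall and therefore lies between them; the abstract does not say how its figures were averaged over the four sentiment classes or the two datasets. The function names and counts below are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = harmonic_f1(precision, recall)
    return precision, recall, f1

def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one sentiment class.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)

# Harmonic mean of a precision of 0.89 and a recall of 0.80 (about 0.843).
f1_single_class = harmonic_f1(0.89, 0.80)
```

Multi-class results are usually reported as macro or micro averages of these per-class values, which is one way aggregate figures can differ from the single-class relationship.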

Citations: 0