Pre-trained language models learn informative word representations from large-scale text corpora through self-supervised learning and have achieved promising performance on natural language processing (NLP) tasks after fine-tuning. These models, however, suffer from poor robustness and a lack of interpretability. We refer to pre-trained language models with knowledge injection as knowledge-enhanced pre-trained language models (KEPLMs). Such models demonstrate deeper understanding and logical reasoning and introduce interpretability. In this survey, we provide a comprehensive overview of KEPLMs in NLP. We first discuss the advancements in pre-trained language models and knowledge representation learning. We then systematically categorize existing KEPLMs from three different perspectives. Finally, we outline some potential directions of KEPLMs for future research.
{"title":"A Survey of Knowledge Enhanced Pre-trained Language Models","authors":"Jian Yang, Xinyu Hu, Gang Xiao, Yulong Shen","doi":"10.1145/3631392","DOIUrl":"https://doi.org/10.1145/3631392","url":null,"abstract":"<p>Pre-trained language models learn informative word representations on a large-scale text corpus through self-supervised learning, which has achieved promising performance in fields of natural language processing (NLP) after fine-tuning. These models, however, suffer from poor robustness and lack of interpretability. We refer to pre-trained language models with knowledge injection as knowledge-enhanced pre-trained language models (KEPLMs). These models demonstrate deep understanding and logical reasoning and introduce interpretability. In this survey, we provide a comprehensive overview of KEPLMs in NLP. We first discuss the advancements in pre-trained language models and knowledge representation learning. Then we systematically categorize existing KEPLMs from three different perspectives. Finally, we outline some potential directions of KEPLMs for future research.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140008632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article aimed to address the problems of word-order confusion, context dependency, and ambiguity that traditional machine translation (MT) methods face in verb recognition. By applying advanced artificial intelligence algorithms, verb recognition can be processed better and the quality and accuracy of MT improved. Building on neural machine translation (NMT), basic attention mechanisms, historical attention information (used to dynamically obtain information related to already-generated words), and constraint mechanisms were introduced to embed semantic information, represent polysemy, and annotate the semantic roles of verbs. The article used Workshop on Machine Translation (WMT) data together with the British National Corpus (BNC), Gutenberg, Reuters, and OpenSubtitles corpora, and augmented the data in these corpora. The improved NMT model was compared with traditional NMT models, rule-based machine translation (RBMT), and statistical machine translation (SMT). The experimental results showed that the improved NMT model achieved an average verb semantic matching degree of 0.85 and an average Bilingual Evaluation Understudy (BLEU) score of 0.90 across the five corpora. The improved NMT model can effectively improve the accuracy of verb recognition in MT, providing a new method for this task.
{"title":"Exploration on Advanced Intelligent Algorithms of Artificial Intelligence for Verb Recognition in Machine Translation","authors":"Qinghua Ai, Qingyan Ai, Jun Wang","doi":"10.1145/3649891","DOIUrl":"https://doi.org/10.1145/3649891","url":null,"abstract":"<p>This article aimed to address the problems of word order confusion, context dependency, and ambiguity in traditional machine translation (MT) methods for verb recognition. By applying advanced intelligent algorithms of artificial intelligence, verb recognition can be better processed and the quality and accuracy of MT can be improved. Based on Neural machine translation (NMT), basic attention mechanisms, historical attention information, dynamically obtain information related to the generated words, and constraint mechanisms were introduced to embed semantic information, represent polysemy, and annotate semantic roles of verbs. This article used the Workshop on machine translation (WMT), British National Corpus (BNC), Gutenberg, Reuters Corpus, OpenSubtitles corpus, and enhanced the data in the corpus. The improved NMT model was compared with traditional NMT models, Rule Based machine translation (RBMT), and Statistical machine translation (SMT). The experimental results showed that the average verb semantic matching degree of the improved NMT model in 5 corpora was 0.85, and the average Bilingual Evaluation Understudy (BLEU) score in 5 corpora was 0.90. The improved NMT model in this article can effectively improve the accuracy of verb recognition in MT, providing new methods for verb recognition in MT.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140008685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Part-of-speech tagging plays a vital role in text processing and natural language understanding. Very few attempts have been made in the past at tagging Pashto parts of speech. In this work, we present an LSTM-based approach for Pashto part-of-speech tagging with a special focus on ambiguity resolution. We first created a corpus of Pashto sentences containing words with multiple meanings, together with their tags. We introduce a powerful sentence representation and a new architecture for Pashto text processing. The accuracy of the proposed approach is compared with a state-of-the-art Hidden Markov Model. Our model achieves 87.60% accuracy on all words excluding punctuation and 95.45% on ambiguous words, whereas the Hidden Markov Model achieves 78.37% and 44.72%, respectively. The results show that our approach outperforms the Hidden Markov Model in part-of-speech tagging for Pashto text.
{"title":"Leveraging Bidirectionl LSTM with CRFs for Pashto tagging","authors":"Farooq Zaman, Onaiza Maqbool, Jaweria Kanwal","doi":"10.1145/3649456","DOIUrl":"https://doi.org/10.1145/3649456","url":null,"abstract":"<p>Part-of-speech tagging plays a vital role in text processing and natural language understanding. Very few attempts have been made in the past for tagging Pashto Part-of-Speech. In this work, we present LSTM based approach for Pashto part-of-speech tagging with special focus on ambiguity resolution. Initially we created a corpus of Pashto sentences having words with multiple meanings and their tags. We introduce a powerful sentences representation and new architecture for Pashto text processing. The accuracy of the proposed approach is compared with state-of-the-art Hidden Markov Model. Our Model shows 87.60% accuracy for all words excluding punctuations and 95.45% for ambiguous words, on the other hand Hidden Markov Model shows 78.37% and 44.72% accuracy respectively. Results show that our approach outperform Hidden Markov Model in Part-of-Speech tagging for Pashto text.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139977793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work, we introduce WAFFNet, an attention-centric feature-fusion architecture tailored for word-level multilingual scene-text script identification. Motivated by the limitations of traditional approaches that rely exclusively on feature-based methods or deep-learning strategies, our approach amalgamates statistical and deep features to bridge the gap. At the core of WAFFNet, we fuse Local Binary Pattern features, a prominent descriptor capturing low-level texture, with high-dimensional, semantically rich convolutional features. This fusion is judiciously augmented by a spatial attention mechanism, ensuring targeted emphasis on semantically critical regions of the input image. To address the class-imbalance problem in multi-class classification scenarios, we employ a weighted objective function, which also regularizes the learning process. The architectural integrity of WAFFNet is preserved through an end-to-end training paradigm, leveraging transfer learning to expedite convergence and optimize performance. Considering the under-representation of regional Indian languages in current datasets, we meticulously curated IIITG-STLI2023, a comprehensive dataset encapsulating English alongside six under-represented Indian languages: Hindi, Kannada, Malayalam, Telugu, Bengali, and Manipuri. Rigorous evaluation on IIITG-STLI2023, as well as on the established MLe2e and SIW-13 datasets, underscores WAFFNet's superiority over both traditional feature-engineering approaches and state-of-the-art deep-learning frameworks. The proposed WAFFNet framework thus offers a robust and effective solution for script identification in scene-text images.
{"title":"A Hybrid Scene Text Script Identification Network for regional Indian Languages","authors":"Veronica Naosekpam, Nilkanta Sahu","doi":"10.1145/3649439","DOIUrl":"https://doi.org/10.1145/3649439","url":null,"abstract":"<p>In this work, we introduce WAFFNet, an attention-centric feature fusion architecture tailored for word-level multi-lingual scene text script identification. Motivated by the limitations of traditional approaches that rely exclusively on feature-based methods or deep learning strategies, our approach amalgamates statistical and deep features to bridge the gap. At the core of WAFFNet, we utilized the merits of Local Binary Pattern —a prominent descriptor capturing low-level texture features with high-dimensional, semantically-rich convolutional features. This fusion is judiciously augmented by a spatial attention mechanism, ensuring targeted emphasis on semantically critical regions of the input image. To address the class imbalance problem in multi-class classification scenarios, we employed a weighted objective function. This not only regularizes the learning process but also addresses the class imbalance problem. The architectural integrity of WAFFNet is preserved through an end-to-end training paradigm, leveraging transfer learning to expedite convergence and optimize performance metrics. Considering the under-representation of regional Indian languages in current datasets, we meticulously curated IIITG-STLI2023, a comprehensive dataset encapsulating English alongside six under-represented Indian languages: Hindi, Kannada, Malayalam, Telugu, Bengali, and Manipuri. Rigorous evaluation of the IIITG-STLI2023, as well as the established MLe2e and SIW-13 datasets, underscores WAFFNet’s supremacy over both traditional feature-engineering approaches as well as state-of-the-art deep learning frameworks. Thus, the proposed WAFFNet framework offers a robust and effective solution for language identification in scene text images.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139949859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A classification system for hazards in air traffic control was investigated using the Human Factors Analysis and Classification System (HFACS) framework and natural language processing, with the goal of preventing hazardous situations in air traffic control. Building on the HFACS standard, an air traffic control hazard classification scheme was created. Hazard data from the aviation safety management system were selected, classified, and labeled into five levels. A TF-IDF/TextRank text classification method based on key-content extraction, together with text classification models based on CNN and BERT, was used in the experiments to address the problems of small samples, many labels, and random sampling in air traffic control hazard data. The results show that the overall trade-off between model training time and classification accuracy is best when the number of keywords is around 8. As the number of keywords increases, the time spent on feature dimensioning decreases, which affects accuracy. When the number of keywords reaches about 93, the dimensioning time increases while classification accuracy remains close to 0.7, so the added time cost lowers the overall benefit. The study demonstrates that extracting key content can address small-sample text classification problems and contribute to further research on the development of safety systems.
{"title":"A Natural Language Processing System for Text Classification Corpus Based on Machine Learning","authors":"Yawen Su","doi":"10.1145/3648361","DOIUrl":"https://doi.org/10.1145/3648361","url":null,"abstract":"<p>A classification system for hazardous materials in air traffic control was investigated using the Human Factors Analysis and Classification System (HFACS) framework and natural language processing to prevent hazardous situations in air traffic control. Based on the development of the HFACS standard, an air traffic control hazard classification system will be created. The dangerous data of the aviation safety management system is selected by dead bodies, classified and marked in 5 levels. TFIDF TextRank text classification method based on key content extraction and text classification model based on CNN and BERT model were used in the experiment to solve the problem of small samples, many labels and random samples in hazardous environment of air pollution control. The results show that the total cost of model training time and classification accuracy is the highest when the keywords are around 8. As the number of points increases, the time spent in dimensioning decreases and affects accuracy. When the number of points reaches about 93, the time spent in determining the size increases, but the accuracy of the allocation remains close to 0.7, but the increase in the value of time leads to a decrease in the total cost. It has been proven that extracting key content can solve text classification problems for small companies and contribute to further research in the development of security systems.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139910190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work proposes an efficient Summary Caption Technique (SCT) that takes a multimodal summary and image captions as input and retrieves, via the captions, the images most relevant to the multimodal summary. Matching a multimodal summary with an appropriate image is a challenging task in the computer vision (CV) and natural language processing (NLP) fields. Merging these fields is tedious, though the research community has steadily focused on cross-modal retrieval. The related problems include visual question answering, matching queries with images, and matching semantic relationships between two modalities to retrieve the corresponding image. Relevant works consider questions to match the relationships in visual information, use object detection to match text with visual information, and employ structural-level representations to align images with text. However, these techniques primarily focus on retrieving images for text or on image captioning; less effort has been spent on retrieving relevant images for a multimodal summary. Hence, our proposed technique extracts and merges features in a Hybrid Image Text (HIT) layer and embeds the captions in a word2vec semantic space, where contextual features and semantic relationships are compared and matched between the modality vectors using cosine similarity. In cross-modal retrieval, we obtain the top five related images and align with the multimodal summary the image that achieves the highest cosine score among those retrieved. The model is trained as a sequence-to-sequence model for 100 epochs, with sparse categorical cross-entropy reducing the information loss. Further experiments on the Multimodal Summarization with Multimodal Output (MSMO) dataset evaluate the quality of image alignment with an image-precision metric, which demonstrates the best results.
{"title":"SCT:Summary Caption Technique for Retrieving Relevant Images in Alignment with Multimodal Abstractive Summary","authors":"Shaik Rafi, Ranjita Das","doi":"10.1145/3645029","DOIUrl":"https://doi.org/10.1145/3645029","url":null,"abstract":"<p>This work proposes an efficient Summary Caption Technique(SCT) which considers the multimodal summary and image captions as input to retrieve the correspondence images from the captions that are highly influential to the multimodal summary. Matching a multimodal summary with an appropriate image is a challenging task in computer vision (CV) and natural language processing (NLP) fields. Merging of these fields are tedious, though the research community has steadily focused on the cross-modal retrieval. These issues include the visual question-answering, matching queries with the images, and semantic relationship matching between two modalities for retrieving the corresponding image. Relevant works consider in questions to match the relationship of visual information, object detection and to match the text with visual information, and employing structural-level representation to align the images with the text. However, these techniques are primarily focused on retrieving the images to text or for the image captioning. But less effort has been spent on retrieving relevant images for the multimodal summary. Hence, our proposed technique extracts and merge features in Hybrid Image Text(HIT) layer and captions in the semantic embeddings with word2vec where the contextual features and semantic relationships are compared and matched with each vector between the modalities, with cosine semantic similarity. In cross-modal retrieval, we achieve top five related images and align the relevant images to the multimodal summary that achieves the highest cosine score among the retrieved images. The model has been trained with seq-to-seq modal with 100 epochs, besides reducing the information loss by the sparse categorical cross entropy. Further, experimenting with the multimodal summarization with multimodal output dataset (MSMO), in cross-modal retrieval, helps to evaluate the quality of image alignment with an image-precision metric that demonstrate the best results.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139752680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the field of disease diagnosis, medical image classification faces inherent challenges due to factors including data imbalance, image-quality variability, annotation variability, and limited data availability and representativeness. Such challenges adversely affect an algorithm's ability to classify medical images, leading to biased model outcomes and inaccurate interpretations. In this paper, a novel Discrete Levy Flight Grey Wolf Optimizer (DLFGWO) is combined with a Random Forest (RF) classifier to address these limitations on biomedical datasets and achieve a better classification rate. DLFGWO-RF resolves image-quality variability in ultrasound images and limits classification inaccuracies in RF by handling incomplete and noisy data. A sheer focus on the majority class may lead to an unequal distribution of classes and thus to data imbalance. The DLFGWO balances such distributions by leveraging grey wolves, whose exploration and exploitation capabilities are improved using Discrete Levy Flight (DLF). It further optimizes the classifier's performance to achieve a balanced classification rate. DLFGWO-RF is designed to perform classification even on limited datasets, reducing the need for numerous expert annotations. In diabetic retinopathy grading, DLFGWO-RF reduces disagreements arising from annotation variability due to subjective interpretation. However, the diabetic retinopathy dataset fails to capture the full diversity of the population, which limits the generalization ability of the proposed DLFGWO-RF; fine-tuning the RF lets it adapt robustly to subgroups in the dataset, enhancing overall performance. Experiments are conducted on two widely used medical image datasets to test the efficacy of the model. The experimental results show that the DLFGWO-RF classifier achieves improved classification accuracy of 90-95%, outperforming existing techniques on various imbalanced datasets.
{"title":"Handling Imbalance and Limited Data in Thyroid Ultrasound and Diabetic Retinopathy Datasets Using Discrete Levy Flights Grey Wolf Optimizer Based Random Forest for Robust Medical Data Classification","authors":"Shobha Aswal, Neelu Jyothi Ahuja, Ritika Mehra","doi":"10.1145/3648363","DOIUrl":"https://doi.org/10.1145/3648363","url":null,"abstract":"<p>In the field of disease diagnosis, medical image classification faces an inherent challenge due to various factors involving data imbalance, image quality variability, annotation variability, and limited data availability and data representativeness. Such challenges affect the algorithm's classification ability on the medical images in an adverse way, which leads to biased model outcomes and inaccurate interpretations. In this paper, a novel Discrete Levy Flight Grey Wolf Optimizer (DLFGWO) is combined with the Random Forest (RF) classifier to address the above limitations on the biomedical datasets and to achieve better classification rate. The DLFGWO-RF resolves the image quality variability in ultrasound images and limits the inaccuracies on classification using RF by handling the incomplete and noisy data. The sheer focus on the majority class may lead to unequal distribution of classes and thus leads to data imbalance. The DLFGWO balances such distribution by leveraging grey wolves and its exploration and exploitation capabilities are improved using Discrete Levy Flight (DLF). It further optimizes the classifier's performance to achieve balanced classification rate. DLFGWO-RF is designed to perform classification even on limited datasets, thereby the requirement of numerous expert annotations can thus be reduced. In diabetic retinopathy grading, the DLFGWO-RF reduces disagreements in annotation variability using subjective interpretations. However, the representativeness of the diabetic retinopathy dataset fails to capture the entire population diversity, which limits the generalization ability of the proposed DLFGWO-RF. Thus, fine-tuning of RF can robustly adapt to the subgroups in the dataset, enhancing its overall performance. The experiments are conducted on two widely used medical image datasets to test the efficacy of the model. The experimental results show that the DLFGWO-RF classifier achieves improved classification accuracy between 90-95%, which outperforms the existing techniques for various imbalanced datasets.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139752548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Named Entity Recognition (NER) is an indispensable component of Natural Language Processing (NLP), which aims to identify and classify entities within text data. While Deep Learning (DL) models have excelled at NER for well-resourced languages like English, Spanish, and Chinese, they face significant hurdles when dealing with low-resource languages like Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity, a context-dependent lexicon, and the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and introducing three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from Language Models (ELMo). In particular, fine-tuning is performed on BERT-Base and ELMo using Urdu Wikipedia and news articles. Second, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens and employs pre-trained masked language models to predict the masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model combining a Transformer encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short- and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and the innovative DA approach, attains the highest F1 score of 93.99%, highlighting its efficacy for the U-NER task.
{"title":"Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture","authors":"Anil Ahmed, Degen Huang, Syed Yasser Arafat, Imran Hameed","doi":"10.1145/3648362","DOIUrl":"https://doi.org/10.1145/3648362","url":null,"abstract":"<p>Named Entity Recognition (NER) is an indispensable component of Natural Language Processing (NLP), which aims to identify and classify entities within text data. While Deep Learning (DL) models have excelled in NER for well-resourced languages like English, Spanish, and Chinese, they face significant hurdles when dealing with low-resource languages like Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity, context-dependent lexicon, and the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and introducing three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from language models (ELMo). In particular, fine-tuning is performed on BERT<sub>BASE</sub> and ELMo using Urdu Wikipedia and news articles. Secondly, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens, employing pre-trained masked language models to predict masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model combining a Transformer Encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and an innovative DA approach, attains the highest F1-Score of 93.99%, highlighting its efficacy for the U-NER task.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139752445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The COVID-19 pandemic in 2020 brought an unprecedented global crisis. After two years of control efforts, life gradually returned to the pre-pandemic state, but localized outbreaks continued to occur. Towards the end of 2022, COVID-19 resurged in China, leading to another disruption of people's lives and work. Much of the information on social media reflected people's views of and emotions towards the second outbreak, which differed distinctly from those during the first outbreak in 2020. To explore people's emotional attitudes towards the pandemic at different stages, and the underlying reasons, this study collected microblog data from November 2022 to January 2023 and from January to June 2020, encompassing Chinese reactions to the COVID-19 pandemic. Based on hesitancy and fuzzy intuition theory, we propose a hypothesis: hesitancy can be integrated into machine learning models to select suitable corpora for training, which not only improves accuracy but also enhances model efficiency. Based on this hypothesis, we designed a hesitancy-integrated model. The experimental results demonstrate the model's positive performance on a self-constructed database. Applying this model to analyze people's attitudes towards the pandemic, we obtained their sentiments in different months. We found that the most negative emotions appeared at the beginning of the pandemic, followed by emotional fluctuations influenced by social events and ultimately an overall positive trend. Combining word-cloud techniques with the Latent Dirichlet Allocation (LDA) model effectively helped explore the reasons behind the changes in attitude towards the pandemic.
{"title":"Sentiment Analysis Method of Epidemic-related Microblog Based on Hesitation Theory","authors":"Yang Yu, Dong Qiu, HuanYu Wan","doi":"10.1145/3648360","DOIUrl":"https://doi.org/10.1145/3648360","url":null,"abstract":"<p>The COVID-19 pandemic in 2020 brought an unprecedented global crisis. After two years of control efforts, life gradually returned to the pre-pandemic state, but localized outbreaks continued to occur. Towards the end of 2022, COVID-19 resurged in China, leading to another disruption of people’s lives and work. Many pieces of information on social media reflected people’s views and emotions towards the second outbreak, which showed distinct differences compared to the first outbreak in 2020. To explore people’s emotional attitudes towards the pandemic at different stages and the underlying reasons, this study collected microblog data from November 2022 to January 2023 and from January to June 2020, encompassing Chinese reactions to the COVID-19 pandemic. Based on hesitancy and the Fuzzy Intuition theory, we proposed a hypothesis: hesitancy can be integrated into machine learning models to select suitable corpora for training, which not only improves accuracy but also enhances model efficiency. Based on this hypothesis, we designed a hesitancy-integrated model. The experimental results demonstrated the model’s positive performance on a self-constructed database. By applying this model to analyze people’s attitudes towards the pandemic, we obtained their sentiments in different months. We found that the most negative emotions appeared at the beginning of the pandemic, followed by emotional fluctuations influenced by social events, ultimately showing an overall positive trend. Combining word cloud techniques and the Latent Dirichlet Allocation (LDA) model effectively helped explore the reasons behind the changes in pandemic attitude.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139752446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Within the context of video frame interpolation, complex motion modeling is the task of capturing, in a video sequence, where the moving objects are located in the interpolated frame and how to maintain the temporal consistency of motion. Existing video frame interpolation methods typically assign either a fixed-size motion kernel or a refined optical flow to model complex motions; however, they suffer from data redundancy and inaccurate representation of motion. This paper introduces a unified warping framework, named multi-scale expandable deformable convolution (MSEConv), for simultaneously performing complex motion modeling and frame interpolation. In the proposed framework, a deep fully convolutional neural network with global attention estimates multiple small-scale kernel weights with different expansion degrees and performs adaptive weight allocation for each pixel synthesis. Moreover, most kernel-based interpolation methods can be treated as special cases of the proposed MSEConv; thus, MSEConv can easily be transferred to other kernel-based frame interpolation methods for performance improvement. To further improve robustness to motion occlusions, an operation of mask occlusion is introduced. As a consequence, our proposed MSEConv performs on par with or even better than state-of-the-art kernel-based frame interpolation works on public datasets. Our source code and visual comparison results are available at https://github.com/Pumpkin123709/MSEConv.
{"title":"MSEConv: A Unified Warping Framework for Video Frame Interpolation","authors":"Xiangling Ding, Pu Huang, Dengyong Zhang, Wei Liang, Feng Li, Gaobo Yang, Xin Liao, Yue Li","doi":"10.1145/3648364","DOIUrl":"https://doi.org/10.1145/3648364","url":null,"abstract":"<p>Within the context of video frame interpolation, complex motion modeling is the task of capturing, in a video sequence, where the moving objects are located in the interpolated frame, and how to maintain the temporal consistency of motion. Existing video frame interpolation methods typically assign either a fixed size of the motion kernel or a refined optical flow to model complex motions. However, they have the limitation of data redundancy and inaccuracy representation of motion. This paper introduces a unified warping framework, named multi-scale expandable deformable convolution (MSEConv), for simultaneously performing complex motion modeling and frame interpolation. In the proposed framework, a deep fully convolutional neural network with global attention is proposed to estimate multiple small-scale kernel weights with different expansion degrees and adaptive weight allocation for each pixel synthesis. Moreover, most of the kernel-based interpolation methods can be treated as the special case of the proposed MSEConv, thus, MSEConv can be easily transferred to other kernel-based frame interpolation methods for performance improvement. To further improve the robustness of motion occlusions, an operation of mask occlusion is introduced. As a consequence, our proposed MSEConv shows strong performance on par or even better than the state-of-the-art kernel-based frame interpolation works on public datasets. Our source code and visual comparable results are available at https://github.com/Pumpkin123709/MSEConv.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139752444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}