Researchers are increasingly eager to develop techniques for extracting emotional data from new sources, driven by the exponential growth of subjective information on Web 2.0. One of the most challenging aspects of textual emotion detection is collecting data with emotion labels, given the subjectivity involved in labeling emotions. Our research aims to aid the development of effective solutions to this problem. We propose a Deep Convolutional Belief-based Spatial Network Model (DCB-SNM) as a semi-automated technique to tackle this challenge. The model involves two basic phases of analysis, text and video, in which pre-trained annotators identify the dominant emotion. Our work evaluates the impact of this automatic pre-annotation approach on manual emotion annotation in terms of annotation time and agreement. The annotation-time data indicate an increase of roughly 20% when the pre-annotation procedure is used, without negatively affecting the annotators' skill, demonstrating the benefits of pre-annotation approaches. Additionally, pre-annotation proves particularly advantageous for contributors with low prediction accuracy, enhancing overall annotation efficiency and reliability.
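The agreement side of such an evaluation is typically quantified with a chance-corrected statistic. A minimal sketch in pure Python of Cohen's kappa between a manual pass and a pre-annotation-assisted pass (the label lists are hypothetical, not from the paper's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (observed - expected) / (1 - expected)

# Hypothetical emotion labels from a manual pass and an assisted pass
manual = ["joy", "joy", "anger", "anger"]
assisted = ["joy", "anger", "anger", "anger"]
print(cohens_kappa(manual, assisted))  # 0.5
```

Observed agreement here is 3/4 and chance agreement 1/2, so kappa is 0.5; comparing such scores with and without pre-annotation is one way to check that the procedure does not degrade agreement.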
"A Dense Spatial Network Model for Emotion Recognition Using Learning Approaches" by L. V. and Dinesh Kumar Anguraj. ACM Transactions on Asian and Low-Resource Language Information Processing, 10 August 2024. https://doi.org/10.1145/3688000
The advancement of medicine presents challenges for modern societies, especially unpredictable falling incidents among the elderly, which can occur anywhere due to serious health issues. Delayed rescue of at-risk elders can be dangerous. Traditional elder-safety methods such as video surveillance or wearable sensors are inefficient and burdensome, wasting human resources and requiring caregivers to monitor for falls constantly. Thus, a more effective and convenient solution is needed to ensure elderly safety. In this paper, a method is presented for detecting human falls in naturally occurring scenes in video, using a traditional Convolutional Neural Network (CNN) model, Inception-v3, VGG-19, and two versions of the You Only Look Once (YOLO) model. The primary focus of this work is human fall detection through deep learning models. Specifically, the YOLO approach is adopted for object detection and tracking in video scenes. By applying YOLO, human subjects are identified and bounding boxes are generated around them. The classification of various human activities, including falls, is accomplished through the analysis of deformation features extracted from these bounding boxes. The traditional CNN model achieves an impressive 99.83% accuracy in human fall detection, surpassing other state-of-the-art methods. Its training time is longer than that of YOLO-v2 and YOLO-v3, but significantly shorter than that of Inception-v3, taking only around 10% of the latter's total training time.
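The bounding-box deformation idea can be illustrated with a toy heuristic: an upright person's box is taller than wide, and a fall flips that abruptly. This is only a sketch of the intuition with hypothetical box coordinates, not the paper's classifier:

```python
def aspect_ratios(boxes):
    """Width/height ratio of each (x, y, w, h) bounding box."""
    return [w / h for (_, _, w, h) in boxes]

def detect_fall(boxes, ratio_jump=1.5):
    """Flag a fall when the box deforms from upright (ratio < 1) to lying
    (ratio > 1) with a sharp relative jump between consecutive frames."""
    r = aspect_ratios(boxes)
    for prev, cur in zip(r, r[1:]):
        if prev < 1.0 and cur > 1.0 and cur / prev > ratio_jump:
            return True
    return False

# Hypothetical track: a standing box (40x120) followed by a lying box (120x40)
print(detect_fall([(0, 0, 40, 120), (5, 40, 120, 40)]))  # True
```

A learned model would replace the hand-set threshold with features extracted from many such box sequences, but the deformation signal it consumes is of this shape.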
"Learning and Vision-based Approach for Human Fall Detection and Classification in Naturally Occurring Scenes Using Video Data" by Shashvat Singh, Kumkum Kumari, and A. Vaish. ACM Transactions on Asian and Low-Resource Language Information Processing, 10 August 2024. https://doi.org/10.1145/3687125
The study aims to present an in-depth Sentiment Analysis (SA) grounded in the emotions present in speech signals. Nowadays, web-based applications of all kinds, ranging from social media platforms and video-sharing sites to e-commerce applications, provide support for Human-Computer Interfaces (HCIs). These applications allow users to share their experiences in all forms, such as text, audio, video, and GIFs. The most natural and fundamental form of self-expression is speech. Speech-Based Sentiment Analysis (SBSA) is the task of gaining insights from speech signals, classifying a statement as neutral, negative, or positive. Speech Emotion Recognition (SER), in turn, categorizes speech signals into the emotions disgust, fear, sadness, anger, happiness, and neutral. It is necessary to recognize the sentiments along with the depth of the emotions in the speech signals. To this end, the proposed methodology defines a text-oriented SA model that combines CNN and Bi-LSTM layers with an embedding layer, applied to text obtained from speech signals, achieving an accuracy of 84.49%. The methodology also includes an Emotion Analysis (EA) model based on a CNN that identifies the type of emotion present in the speech signal, with an accuracy of 95.12%. The presented architecture can also be applied to other domains such as product review systems, video recommendation systems, education, health, and security.
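The CNN stage of such a text model slides a kernel over the token-embedding sequence and pools the result. A dependency-free sketch of that operation (the embeddings and kernel values are made up for illustration; a real model learns them):

```python
def conv1d(embeddings, kernel, bias=0.0):
    """Valid 1-D convolution over a token-embedding sequence.
    embeddings: list of equal-length vectors; kernel: list of vectors (one per
    position in the sliding window)."""
    k = len(kernel)
    out = []
    for i in range(len(embeddings) - k + 1):
        s = bias
        for j in range(k):
            s += sum(e * w for e, w in zip(embeddings[i + j], kernel[j]))
        out.append(s)
    return out

def max_pool(feature_map):
    """Global max pooling over the feature map."""
    return max(feature_map)

# Hypothetical 2-d embeddings for four tokens, kernel of width 2
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
ker = [[1.0, -1.0], [0.5, 0.5]]
fm = conv1d(emb, ker)
print(fm, max_pool(fm))  # [1.5, 0.0, 0.0] 1.5
```

In the described architecture this pooled feature would then feed the Bi-LSTM and classification layers.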
"CNN-Based Models for Emotion and Sentiment Analysis Using Speech Data" by Anjum Madan and Devender Kumar. ACM Transactions on Asian and Low-Resource Language Information Processing, 8 August 2024. https://doi.org/10.1145/3687303
With the development of artificial intelligence, natural language processing enables us to better understand and utilize semantic information. However, traditional object detection algorithms cannot achieve effective performance when dealing with Tibetan opera mask datasets, which are characterized by limited samples, symmetrical patterns, and high inter-class distances. To address this issue, we propose a novel feature representation model with a recall loss function for detecting different masks. In the model, we develop an adaptive feature extraction network with fused layers to extract features. Furthermore, a lightweight, efficient attention mechanism is designed to enhance the significance of key features. Additionally, a recall loss function is proposed to increase the differences among classes. Finally, experimental results on the Tibetan opera mask dataset demonstrate that our proposed model outperforms the compared models.
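The abstract does not define the recall loss, but one plausible instantiation is a cross-entropy reweighted by how poorly each class is currently recalled, so under-recalled classes push harder on the update. A sketch under that assumption (the weighting scheme and numbers are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def recall_weighted_ce(logits, target, class_recall):
    """Cross-entropy scaled by (1 - recall) of the target class: classes the
    model currently recalls poorly contribute more to the loss."""
    p = softmax(logits)
    return (1.0 - class_recall[target]) * -math.log(p[target])

# Hypothetical 3-class case: class 2 has low running recall (0.4),
# so mistakes on it are penalized more heavily
logits = [0.2, 0.1, 0.7]
print(recall_weighted_ce(logits, 2, [0.9, 0.8, 0.4]))
```

The running per-class recall would be re-estimated each epoch from validation predictions; when a class reaches perfect recall its weight drops to zero.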
"Adaptive Semantic Information Extraction of Tibetan Opera Mask with Recall Loss" by Yao Wen, Jie Li, Donghong Cai, Zhicheng Dong, Fangkai Cai, Ping Lan, and Quan Zhou. ACM Transactions on Asian and Low-Resource Language Information Processing, 26 July 2024. https://doi.org/10.1145/3666041
To capture and integrate the structural and temporal features contained in the social graph and the diffusion cascade more effectively, an information diffusion prediction model based on a Transformer and a Relational Graph Convolutional Network (TRGCN) is proposed. First, a dynamic heterogeneous graph composed of the social network graph and the diffusion cascade graph is constructed and fed into the Relational Graph Convolutional Network (RGCN) to extract the structural features of each node. Second, the time embedding of each node is re-encoded using a Bi-directional Long Short-Term Memory (Bi-LSTM) network, and a time decay function is introduced to give different weights to nodes at different time positions, yielding the temporal features of the nodes. Finally, the structural and temporal features are fed into the Transformer and merged into spatial-temporal features for information diffusion prediction. Experimental results on three real-world datasets, Twitter, Douban, and Memetracker, show that compared with the best baseline, the TRGCN model achieves an average improvement of 4.16% in the Hits@100 metric and 13.26% in the MAP@100 metric, demonstrating the model's validity.
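The time decay idea can be sketched concretely: weight each cascade node by an exponentially decaying function of its age, then normalize. The decay rate and timestamps below are hypothetical, since the abstract does not specify the exact function:

```python
import math

def time_decay_weights(timestamps, now, lam=0.1):
    """Exponential time-decay weights, normalized to sum to 1: more recently
    activated nodes in the cascade receive larger weights."""
    raw = [math.exp(-lam * (now - t)) for t in timestamps]
    z = sum(raw)
    return [w / z for w in raw]

# Hypothetical cascade: nodes activated at t = 0, 5, 9; current time t = 10
weights = time_decay_weights([0, 5, 9], now=10)
print(weights)  # the most recent node gets the largest weight
```

These weights would then scale the nodes' time embeddings before the Bi-LSTM re-encoding step.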
"TRGCN: A Prediction Model for Information Diffusion Based on Transformer and Relational Graph Convolutional Network" by Jinghua Zhao, Xiting Lyu, Haiying Rong, and Jiale Zhao. ACM Transactions on Asian and Low-Resource Language Information Processing, 26 July 2024. https://doi.org/10.1145/3672074
Current supervised word sense disambiguation models have achieved high disambiguation accuracy by using annotated information for different word senses together with pre-trained language models. However, the semantic data of supervised word sense disambiguation models take the form of short texts, and much of the corpus information is not rich enough to distinguish senses in different scenarios. This paper proposes a bi-encoder word sense disambiguation method that combines a knowledge graph with the hierarchical structure of text. It introduces structured knowledge from the knowledge graph to supply extended semantic information, uses the hierarchy of the contextual input text to describe the meaning of words and phrases, and constructs a BERT-based bi-encoder with a graph attention network that reduces noise in the contextual input text, thereby improving disambiguation accuracy for target words in phrase form and the overall effectiveness of the method. Compared with the nine latest algorithms on five test datasets, the proposed method mostly outperforms the comparison algorithms in disambiguation accuracy.
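At inference time a bi-encoder reduces to scoring the context embedding against each candidate sense's gloss embedding and taking the best match. A sketch of that final step with hypothetical precomputed embeddings (a real system would obtain them from the BERT encoders):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def disambiguate(context_vec, sense_vecs):
    """Pick the sense whose gloss embedding is closest to the context embedding."""
    return max(sense_vecs, key=lambda sense: cosine(context_vec, sense_vecs[sense]))

# Hypothetical embeddings for "bank" in a financial context
context = [0.9, 0.1, 0.0]
senses = {
    "bank%financial": [1.0, 0.0, 0.1],
    "bank%river": [0.0, 1.0, 0.2],
}
print(disambiguate(context, senses))  # bank%financial
```

The paper's contribution sits upstream of this step, in how the context and gloss representations are enriched with knowledge-graph structure before scoring.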
"Word Sense Disambiguation Combining Knowledge Graph and Text Hierarchical Structure" by Yukun Cao, Chengkun Jin, Yijia Tang, and Ziyue Wei. ACM Transactions on Asian and Low-Resource Language Information Processing, 25 July 2024. https://doi.org/10.1145/3677524
Mood, a long-lasting affective state detached from specific stimuli, plays an important role in behavior. Although sentiment analysis and emotion classification have garnered attention, research on mood classification remains in its early stages. This study adopts a two-dimensional structure of affect, comprising "pleasantness" and "activation," to classify mood patterns. Emojis, graphic symbols representing emotions and concepts, are widely used in computer-mediated communication. Unlike previous studies that treat emojis as direct labels for emotion or sentiment, this work uses a pre-trained large language model that integrates both text and emojis to develop a mood classification model. Our contributions are three-fold. First, we annotate 10,000 Thai tweets with mood to train the models and release the dataset to the public. Second, we show that emojis contribute to determining mood to a lesser extent than text, far from mapping directly to mood. Third, by applying the trained model, we observe the correlation of moods during the Thai political turmoil of 2019-2020 on Thai Twitter and find a significant correlation. These moods closely reflect the news events and reveal one side of Thai public opinion during the turmoil.
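The two-dimensional structure of affect partitions the pleasantness-activation plane into quadrants, each associated with a family of moods. A minimal sketch of that mapping (the quadrant labels below are conventional circumplex-style names, not necessarily the paper's label set):

```python
def mood_quadrant(pleasantness, activation):
    """Map a (pleasantness, activation) pair in [-1, 1]^2 to a quadrant of
    the two-dimensional structure of affect."""
    if pleasantness >= 0:
        return "excited" if activation >= 0 else "calm"
    return "distressed" if activation >= 0 else "depressed"

print(mood_quadrant(0.7, 0.5))    # excited
print(mood_quadrant(-0.6, -0.4))  # depressed
```

A classifier trained on the annotated tweets would predict along these two axes (or the quadrant directly) rather than a flat emotion inventory.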
"Exploring the Correlation between Emojis and Mood Expression in Thai Twitter Discourse" by Attapol T. Rutherford and Pawitsapak Akarajaradwong. ACM Transactions on Asian and Low-Resource Language Information Processing, 24 July 2024. https://doi.org/10.1145/3680543
Translation from the mother tongue, including the Tunisian dialect, into Modern Standard Arabic is a highly significant field in natural language processing due to its wide range of applications and associated benefits. Recently, researchers have shown increased interest in the Tunisian dialect, primarily driven by the massive volume of content generated spontaneously by Tunisians on social media following the revolution. This paper presents two distinct translators for converting the Tunisian dialect into Modern Standard Arabic. The first translator uses a rule-based approach, employing a collection of finite-state transducers and a bilingual dictionary derived from the study corpus. The second translator relies on deep learning models, specifically a sequence-to-sequence transformer model and a parallel corpus. To assess, evaluate, and compare the performance of the two translators, we conducted tests using a parallel corpus comprising 8,599 words. The results achieved by both translators are noteworthy: the translator based on finite-state transducers achieved a BLEU score of 56.65, while the transformer-based translator achieved a higher score of 66.07.
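The core behaviour of the rule-based pipeline, stripped of its transducer machinery, is lexical substitution with passthrough for out-of-vocabulary words. A toy sketch; the word pairs below are illustrative placeholders, not entries from the paper's bilingual dictionary:

```python
def translate(sentence, lexicon):
    """Word-by-word lexical transduction with out-of-vocabulary passthrough,
    the simplest behaviour of a dictionary-backed transducer cascade."""
    return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

# Hypothetical toy lexicon mapping dialect tokens to MSA tokens
lexicon = {"barcha": "kathiran", "mselkha": "jayyid"}
print(translate("barcha mselkha xyz", lexicon))  # kathiran jayyid xyz
```

The actual system composes finite-state transducers, which additionally handle morphological rewriting and multi-word rules that a flat dictionary lookup cannot express.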
"Translation from Tunisian Dialect to Modern Standard Arabic: Exploring Finite-State Transducers and Sequence-to-Sequence Transformer Approaches" by Roua Torjmen and K. Haddar. ACM Transactions on Asian and Low-Resource Language Information Processing, 24 July 2024. https://doi.org/10.1145/3681788
Automatic speech recognition (ASR) has become an indispensable part of the AI domain, with various speech technologies reliant on it. The quality of speech recognition depends, among other factors, on the amount of annotated data used to train an ASR system. For a low-resource language this is a severe constraint, and ASR quality is thus often poor. Humans can read through text containing ASR errors, provided the context of the sentence is preserved. Yet in transcripts generated by ASR systems for low-resource languages, many important words are misrecognized and the context is mostly lost; discerning such a text becomes nearly impossible. This paper analyzes the types of transcription errors that occur when generating ASR transcripts of spoken documents in Bengali, an under-resourced language predominantly spoken in India and Bangladesh. The transcripts of the Bengali spoken documents are generated using the ASR of Google Cloud Speech. The paper also explores whether such transcription errors affect the generation of speech summaries of these spoken documents. Summarization is carried out extractively: sentences are selected from the ASR-generated text of the spoken document, and speech summaries are created by aggregating the speech segments of the selected sentences from the original recording. Subjective evaluation shows that the 'readability' of the spoken summaries is not degraded by ASR errors, but their quality is affected by the reliance on an intermediate text summary containing transcription errors.
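The extractive step can be sketched with a simple frequency-based scorer: rank sentences by the corpus frequency of their words and return the chosen indices in spoken order, so the matching speech segments can be concatenated. This is a generic baseline, not the paper's specific summarizer, and the sentences are hypothetical:

```python
from collections import Counter

def select_sentences(sentences, k=2):
    """Score each sentence by the average corpus frequency of its words and
    return the indices of the top-k sentences in their original (spoken) order."""
    words = [s.lower().split() for s in sentences]
    freq = Counter(w for ws in words for w in ws)
    scores = [sum(freq[w] for w in ws) / max(len(ws), 1) for ws in words]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # spoken order, so speech segments concatenate cleanly

sents = [
    "the flood warning was issued today",
    "officials said the flood may continue",
    "unrelated remark about lunch",
]
print(select_sentences(sents, k=2))  # [0, 1]
```

Because selection operates on the ASR transcript, a misrecognized keyword lowers a sentence's score even though its audio is clean, which is exactly the dependency on the intermediate text summary that the paper examines.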
"Analyzing the Effects of Transcription Errors on Summary Generation of Bengali Spoken Documents" by Priyanjana Chowdhury, Nabanika Sarkar, Sanghamitra Nath, and Utpal Sharma. ACM Transactions on Asian and Low-Resource Language Information Processing, 17 July 2024. https://doi.org/10.1145/3678005
The existence of noisy labels is inevitable in real-world large-scale corpora. Deep neural networks are notably vulnerable to overfitting on noisy samples, which highlights the importance of language models' ability to resist noise during training. However, little attention has been paid to alleviating the influence of label noise in natural language processing. To address this problem, we present CoMix, a robust noise-resistant training strategy that takes advantage of co-training to deal with textual annotation errors in text classification tasks. In our framework, the original training set is first split into labeled and unlabeled subsets according to a sample-partition criterion, and label refurbishment is then applied to the unlabeled subset. We perform textual interpolation in hidden space between samples of the updated subsets. Meanwhile, we train two diverged peer networks simultaneously, leveraging co-training strategies to avoid the accumulation of confirmation bias. Experimental results on three popular text classification benchmarks demonstrate the effectiveness of CoMix in bolstering the network's resistance to mislabeled samples under various noise types and ratios, outperforming state-of-the-art methods.
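The abstract does not spell out the partition criterion; a common instantiation in noisy-label learning is the small-loss criterion, where samples the network fits easily are presumed clean. A sketch under that assumption, with hypothetical per-sample losses:

```python
def partition_by_loss(losses, threshold):
    """Small-loss partition: samples whose loss falls below the threshold are
    treated as clean (keep their labels); the rest become unlabeled and are
    candidates for label refurbishment."""
    clean, noisy = [], []
    for idx, loss in enumerate(losses):
        (clean if loss < threshold else noisy).append(idx)
    return clean, noisy

# Hypothetical per-sample cross-entropy losses after a warm-up epoch
losses = [0.05, 1.9, 0.12, 2.4, 0.3]
print(partition_by_loss(losses, threshold=0.5))  # ([0, 2, 4], [1, 3])
```

In a co-training setup, each peer network would typically partition the data for the other, which is one way the accumulation of confirmation bias is avoided.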
"CoMix: Confronting with Noisy Label Learning with Co-training Strategies on Textual Mislabeling" by Shu Zhao, Zhuoer Zhao, Yangyang Xu, and Xiao Sun. ACM Transactions on Asian and Low-Resource Language Information Processing, 15 July 2024. https://doi.org/10.1145/3678175