
ACM Transactions on Asian and Low-Resource Language Information Processing — Latest Publications

X-Phishing-Writer: A Framework for Cross-Lingual Phishing Email Generation
IF 2.0 | CAS Region 4 (Computer Science) | Q2 Computer Science | Pub Date: 2024-06-03 | DOI: 10.1145/3670402
Shih-Wei Guo, Yao-Chung Fan

Cybercrime is projected to cause annual business losses of $10.5 trillion by 2025, a significant concern given that a majority of security breaches are due to human errors, especially through phishing attacks. The rapid increase in daily identified phishing sites over the past decade underscores the pressing need to enhance defenses against such attacks. Social Engineering Drills (SEDs) are essential in raising awareness about phishing, yet face challenges in creating effective and diverse phishing email content. These challenges are exacerbated by the limited availability of public datasets and concerns over using external language models like ChatGPT for phishing email generation. To address these issues, this paper introduces X-Phishing-Writer, a novel cross-lingual Few-Shot phishing email generation framework. X-Phishing-Writer allows for the generation of emails based on minimal user input, leverages single-language datasets for multilingual email generation, and is designed for internal deployment using a lightweight, open-source language model. Incorporating Adapters into an Encoder-Decoder architecture, X-Phishing-Writer marks a significant advancement in the field, demonstrating superior performance in generating phishing emails across 25 languages when compared to baseline models. Experimental results and real-world drills involving 1,682 users showcase a 17.67% email open rate and a 13.33% hyperlink click-through rate, affirming the framework’s effectiveness and practicality in enhancing phishing awareness and defense.
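The reported drill metrics are simple proportions over all recipients. A minimal sketch of that computation (our illustration, not the authors' tooling; the counts below are hypothetical, not taken from the paper):

```python
# Illustrative sketch: open and click-through rates for a social engineering
# drill, expressed as percentages of all drill recipients.

def drill_rates(n_users: int, n_opened: int, n_clicked: int) -> tuple[float, float]:
    """Return (open_rate, click_rate) as percentages, rounded to 2 decimals."""
    open_rate = 100.0 * n_opened / n_users
    click_rate = 100.0 * n_clicked / n_users
    return round(open_rate, 2), round(click_rate, 2)

# Hypothetical counts for a drill population.
opens, clicks = drill_rates(1000, 200, 100)
```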

Citations: 0
Automatic Algerian Sarcasm Detection from Texts and Images
IF 2.0 | CAS Region 4 (Computer Science) | Q2 Computer Science | Pub Date: 2024-06-03 | DOI: 10.1145/3670403
Kheira Zineb Bousmaha, Khaoula Hamadouche, Hadjer Djouabi, Lamia Hadrich-Belguith

In recent years, the number of Algerian Internet users has significantly increased, providing a valuable opportunity for collecting and utilizing opinions and sentiments expressed online. They now post not just texts but also images. However, to benefit from this wealth of information, it is crucial to address the challenge of sarcasm detection, which poses a limitation in sentiment analysis. Sarcasm often involves the use of non-literal and ambiguous language, making its detection complex. To enhance the quality and relevance of sentiment analysis, it is essential to develop effective methods for sarcasm detection. By overcoming this limitation, we can fully harness the expressed online opinions and benefit from their valuable insights for a better understanding of trends and sentiments among the Algerian public. In this work, our aim is to develop a comprehensive system that addresses sarcasm detection in Algerian dialect, encompassing both text and image analysis. We propose a hybrid approach that combines linguistic characteristics and machine learning techniques for text analysis. Additionally, for image analysis, we utilized the deep learning model VGG-19 for image classification, and employed the EasyOCR technique for Arabic text extraction. By integrating these approaches, we strive to create a robust system capable of detecting sarcasm in both textual and visual content in the Algerian dialect. Our system achieved an accuracy of 92.79% for the textual models and 89.28% for the visual model.
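The paper combines a text classifier with a VGG-19-based image classifier. One common way to combine two such classifiers is weighted late fusion of their output probabilities; the sketch below is our stand-in for the paper's (unspecified) combination step, and the weight is an assumption:

```python
# Hedged sketch: weighted late fusion of a text-based and an image-based
# sarcasm probability. The 0.6 text weight is illustrative, not the paper's.

def fuse_scores(p_text: float, p_image: float, w_text: float = 0.6) -> float:
    """Convex combination of the two classifiers' sarcasm probabilities."""
    return w_text * p_text + (1.0 - w_text) * p_image

def is_sarcastic(p_text: float, p_image: float, threshold: float = 0.5) -> bool:
    """Final decision: fused probability against a fixed threshold."""
    return fuse_scores(p_text, p_image) >= threshold
```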

Citations: 0
KannadaLex: A lexical database with psycholinguistic information
IF 2.0 | CAS Region 4 (Computer Science) | Q2 Computer Science | Pub Date: 2024-06-03 | DOI: 10.1145/3670688
Shreya R. Aithal, Muralikrishna Sn, Raghavendra Ganiga, Ashwath Rao, Govardhan Hegde

Databases of lexical properties are of primary importance to psycholinguistic research and speech-language therapy. Several lexical databases for different languages have been developed in recent years, but Kannada, a language spoken by 50.8 million people, still has no comprehensive lexical database. To address this, KannadaLex, a Kannada lexical database, is built as a language resource containing orthographic, phonological, and syllabic information about words sourced from newspaper articles of the last decade. Alongside these, vital statistics such as phonological neighbourhood, syllable complexity, summed syllable and bigram syllable frequencies, and lemma and inflectional-family information are stored. The database is validated by correlating frequency, a well-established psycholinguistic feature, with the other numerical features. The resulting database contains 170K words from varied disciplines, complete with psycholinguistic features. KannadaLex is thus a comprehensive resource for psycholinguists, speech therapists, and linguistic researchers analyzing Kannada and similar languages. Psycholinguists require lexical data for choosing stimuli in experiments that study how humans acquire, use, comprehend, and produce language; speech and language therapists query such databases to develop the most effective stimuli for evaluating, diagnosing, and treating communication disorders, and for rehabilitating speech after brain injuries.
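Two of the statistics named in the abstract, summed syllable frequency and bigram-syllable frequency, can be sketched over a toy syllabified corpus (transliterated toy syllables for readability; the real database works on Kannada script):

```python
from collections import Counter

# Illustrative sketch of two KannadaLex-style statistics over a toy corpus
# in which each word is given as a list of syllables.

corpus = [["ka", "na", "da"], ["ka", "vi"], ["na", "di"]]

# Corpus-wide syllable and syllable-bigram frequency tables.
syll_freq = Counter(s for word in corpus for s in word)
bigram_freq = Counter(
    (word[i], word[i + 1]) for word in corpus for i in range(len(word) - 1)
)

def summed_syllable_frequency(word: list[str]) -> int:
    """Sum of the corpus frequencies of each syllable in the word."""
    return sum(syll_freq[s] for s in word)
```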

Citations: 0
Document-Level Relation Extraction Based on Machine Reading Comprehension and Hybrid Pointer-sequence Labeling
IF 2.0 | CAS Region 4 (Computer Science) | Q2 Computer Science | Pub Date: 2024-06-01 | DOI: 10.1145/3666042
Xiaoyi Wang, Jie Liu, Jiong Wang, Jianyong Duan, Guixia Guan, Qing Zhang, Jianshe Zhou

Document-level relation extraction requires reading, memorization, and reasoning to discover relevant factual information spread across multiple sentences. Current hierarchical-network and graph-network methods struggle to fully capture the structural information behind a document and to reason naturally from its context. Unlike previous methods, this paper recasts the relation extraction task as a machine reading comprehension task: each pair of entities and relations is characterized by a question template, and extracting entities and relations is translated into identifying answers from the context. To enhance the extraction model's context comprehension and achieve more precise extraction, we introduce large language models (LLMs) during question construction, enabling the generation of exemplary answers. In addition, to solve the multi-label and multi-entity problems in documents, we propose a new answer extraction model based on hybrid pointer-sequence labeling, which improves the model's reasoning ability and supports extracting zero or multiple answers from a document. Extensive experiments on three public datasets show that the proposed method is effective.
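Two of the abstract's ideas can be sketched concretely: turning an (entity, relation) pair into a question template, and pointer-style extraction of zero or more answer spans from per-token start/end scores. This is our illustration under assumed scoring conventions, not the authors' code:

```python
# Hedged sketch of MRC-style relation extraction. The question wording and the
# 0.5 threshold are our assumptions; scores would come from a trained model.

def build_question(head: str, relation: str) -> str:
    """Characterize an (entity, relation) pair as an MRC question."""
    return f"Which entity has the relation '{relation}' with '{head}'?"

def extract_spans(start_scores, end_scores, threshold=0.5, max_len=5):
    """Return all (start, end) token spans whose boundary scores clear the
    threshold; returns an empty list when no answer is present."""
    spans = []
    for i, s in enumerate(start_scores):
        if s < threshold:
            continue
        for j in range(i, min(i + max_len, len(end_scores))):
            if end_scores[j] >= threshold:
                spans.append((i, j))
                break
    return spans
```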

Citations: 0
Quantitative Stylistic Analysis of Middle Chinese Texts Based on the Dissimilarity of Evolutive Core Word Usage
IF 2.0 | CAS Region 4 (Computer Science) | Q2 Computer Science | Pub Date: 2024-05-28 | DOI: 10.1145/3665794
Bing Qiu, Jiahao Huo

Stylistic analysis enables open-ended, exploratory observation of languages. To fill the gap in quantitative analysis of the stylistic systems of Middle Chinese, we construct lexical features based on evolutive core word usage and devise a Bayesian method for estimating the feature parameters. The lexical features come from the Swadesh list, whose entries took on different word forms as the language evolved during the Middle Ages; we therefore take the varying word forms of these entries over the course of that evolution as linguistic features. With the Bayesian formulation, the feature parameters are estimated to construct a high-dimensional random feature vector, from which a pairwise dissimilarity matrix over all the texts is obtained under different distance measures. Finally, we perform spectral embedding and clustering to visualize, categorize, and analyze the linguistic styles of Middle Chinese texts. The quantitative results agree with existing qualitative conclusions and, furthermore, deepen our understanding of the linguistic styles of Middle Chinese from both inter-category and intra-category perspectives. They also help unveil the special styles induced by indirect language contact.
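The middle step of the pipeline, turning per-text feature vectors into a pairwise dissimilarity matrix under a chosen distance measure, can be sketched as follows (toy two-dimensional vectors and Euclidean distance; the paper uses high-dimensional Swadesh-based features and several distance measures):

```python
import math

# Minimal sketch: pairwise dissimilarity matrix from per-text feature vectors.

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dissimilarity_matrix(vectors, dist=euclidean):
    """Symmetric n-by-n matrix of pairwise distances between the texts."""
    n = len(vectors)
    return [[dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
```

The resulting matrix is what spectral embedding and clustering would then operate on.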

Citations: 0
SCBG: Semantic-Constrained Bidirectional Generation for Emotional Support Conversation
IF 2.0 | CAS Region 4 (Computer Science) | Q2 Computer Science | Pub Date: 2024-05-27 | DOI: 10.1145/3666090
Yangyang Xu, Zhuoer Zhao, Xiao Sun

The Emotional Support Conversation (ESC) task aims to deliver consolation, encouragement, and advice to individuals undergoing emotional distress, thereby assisting them in overcoming difficulties. In the context of emotional support dialogue systems, it is of utmost importance to generate user-relevant and diverse responses. However, previous methods failed to take into account these crucial aspects, resulting in a tendency to produce universal and safe responses (e.g., “I do not know” and “I am sorry to hear that”). To tackle this challenge, a semantic-constrained bidirectional generation (SCBG) framework is utilized for generating more diverse and user-relevant responses. Specifically, we commence by selecting keywords that encapsulate the ongoing dialogue topics based on the context. Subsequently, a bidirectional generator generates responses incorporating these keywords. Two distinct methodologies, namely statistics-based and prompt-based methods, are employed for keyword extraction. Experimental results on the ESConv dataset demonstrate that the proposed SCBG framework improves response diversity and user relevance while ensuring response quality.
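The "statistics-based" keyword-selection step can be sketched as frequency-based selection of content words from the dialogue context. This is our toy English stand-in (stopword list and example are ours; the paper works on the ESConv corpus):

```python
from collections import Counter

# Hedged sketch: pick the k most frequent non-stopword tokens in the dialogue
# context as topic keywords for the bidirectional generator to build around.

STOPWORDS = {"i", "my", "the", "a", "and", "so", "is", "to", "am", "it"}

def extract_keywords(context: str, k: int = 2) -> list[str]:
    tokens = [w.strip(".,!?").lower() for w in context.split()]
    counts = Counter(w for w in tokens if w and w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]
```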

Citations: 0
MizBERT: A Mizo BERT Model
IF 2.0 | CAS Region 4 (Computer Science) | Q2 Computer Science | Pub Date: 2024-05-25 | DOI: 10.1145/3666003
Robert Lalramhluna, Sandeep Dash, Partha Pakray

This research investigates the use of pre-trained BERT transformers for the Mizo language. BERT (Bidirectional Encoder Representations from Transformers) is Google's widely adopted neural-network approach to Natural Language Processing (NLP), renowned for its strong performance across a range of NLP tasks. However, its efficacy for low-resource languages such as Mizo remains largely unexplored. In this study, we introduce MizBERT, a specialized Mizo language model. Through extensive pre-training on a corpus collected from diverse online platforms, MizBERT has been tailored to the nuances of the Mizo language. MizBERT's capabilities are evaluated with two primary metrics, Masked Language Modeling (MLM) accuracy and perplexity, yielding scores of 76.12% and 3.2565, respectively. Its performance on a text classification task is also examined: MizBERT outperforms both the multilingual BERT (mBERT) model and a Support Vector Machine (SVM) baseline, achieving an accuracy of 98.92%. This underscores MizBERT's proficiency in understanding and processing the intricacies of the Mizo language.
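The perplexity metric reported above is conventionally the exponential of the mean per-token negative log-likelihood over held-out masked positions. A minimal sketch of that computation (toy probabilities; obtaining the real likelihoods requires the trained model):

```python
import math

# Sketch: perplexity as exp(mean negative log-likelihood) over the model's
# probabilities for the true tokens at masked positions.

def perplexity(token_probs: list[float]) -> float:
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```

A model that assigns probability 0.5 to every true masked token has perplexity 2; lower values (like MizBERT's 3.2565) indicate a sharper predictive distribution.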

Citations: 0
Abusive Comment Detection in Tamil Code-Mixed Data by Adjusting Class Weights and Refining Features
IF 2.0 | CAS Region 4 (Computer Science) | Q2 Computer Science | Pub Date: 2024-05-18 | DOI: 10.1145/3664619
Gayathri G L, Krithika Swaminathan, Divyasri Krishnakumar, Thenmozhi D, Bharathi B

In recent years, a significant portion of the content on various platforms on the internet has been found to be offensive or abusive. Abusive comment detection can go a long way in preventing internet users from facing the adverse effects of coming in contact with abusive language. This problem is particularly challenging when the comments are found in low-resource languages like Tamil or Tamil-English code-mixed text. So far, there has not been any substantial work on abusive comment detection using imbalanced datasets. Furthermore, significant work has not been performed, especially for Tamil code-mixed data, that involves analysing the dataset for classification and accordingly creating a custom vocabulary for preprocessing. This paper proposes a novel approach to classify abusive comments from an imbalanced dataset using a customised training vocabulary and a combination of statistical feature selection with language-agnostic feature selection while making use of explainable AI for feature refinement. Our model achieved an accuracy of 74% and a macro F1-score of 0.46.
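A common way to adjust class weights for an imbalanced dataset, as the title describes, is the "balanced" heuristic w_c = N / (K * n_c) (the same formula scikit-learn uses); the paper does not spell out its exact scheme, so this sketch is an assumption:

```python
from collections import Counter

# Hedged sketch: inverse-frequency ("balanced") class weights, so that rare
# classes such as abusive comments contribute more to the training loss.

def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}
```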

Citations: 0
Towards Better Quantity Representations for Solving Math Word Problems
IF 2, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-05-18, DOI: 10.1145/3665644
Runxin Sun, Shizhu He, Jun Zhao, Kang Liu

Solving a math word problem requires selecting the quantities in it and performing appropriate arithmetic operations on them to obtain the answer. For deep learning-based methods, it is vital to obtain good quantity representations, i.e., to selectively aggregate, with appropriate emphasis, the information in each quantity's context. However, existing works have not paid much attention to this aspect: many simply encode quantities as ordinary tokens, or use implicit or rule-based methods to select information from their context. This leads to poor results when dealing with linguistic variations and confounding quantities. This paper proposes a novel method that identifies question-related distinguishing features of quantities by contrasting each quantity's context with the question and with the contexts of other quantities, thereby enhancing the representation of quantities. Our method not only considers the contrastive relationship between quantities but also considers multiple relationships jointly. In addition, we propose two auxiliary tasks to further guide the representation learning of quantities: 1) predicting whether a quantity is used in the question; 2) predicting the relations (operators) between quantities given the question. Experimental results show that our method outperforms previous methods on SVAMP and ASDiv-A under similar settings, including some newly released strong baselines. Supplementary experiments further confirm that our method improves quantity selection by improving the representation of both quantities and questions.
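The core contrastive idea described here, pulling a quantity's context representation toward the question while pushing it away from other quantities, can be sketched with a minimal InfoNCE-style loss. This is a hedged numpy sketch, not the paper's model: the embedding dimension, temperature, and random vectors are placeholder assumptions.

```python
# Minimal numpy sketch, NOT the paper's model: an InfoNCE-style contrastive
# loss over cosine similarities. A quantity's context embedding is the anchor,
# the question embedding is the positive, and other quantities are negatives.
import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """-log softmax probability of the positive pair (positive at index 0)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    return float(np.log(np.exp(logits).sum()) - logits[0])

rng = np.random.default_rng(0)
question = rng.normal(size=16)                   # question embedding
related = question + 0.1 * rng.normal(size=16)   # quantity actually used in the question
others = rng.normal(size=(4, 16))                # confounding quantities

loss_related = contrastive_loss(related, question, others)
loss_confound = contrastive_loss(others[0], question, others[1:])
print(loss_related, loss_confound)
```

Minimizing this loss during training encourages exactly the behaviour the abstract describes: question-related quantities end up close to the question in embedding space, while confounding quantities are pushed apart.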

Citations: 0
CHUNAV: Analyzing Hindi Hate Speech and Targeted Groups in Indian Election Discourse
IF 2, CAS Tier 4 (Computer Science), Q2 Computer Science, Pub Date: 2024-05-16, DOI: 10.1145/3665245
F. Jafri, Kritesh Rauniyar, Surendrabikram Thapa, Mohammad Aman Siddiqui, Matloob Khushi, Usman Naseem
In the ever-evolving landscape of online discourse and political dialogue, the rise of hate speech poses a significant challenge to maintaining a respectful and inclusive digital environment. The context becomes particularly complex when considering the Hindi language—a low-resource language with limited available data. To address this pressing concern, we introduce the CHUNAV dataset—a collection of 11,457 Hindi tweets gathered during assembly elections in various states. CHUNAV is purpose-built for hate speech categorization and the identification of target groups. The dataset is a valuable resource for exploring hate speech within the distinctive socio-political context of Indian elections. The tweets within CHUNAV have been meticulously categorized into “Hate” and “Non-Hate” labels, and further subdivided to pinpoint the specific targets of hate speech, including “Individual”, “Organization”, and “Community” labels (as shown in Figure 1). Furthermore, this paper presents multiple benchmark models for hate speech detection, along with an innovative ensemble and oversampling-based method. The paper also delves into the results of topic modeling, all aimed at effectively addressing hate speech and target identification in the Hindi language. This contribution seeks to advance the field of hate speech analysis and foster a safer and more inclusive online space within the distinctive realm of Indian Assembly Elections.
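Two of the ingredients this abstract names, oversampling and an ensemble, can be sketched together as follows. This is a hedged sketch, not the CHUNAV authors' exact method: the synthetic features stand in for text embeddings, and the class sizes, estimators, and sampling ratio are placeholder assumptions.

```python
# Hedged sketch, NOT the CHUNAV authors' exact pipeline: random oversampling
# of the minority "Hate" class plus a soft-voting ensemble for an imbalanced
# binary hate-speech classification task.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# Synthetic stand-ins for text embeddings: 90 "Non-Hate" vs 10 "Hate" examples.
X = np.vstack([rng.normal(0.0, 1.0, (90, 5)),
               rng.normal(2.0, 1.0, (10, 5))])
y = np.array([0] * 90 + [1] * 10)

# Random oversampling: duplicate minority rows until the classes balance.
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=80, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    voting="soft",  # average predicted probabilities across the models
)
ensemble.fit(X_bal, y_bal)
print(ensemble.predict(rng.normal(3.0, 0.5, (3, 5))))  # points deep in the "Hate" region
```

Soft voting averages the two models' predicted probabilities, so a borderline case flagged strongly by one model can still be caught, which is one common motivation for ensembling on noisy social-media text.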
Citations: 0