Cybercrime is projected to cause annual business losses of $10.5 trillion by 2025, a significant concern given that a majority of security breaches are due to human errors, especially through phishing attacks. The rapid increase in daily identified phishing sites over the past decade underscores the pressing need to enhance defenses against such attacks. Social Engineering Drills (SEDs) are essential in raising awareness about phishing, yet face challenges in creating effective and diverse phishing email content. These challenges are exacerbated by the limited availability of public datasets and concerns over using external language models like ChatGPT for phishing email generation. To address these issues, this paper introduces X-Phishing-Writer, a novel cross-lingual Few-Shot phishing email generation framework. X-Phishing-Writer allows for the generation of emails based on minimal user input, leverages single-language datasets for multilingual email generation, and is designed for internal deployment using a lightweight, open-source language model. Incorporating Adapters into an Encoder-Decoder architecture, X-Phishing-Writer marks a significant advancement in the field, demonstrating superior performance in generating phishing emails across 25 languages when compared to baseline models. Experimental results and real-world drills involving 1,682 users showcase a 17.67% email open rate and a 13.33% hyperlink click-through rate, affirming the framework’s effectiveness and practicality in enhancing phishing awareness and defense.
{"title":"X-Phishing-Writer: A Framework for Cross-Lingual Phishing Email Generation","authors":"Shih-Wei Guo, Yao-Chung Fan","doi":"10.1145/3670402","DOIUrl":"https://doi.org/10.1145/3670402","url":null,"abstract":"<p>Cybercrime is projected to cause annual business losses of $10.5 trillion by 2025, a significant concern given that a majority of security breaches are due to human errors, especially through phishing attacks. The rapid increase in daily identified phishing sites over the past decade underscores the pressing need to enhance defenses against such attacks. Social Engineering Drills (SEDs) are essential in raising awareness about phishing, yet face challenges in creating effective and diverse phishing email content. These challenges are exacerbated by the limited availability of public datasets and concerns over using external language models like ChatGPT for phishing email generation. To address these issues, this paper introduces X-Phishing-Writer, a novel cross-lingual Few-Shot phishing email generation framework. X-Phishing-Writer allows for the generation of emails based on minimal user input, leverages single-language datasets for multilingual email generation, and is designed for internal deployment using a lightweight, open-source language model. Incorporating Adapters into an Encoder-Decoder architecture, X-Phishing-Writer marks a significant advancement in the field, demonstrating superior performance in generating phishing emails across 25 languages when compared to baseline models. Experimental results and real-world drills involving 1,682 users showcase a 17.67% email open rate and a 13.33% hyperlink click-through rate, affirming the framework’s effectiveness and practicality in enhancing phishing awareness and defense.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, the number of Algerian Internet users has increased significantly, providing a valuable opportunity for collecting and utilizing opinions and sentiments expressed online. Users now post not only text but also images. However, to benefit from this wealth of information, it is crucial to address the challenge of sarcasm detection, which limits the reliability of sentiment analysis. Sarcasm often involves non-literal and ambiguous language, making its detection complex. To enhance the quality and relevance of sentiment analysis, it is essential to develop effective methods for sarcasm detection. By overcoming this limitation, we can fully harness the opinions expressed online and gain valuable insights into trends and sentiments among the Algerian public. In this work, our aim is to develop a comprehensive system for sarcasm detection in the Algerian dialect, encompassing both text and image analysis. We propose a hybrid approach that combines linguistic features and machine learning techniques for text analysis. For image analysis, we use the VGG-19 deep learning model for image classification and EasyOCR for Arabic text extraction. By integrating these approaches, we build a robust system capable of detecting sarcasm in both textual and visual content in the Algerian dialect. Our system achieved an accuracy of 92.79% for the textual models and 89.28% for the visual model.
{"title":"Automatic Algerian Sarcasm Detection from Texts and Images","authors":"Kheira Zineb Bousmaha, Khaoula Hamadouche, Hadjer Djouabi, Lamia Hadrich-Belguith","doi":"10.1145/3670403","DOIUrl":"https://doi.org/10.1145/3670403","url":null,"abstract":"<p>In recent years, the number of Algerian Internet users has significantly increased, providing a valuable opportunity for collecting and utilizing opinions and sentiments expressed online. They now post not just texts but also images. However, to benefit from this wealth of information, it is crucial to address the challenge of sarcasm detection, which poses a limitation in sentiment analysis. Sarcasm often involves the use of non-literal and ambiguous language, making its detection complex. To enhance the quality and relevance of sentiment analysis, it is essential to develop effective methods for sarcasm detection. By overcoming this limitation, we can fully harness the expressed online opinions and benefit from their valuable insights for a better understanding of trends and sentiments among the Algerian public. In this work, our aim is to develop a comprehensive system that addresses sarcasm detection in Algerian dialect, encompassing both text and image analysis. We propose a hybrid approach that combines linguistic characteristics and machine learning techniques for text analysis. Additionally, for image analysis, we utilized the deep learning model VGG-19 for image classification, and employed the EasyOCR technique for Arabic text extraction. By integrating these approaches, we strive to create a robust system capable of detecting sarcasm in both textual and visual content in the Algerian dialect. Our system achieved an accuracy of 92.79% for the textual models and 89.28% for the visual model.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Databases containing lexical properties are of primary importance to psycholinguistic research and speech-language therapy. Several lexical databases for different languages have been developed in the recent past, but Kannada, a language spoken by 50.8 million people, has no comprehensive lexical database yet. To address this, KannadaLex, a Kannada lexical database, is built as a language resource containing orthographic, phonological, and syllabic information about words sourced from newspaper articles from the last decade. Alongside these, vital statistics such as phonological neighbourhood, syllable complexity, summed syllable and bigram syllable frequencies, and lemma and inflectional family information are stored. The database is validated by correlating frequency, a well-established psycholinguistic feature, with the other numerical features. The developed lexical database contains 170K words from varied disciplines, complete with psycholinguistic features. KannadaLex is a comprehensive resource for psycholinguists, speech therapists, and linguistic researchers analyzing Kannada and similar languages. Psycholinguists require lexical data for choosing stimuli to conduct experiments that study the factors enabling humans to acquire, use, comprehend, and produce language. Speech and language therapists query such databases to develop the most effective stimuli for evaluating, diagnosing, and treating communication disorders, and for rehabilitating speech after brain injuries.
{"title":"KannadaLex: A lexical database with psycholinguistic information","authors":"Shreya R. Aithal, Muralikrishna Sn, Raghavendra Ganiga, Ashwath Rao, Govardhan Hegde","doi":"10.1145/3670688","DOIUrl":"https://doi.org/10.1145/3670688","url":null,"abstract":"<p>Databases containing lexical properties are of primary importance to psycholinguistic research and speech-language therapy. Several lexical databases for different languages have been developed in the recent past, but Kannada, a language spoken by 50.8 million people, has no comprehensive lexical database yet. To address this, <i>KannadaLex</i>, a Kannada lexical database is built as a language resource that contains orthographic, phonological, and syllabic information about words that are sourced from newspaper articles from the last decade. Along with these vital statistics like the phonological neighbourhood, syllable complexity summed syllable and bigram syllable frequencies, and lemma and inflectional family information are stored. The database is validated by correlating frequency, a well-established psycholinguistic feature, with other numerical features. The developed lexical database contains 170K words from varied disciplines, complete with psycholinguistic features. This <i>KannadaLex</i> is a comprehensive resource for psycholinguists, speech therapists, and linguistic researchers for analyzing Kannada and other similar languages. Psycholinguists require lexical data for choosing stimuli to conduct experiments that study the factors that enable humans to acquire, use, comprehend, and produce language. Speech and language therapists query these databases for developing the most efficient stimuli for evaluating, diagnosing, and treating communication disorders, and rehabilitation of speech after brain injuries.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Document-level relation extraction requires reading, memorization, and reasoning to discover relevant factual information spread across multiple sentences. Current hierarchical-network and graph-network methods struggle to fully capture the structural information behind a document and to reason naturally from the context. Unlike previous methods, this paper reformulates the relation extraction task as a machine reading comprehension task. Each pair of entities and relations is characterized by a question template, and the extraction of entities and relations is translated into identifying answers from the context. To enhance the context comprehension ability of the extraction model and achieve more precise extraction, we introduce large language models (LLMs) during question construction, enabling the generation of exemplary answers. Furthermore, to solve the multi-label and multi-entity problems in documents, we propose a new answer extraction model based on hybrid pointer-sequence labeling, which improves the reasoning ability of the model and enables the extraction of zero or multiple answers from a document. Extensive experiments on three public datasets show that the proposed method is effective.
{"title":"Document-Level Relation Extraction Based on Machine Reading Comprehension and Hybrid Pointer-sequence Labeling","authors":"xiaoyi wang, Jie Liu, Jiong Wang, Jianyong Duan, guixia guan, qing zhang, Jianshe Zhou","doi":"10.1145/3666042","DOIUrl":"https://doi.org/10.1145/3666042","url":null,"abstract":"<p>Document-level relational extraction requires reading, memorization and reasoning to discover relevant factual information in multiple sentences. It is difficult for the current hierarchical network and graph network methods to fully capture the structural information behind the document and make natural reasoning from the context. Different from the previous methods, this paper reconstructs the relation extraction task into a machine reading comprehension task. Each pair of entities and relationships is characterized by a question template, and the extraction of entities and relationships is translated into identifying answers from the context. To enhance the context comprehension ability of the extraction model and achieve more precise extraction, we introduce large language models (LLMs) during question construction, enabling the generation of exemplary answers. Besides, to solve the multi-label and multi-entity problems in documents, we propose a new answer extraction model based on hybrid pointer-sequence labeling, which improves the reasoning ability of the model and realizes the extraction of zero or multiple answers in documents. Extensive experiments on three public datasets show that the proposed method is effective.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stylistic analysis enables open-ended and exploratory observation of languages. To fill the gap in quantitative analysis of the stylistic systems of Middle Chinese, we construct lexical features based on evolutive core word usage and devise a Bayesian method for feature parameter estimation. The lexical features are drawn from the Swadesh list, each entry of which took on different word forms as the language evolved during the Middle Ages. We thus count the varying word forms of these entries across this evolution as the linguistic features. Under the Bayesian formulation, the feature parameters are estimated to construct a high-dimensional random feature vector, from which the pairwise dissimilarity matrix of all the texts is obtained under different distance measures. Finally, we perform spectral embedding and clustering to visualize, categorize, and analyze the linguistic styles of Middle Chinese texts. The quantitative results agree with existing qualitative conclusions and, furthermore, deepen our understanding of the linguistic styles of Middle Chinese from both inter-category and intra-category perspectives. They also help unveil the special styles induced by indirect language contact.
{"title":"Quantitative Stylistic Analysis of Middle Chinese Texts Based on the Dissimilarity of Evolutive Core Word Usage","authors":"Bing Qiu, Jiahao Huo","doi":"10.1145/3665794","DOIUrl":"https://doi.org/10.1145/3665794","url":null,"abstract":"<p>Stylistic analysis enables open-ended and exploratory observation of languages. To fill the gap in the quantitative analysis of the stylistic systems of Middle Chinese, we construct lexical features based on the evolutive core word usage and scheme a Bayesian method for feature parameters estimation. The lexical features are from the Swadesh list, each of which has different word forms along with the language evolution during the Middle Ages. We thus count the varied word of those entries along with the language evolution as the linguistic features. With the Bayesian formulation, the feature parameters are estimated to construct a high-dimensional random feature vector in order to obtain the pair-wise dissimilarity matrix of all the texts based on different distance measures. Finally, we perform the spectral embedding and clustering to visualize, categorize and analyze the linguistic styles of Middle Chinese texts. The quantitative result agrees with the existing qualitative conclusions and furthermore, betters our understanding of the linguistic styles of Middle Chinese from both the inter-category and intra-category aspects. It also helps unveil the special styles induced by the indirect language contact.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Emotional Support Conversation (ESC) task aims to deliver consolation, encouragement, and advice to individuals undergoing emotional distress, thereby assisting them in overcoming difficulties. In the context of emotional support dialogue systems, it is of utmost importance to generate user-relevant and diverse responses. However, previous methods failed to take into account these crucial aspects, resulting in a tendency to produce universal and safe responses (e.g., “I do not know” and “I am sorry to hear that”). To tackle this challenge, a semantic-constrained bidirectional generation (SCBG) framework is utilized for generating more diverse and user-relevant responses. Specifically, we commence by selecting keywords that encapsulate the ongoing dialogue topics based on the context. Subsequently, a bidirectional generator generates responses incorporating these keywords. Two distinct methodologies, namely statistics-based and prompt-based methods, are employed for keyword extraction. Experimental results on the ESConv dataset demonstrate that the proposed SCBG framework improves response diversity and user relevance while ensuring response quality.
{"title":"SCBG: Semantic-Constrained Bidirectional Generation for Emotional Support Conversation","authors":"Yangyang Xu, Zhuoer Zhao, Xiao Sun","doi":"10.1145/3666090","DOIUrl":"https://doi.org/10.1145/3666090","url":null,"abstract":"<p>The Emotional Support Conversation (ESC) task aims to deliver consolation, encouragement, and advice to individuals undergoing emotional distress, thereby assisting them in overcoming difficulties. In the context of emotional support dialogue systems, it is of utmost importance to generate user-relevant and diverse responses. However, previous methods failed to take into account these crucial aspects, resulting in a tendency to produce universal and safe responses (e.g., “I do not know” and “I am sorry to hear that”). To tackle this challenge, a semantic-constrained bidirectional generation (SCBG) framework is utilized for generating more diverse and user-relevant responses. Specifically, we commence by selecting keywords that encapsulate the ongoing dialogue topics based on the context. Subsequently, a bidirectional generator generates responses incorporating these keywords. Two distinct methodologies, namely statistics-based and prompt-based methods, are employed for keyword extraction. Experimental results on the ESConv dataset demonstrate that the proposed SCBG framework improves response diversity and user relevance while ensuring response quality.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141170384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robert Lalramhluna, Sandeep Dash, Dr. Partha Pakray
This research investigates the utilization of pre-trained BERT transformers within the context of the Mizo language. BERT, an abbreviation for Bidirectional Encoder Representations from Transformers, denotes Google’s foremost neural network approach to Natural Language Processing (NLP), renowned for its remarkable performance across various NLP tasks. However, its efficacy in handling low-resource languages such as Mizo remains largely unexplored. In this study, we introduce MizBERT, a specialized Mizo language model. Through extensive pre-training on a corpus collected from diverse online platforms, MizBERT has been tailored to accommodate the nuances of the Mizo language. MizBERT’s capabilities are evaluated using two primary metrics: masked language modeling (MLM) accuracy and perplexity, yielding scores of 76.12% and 3.2565, respectively. Additionally, its performance on a text classification task is examined. Results indicate that MizBERT outperforms both the multilingual BERT (mBERT) model and a Support Vector Machine (SVM) baseline, achieving an accuracy of 98.92%. This underscores MizBERT’s proficiency in understanding and processing the intricacies inherent in the Mizo language.
{"title":"MizBERT: A Mizo BERT Model","authors":"Robert Lalramhluna, Sandeep Dash, Dr.Partha Pakray","doi":"10.1145/3666003","DOIUrl":"https://doi.org/10.1145/3666003","url":null,"abstract":"<p>This research investigates the utilization of pre-trained BERT transformers within the context of the Mizo language. BERT, an abbreviation for Bidirectional Encoder Representations from Transformers, symbolizes Google’s forefront neural network approach to Natural Language Processing (NLP), renowned for its remarkable performance across various NLP tasks. However, its efficacy in handling low-resource languages such as Mizo remains largely unexplored. In this study, we introduce <i>MizBERT</i>, a specialized Mizo language model. Through extensive pre-training on a corpus collected from diverse online platforms, <i>MizBERT</i> has been tailored to accommodate the nuances of the Mizo language. Evaluation of <i>MizBERT’s</i> capabilities is conducted using two primary metrics: Masked Language Modeling (MLM) and Perplexity, yielding scores of 76.12% and 3.2565, respectively. Additionally, its performance in a text classification task is examined. Results indicate that <i>MizBERT</i> outperforms both the multilingual BERT (mBERT) model and the Support Vector Machine (SVM) algorithm, achieving an accuracy of 98.92%. This underscores <i>MizBERT’s</i> proficiency in understanding and processing the intricacies inherent in the Mizo language.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141151258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gayathri G L, Krithika Swaminathan, Divyasri Krishnakumar, Thenmozhi D, Bharathi B
In recent years, a significant portion of the content on various Internet platforms has been found to be offensive or abusive. Abusive comment detection can go a long way in preventing Internet users from facing the adverse effects of coming into contact with abusive language. This problem is particularly challenging when the comments are written in low-resource languages like Tamil or Tamil-English code-mixed text. So far, there has not been any substantial work on abusive comment detection using imbalanced datasets. Furthermore, little work, especially on Tamil code-mixed data, has involved analysing the dataset for classification and accordingly creating a custom vocabulary for preprocessing. This paper proposes a novel approach to classify abusive comments from an imbalanced dataset using a customised training vocabulary and a combination of statistical feature selection with language-agnostic feature selection, while making use of explainable AI for feature refinement. Our model achieved an accuracy of 74% and a macro F1-score of 0.46.
{"title":"Abusive Comment Detection in Tamil Code-Mixed Data by Adjusting Class Weights and Refining Features","authors":"Gayathri G L, Krithika Swaminathan, Divyasri Krishnakumar, Thenmozhi D, Bharathi B","doi":"10.1145/3664619","DOIUrl":"https://doi.org/10.1145/3664619","url":null,"abstract":"<p>In recent years, a significant portion of the content on various platforms on the internet has been found to be offensive or abusive. Abusive comment detection can go a long way in preventing internet users from facing the adverse effects of coming in contact with abusive language. This problem is particularly challenging when the comments are found in low-resource languages like Tamil or Tamil-English code-mixed text. So far, there has not been any substantial work on abusive comment detection using imbalanced datasets. Furthermore, significant work has not been performed, especially for Tamil code-mixed data, that involves analysing the dataset for classification and accordingly creating a custom vocabulary for preprocessing. This paper proposes a novel approach to classify abusive comments from an imbalanced dataset using a customised training vocabulary and a combination of statistical feature selection with language-agnostic feature selection while making use of explainable AI for feature refinement. Our model achieved an accuracy of 74% and a macro F1-score of 0.46.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141060706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Solving a math word problem requires selecting quantities in it and performing appropriate arithmetic operations to obtain the answer. For deep learning-based methods, it is vital to obtain good quantity representations, i.e., to selectively and emphatically aggregate information in the context of quantities. However, existing works have not paid much attention to this aspect. Many works simply encode quantities as ordinary tokens, or use some implicit or rule-based methods to select information in their context. This leads to poor results when dealing with linguistic variations and confounding quantities. This paper proposes a novel method to identify question-related distinguishing features of quantities by contrasting their context with the question and the context of other quantities, thereby enhancing the representation of quantities. Our method not only considers the contrastive relationship between quantities, but also considers multiple relationships jointly. Besides, we propose two auxiliary tasks to further guide the representation learning of quantities: 1) predicting whether a quantity is used in the question; 2) predicting the relations (operators) between quantities given the question. Experimental results show that our method outperforms previous methods on SVAMP and ASDiv-A under similar settings, even some newly released strong baselines. Supplementary experiments further confirm that our method indeed improves the performance of quantity selection by improving the representation of both quantities and questions.
{"title":"Towards Better Quantity Representations for Solving Math Word Problems","authors":"Runxin Sun, Shizhu He, Jun Zhao, Kang Liu","doi":"10.1145/3665644","DOIUrl":"https://doi.org/10.1145/3665644","url":null,"abstract":"<p>Solving a math word problem requires selecting quantities in it and performing appropriate arithmetic operations to obtain the answer. For deep learning-based methods, it is vital to obtain good quantity representations, i.e., to selectively and emphatically aggregate information in the context of quantities. However, existing works have not paid much attention to this aspect. Many works simply encode quantities as ordinary tokens, or use some implicit or rule-based methods to select information in their context. This leads to poor results when dealing with linguistic variations and confounding quantities. This paper proposes a novel method to identify question-related distinguishing features of quantities by contrasting their context with the question and the context of other quantities, thereby enhancing the representation of quantities. Our method not only considers the contrastive relationship between quantities, but also considers multiple relationships jointly. Besides, we propose two auxiliary tasks to further guide the representation learning of quantities: 1) predicting whether a quantity is used in the question; 2) predicting the relations (operators) between quantities given the question. Experimental results show that our method outperforms previous methods on SVAMP and ASDiv-A under similar settings, even some newly released strong baselines. Supplementary experiments further confirm that our method indeed improves the performance of quantity selection by improving the representation of both quantities and questions.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141060662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
F. Jafri, Kritesh Rauniyar, Surendrabikram Thapa, Mohammad Aman Siddiqui, Matloob Khushi, Usman Naseem
In the ever-evolving landscape of online discourse and political dialogue, the rise of hate speech poses a significant challenge to maintaining a respectful and inclusive digital environment. The context becomes particularly complex when considering the Hindi language—a low-resource language with limited available data. To address this pressing concern, we introduce the CHUNAV dataset—a collection of 11,457 Hindi tweets gathered during assembly elections in various states. CHUNAV is purpose-built for hate speech categorization and the identification of target groups. The dataset is a valuable resource for exploring hate speech within the distinctive socio-political context of Indian elections. The tweets within CHUNAV have been meticulously categorized into “Hate” and “Non-Hate” labels, and further subdivided to pinpoint the specific targets of hate speech, including “Individual”, “Organization”, and “Community” labels (as shown in Figure 1). Furthermore, this paper presents multiple benchmark models for hate speech detection, along with an innovative ensemble and oversampling-based method. The paper also delves into the results of topic modeling, all aimed at effectively addressing hate speech and target identification in the Hindi language. This contribution seeks to advance the field of hate speech analysis and foster a safer and more inclusive online space within the distinctive realm of Indian Assembly Elections.
{"title":"CHUNAV: Analyzing Hindi Hate Speech and Targeted Groups in Indian Election Discourse","authors":"F. Jafri, Kritesh Rauniyar, Surendrabikram Thapa, Mohammad Aman Siddiqui, Matloob Khushi, Usman Naseem","doi":"10.1145/3665245","DOIUrl":"https://doi.org/10.1145/3665245","url":null,"abstract":"\u0000 In the ever-evolving landscape of online discourse and political dialogue, the rise of hate speech poses a significant challenge to maintaining a respectful and inclusive digital environment. The context becomes particularly complex when considering the Hindi language—a low-resource language with limited available data. To address this pressing concern, we introduce the\u0000 CHUNAV\u0000 dataset—a collection of 11,457 Hindi tweets gathered during assembly elections in various states.\u0000 CHUNAV\u0000 is purpose-built for hate speech categorization and the identification of target groups. The dataset is a valuable resource for exploring hate speech within the distinctive socio-political context of Indian elections. The tweets within\u0000 CHUNAV\u0000 have been meticulously categorized into “Hate” and “Non-Hate” labels, and further subdivided to pinpoint the specific targets of hate speech, including “Individual”, “Organization”, and “Community” labels (as shown in Figure 1). Furthermore, this paper presents multiple benchmark models for hate speech detection, along with an innovative ensemble and oversampling-based method. The paper also delves into the results of topic modeling, all aimed at effectively addressing hate speech and target identification in the Hindi language. This contribution seeks to advance the field of hate speech analysis and foster a safer and more inclusive online space within the distinctive realm of Indian Assembly Elections.\u0000","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":2.0,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140970249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}