{"title":"通过持续预训练在低资源代码混合语言中进行语法感知的攻击性内容检测","authors":"Necva Bölücü, Pelin Canbay","doi":"10.1145/3653450","DOIUrl":null,"url":null,"abstract":"<p>Social media is a widely used platform that includes a vast amount of user-generated content, allowing the extraction of information about users’ thoughts from texts. Individuals freely express their thoughts on these platforms, often without constraints, even if the content is offensive or contains hate speech. The identification and removal of offensive content from social media are imperative to prevent individuals or groups from becoming targets of harmful language. Despite extensive research on offensive content detection, addressing this challenge in code-mixed languages remains unsolved, characterised by issues such as imbalanced datasets and limited data sources. Most previous studies on detecting offensive content in these languages focus on creating datasets and applying deep neural networks, such as Recurrent Neural Networks (RNNs), or pre-trained language models (PLMs) such as BERT and its variations. Given the low-resource nature and imbalanced dataset issues inherent in these languages, this study delves into the efficacy of the syntax-aware BERT model with continual pre-training for the accurate identification of offensive content and proposes a framework called Cont-Syntax-BERT by combining continual learning with continual pre-training. Comprehensive experimental results demonstrate that the proposed Cont-Syntax-BERT framework outperforms state-of-the-art approaches. Notably, this framework addresses the challenges posed by code-mixed languages, as evidenced by its proficiency on the DravidianCodeMix [10,19] and HASOC 2109 [37] datasets. These results demonstrate the adaptability of the proposed framework in effectively addressing the challenges of code-mixed languages.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":null,"pages":null},"PeriodicalIF":1.8000,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Syntax-aware Offensive Content Detection in Low-resourced Code-mixed Languages with Continual Pre-training\",\"authors\":\"Necva Bölücü, Pelin Canbay\",\"doi\":\"10.1145/3653450\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Social media is a widely used platform that includes a vast amount of user-generated content, allowing the extraction of information about users’ thoughts from texts. Individuals freely express their thoughts on these platforms, often without constraints, even if the content is offensive or contains hate speech. The identification and removal of offensive content from social media are imperative to prevent individuals or groups from becoming targets of harmful language. Despite extensive research on offensive content detection, addressing this challenge in code-mixed languages remains unsolved, characterised by issues such as imbalanced datasets and limited data sources. Most previous studies on detecting offensive content in these languages focus on creating datasets and applying deep neural networks, such as Recurrent Neural Networks (RNNs), or pre-trained language models (PLMs) such as BERT and its variations. 
Given the low-resource nature and imbalanced dataset issues inherent in these languages, this study delves into the efficacy of the syntax-aware BERT model with continual pre-training for the accurate identification of offensive content and proposes a framework called Cont-Syntax-BERT by combining continual learning with continual pre-training. Comprehensive experimental results demonstrate that the proposed Cont-Syntax-BERT framework outperforms state-of-the-art approaches. Notably, this framework addresses the challenges posed by code-mixed languages, as evidenced by its proficiency on the DravidianCodeMix [10,19] and HASOC 2109 [37] datasets. These results demonstrate the adaptability of the proposed framework in effectively addressing the challenges of code-mixed languages.</p>\",\"PeriodicalId\":54312,\"journal\":{\"name\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-03-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3653450\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3653450","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Syntax-aware Offensive Content Detection in Low-resourced Code-mixed Languages with Continual Pre-training
Social media platforms host vast amounts of user-generated content from which information about users’ thoughts can be extracted. Individuals express themselves freely on these platforms, often without constraint, even when the content is offensive or contains hate speech. Identifying and removing offensive content from social media is therefore imperative to prevent individuals or groups from becoming targets of harmful language. Despite extensive research on offensive content detection, the problem remains open for code-mixed languages, which suffer from imbalanced datasets and limited data sources. Most previous studies on detecting offensive content in these languages focus on creating datasets and applying deep neural networks such as Recurrent Neural Networks (RNNs), or pre-trained language models (PLMs) such as BERT and its variants. Given the low-resource nature of these languages and their inherently imbalanced datasets, this study investigates the efficacy of a syntax-aware BERT model with continual pre-training for the accurate identification of offensive content, and proposes a framework, Cont-Syntax-BERT, that combines continual learning with continual pre-training. Comprehensive experimental results demonstrate that the proposed Cont-Syntax-BERT framework outperforms state-of-the-art approaches. Notably, the framework addresses the challenges posed by code-mixed languages, as evidenced by its performance on the DravidianCodeMix [10,19] and HASOC 2019 [37] datasets, demonstrating its adaptability in effectively handling code-mixed text.
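To make the two-stage recipe in the abstract concrete, below is a minimal sketch using the Hugging Face transformers and datasets libraries: stage 1 continues masked-language-model pre-training of a BERT checkpoint on unlabeled code-mixed text, and stage 2 fine-tunes the adapted encoder as an offensive-content classifier. This is not the authors' released code: the starting checkpoint, file names, and hyperparameters are illustrative assumptions, and the paper's syntax-aware component (injecting syntactic information into BERT) is omitted for brevity.

```python
# Sketch of continual pre-training followed by fine-tuning, under the
# assumptions stated above. File names and hyperparameters are hypothetical.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "bert-base-multilingual-cased"  # assumed multilingual starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Stage 1: continual pre-training -- keep training the masked-LM objective on
# unlabeled code-mixed text so the encoder adapts to the target language mix.
corpus = load_dataset("text", data_files={"train": "codemixed_corpus.txt"})["train"]
corpus = corpus.map(tokenize, batched=True, remove_columns=["text"])
mlm = AutoModelForMaskedLM.from_pretrained(BASE)
Trainer(
    model=mlm,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm_probability=0.15),
).train()
mlm.save_pretrained("bert-codemixed-adapted")
tokenizer.save_pretrained("bert-codemixed-adapted")

# Stage 2: fine-tune the adapted encoder as a binary offensive-content
# classifier on labeled examples (CSV with "text" and "label" columns assumed).
labeled = load_dataset("csv", data_files={"train": "offensive_train.csv"})["train"]
labeled = labeled.map(tokenize, batched=True)
clf = AutoModelForSequenceClassification.from_pretrained(
    "bert-codemixed-adapted", num_labels=2)
Trainer(
    model=clf,
    args=TrainingArguments(output_dir="clf-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=labeled,
    tokenizer=tokenizer,  # supplies a padding collator for variable-length batches
).train()
```

The appeal of this recipe in low-resource, imbalanced settings is that stage 1 needs only unlabeled code-mixed text, which is far cheaper to collect than labeled offensive-content annotations, so the encoder can adapt to the language mix before the scarce labels are spent on classification.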
Journal Introduction:
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
Papers dealing with theory, systems design, evaluation, and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.