Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text
{"title":"Faux Hate: unravelling the web of fake narratives in spreading hateful stories: a multi-label and multi-class dataset in cross-lingual Hindi-English code-mixed text","authors":"Shankar Biradar, Sunil Saumya, Arun Chauhan","doi":"10.1007/s10579-024-09732-0","DOIUrl":null,"url":null,"abstract":"<p>Social media has undeniably transformed the way people communicate; however, it also comes with unquestionable drawbacks, notably the proliferation of fake and hateful comments. Recent observations have indicated that these two issues often coexist, with discussions on hate topics frequently being dominated by the fake. Therefore, it has become imperative to explore the role of fake narratives in the dissemination of hate in contemporary times. In this direction, the proposed article introduces a novel data set known as the Faux Hate Multi-Label Data set (FHMLD) comprising 8014 fake-instigated hateful comments in Hindi-English code-mixed text. To the best of our knowledge, this marks the first endeavour to bring together both fake and hateful content within a unified framework. Further, the proposed data set is collected from diverse platforms such as YouTube and Twitter to mitigate user-associated bias. To investigate a relation between the presence of fake narratives and its impact on the intensity of the hate, this study presents a statistical analysis using the Chi-square test. The statistical findings indicate that the calculated <span>\\(\\chi ^2\\)</span> value is greater than the value from the standard table, leading to the rejection of the null hypothesis. Additionally, the current study present baseline methods for categorizing multi-class and multi-label data set, utilizing syntactical and semantic features at both word and sentence levels. The experimental results demonstrate that the fastText and SVM based method outperforms others models with an accuracy of 71% and 58% for binary fake–hate and severity prediction respectively.</p>","PeriodicalId":49927,"journal":{"name":"Language Resources and Evaluation","volume":"38 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Language Resources and Evaluation","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10579-024-09732-0","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0
Abstract
Social media has undeniably transformed the way people communicate; however, it also comes with unquestionable drawbacks, notably the proliferation of fake and hateful comments. Recent observations indicate that these two issues often coexist, with discussions on hate topics frequently dominated by fake narratives. It has therefore become imperative to explore the role of fake narratives in the dissemination of hate in contemporary times. In this direction, this article introduces a novel data set, the Faux Hate Multi-Label Data set (FHMLD), comprising 8014 fake-instigated hateful comments in Hindi-English code-mixed text. To the best of our knowledge, this marks the first endeavour to bring together fake and hateful content within a unified framework. Further, the data set is collected from diverse platforms such as YouTube and Twitter to mitigate user-associated bias. To investigate the relation between the presence of fake narratives and the intensity of hate, this study presents a statistical analysis using the Chi-square test. The statistical findings indicate that the calculated \(\chi ^2\) value is greater than the critical value from the standard table, leading to rejection of the null hypothesis. Additionally, the current study presents baseline methods for the multi-class and multi-label categorization tasks, utilizing syntactic and semantic features at both the word and sentence levels. The experimental results demonstrate that the fastText and SVM based method outperforms other models, with accuracies of 71% and 58% for binary fake-hate and severity prediction, respectively.
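The abstract describes two analyses at a high level: a Chi-square test relating the presence of fake narratives to hate intensity, and a fastText-embedding plus SVM baseline classifier. The sketch below is a minimal, illustrative reconstruction of that kind of pipeline, not the authors' code; the file names and column names ("text", "fake", "hate") are hypothetical assumptions.

```python
# Hedged sketch: (1) chi-square test of independence between fake and hate labels,
# (2) fastText sentence embeddings + SVM for binary hate prediction.
# Column names and file paths are assumptions, not from the paper.

import fasttext                      # pip install fasttext
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load a FHMLD-style CSV: code-mixed text with binary fake/hate labels (hypothetical layout).
df = pd.read_csv("fhmld.csv")        # assumed columns: text, fake, hate

# (1) Chi-square test: is the hate label independent of the fake label?
contingency = pd.crosstab(df["fake"], df["hate"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
# If chi2 exceeds the critical value (equivalently, p < 0.05), the null hypothesis
# of independence is rejected, mirroring the finding reported in the abstract.

# (2) fastText embeddings + SVM baseline for the binary hate label.
df["text"].to_csv("corpus.txt", index=False, header=False)          # plain-text corpus
ft_model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=100)

# get_sentence_vector expects single-line strings, so strip newlines first.
X = np.vstack([ft_model.get_sentence_vector(t.replace("\n", " ")) for t in df["text"]])
y = df["hate"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

The paper's actual baselines likely differ in features and tuning; this sketch only shows the general shape of a chi-square association test followed by an embedding-based SVM classifier.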
Journal Description:
Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications.
Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use.
Evaluation of language resources concerns assessing the state-of-the-art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.