Misogynistic attitude detection in YouTube comments and replies: A high-quality dataset and algorithmic models

IF 3.1 | Region 3 (Computer Science) | JCR Q2 (Computer Science, Artificial Intelligence) | Computer Speech and Language | Pub Date: 2024-06-22 | DOI: 10.1016/j.csl.2024.101682
Aakash Singh, Deepawali Sharma, Vivek Kumar Singh
Computer Speech and Language, volume 89, Article 101682.
Full text: https://www.sciencedirect.com/science/article/pii/S0885230824000652
Citations: 0

Abstract

Social media platforms are now not only a medium for expressing users' views, feelings, emotions and sentiments but are also abused to propagate unpleasant and hateful content. Consequently, research efforts have been made to develop techniques and models for automatically detecting hateful, abusive, vulgar, and offensive content on different platforms. Although significant progress has been made on this task, research on methods to detect misogynistic attitudes in non-English and code-mixed languages is not well developed. The lack of suitable datasets and resources is one main reason for this. This paper therefore attempts to bridge the gap by presenting a high-quality curated dataset in Hindi-English code-mixed language. The dataset includes 12,698 YouTube comments and replies, each annotated at two levels: first as optimistic or pessimistic, and then into different types at the second level based on content. The inter-annotator agreement is 0.84 for the first subtask and 0.79 for the second, indicating reasonably high annotation quality. Different algorithmic models are explored for automatically detecting the misogynistic attitude expressed in the comments, with the mBERT model performing best on both subtasks (reported macro-average F1 scores of 0.59 and 0.52, and weighted-average F1 scores of 0.66 and 0.65, respectively). The analysis and results suggest that the dataset can be used for further research on the topic and that the developed models can be applied to automatic detection of misogynistic attitudes in social media conversations and posts.
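The abstract reports both macro-average and weighted-average F1, and the macro figures are the lower of the two on each subtask. A minimal sketch of how the two averages differ is given below. It is not the authors' evaluation code: the function names, the labels "majority" and "minority", and the toy predictions are all hypothetical and chosen only to illustrate the arithmetic.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's share of the true labels."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical toy data: 8 majority-class items, 2 minority-class items,
# with one minority item misclassified as majority.
y_true = ["majority"] * 8 + ["minority"] * 2
y_pred = ["majority"] * 8 + ["majority", "minority"]

print("per-class F1:", per_class_f1(y_true, y_pred))
print("macro F1:    ", macro_f1(y_true, y_pred))
print("weighted F1: ", weighted_f1(y_true, y_pred))
```

On this toy data the macro average comes out below the weighted average, because the weaker minority-class score counts for half of the macro mean but only a fifth of the weighted one. A gap in the same direction in the reported scores would be consistent with weaker performance on less frequent classes, although the abstract itself does not state the class distribution.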

Source journal
Computer Speech and Language (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: 80
Review time: 22.9 weeks
Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.
Latest articles from this journal
Modeling correlated causal-effect structure with a hypergraph for document-level event causality identification
You Are What You Write: Author re-identification privacy attacks in the era of pre-trained language models
End-to-End Speech-to-Text Translation: A Survey
Corpus and unsupervised benchmark: Towards Tagalog grammatical error correction
TR-Net: Token Relation Inspired Table Filling Network for Joint Entity and Relation Extraction