Exploring Data Augmentation for Gender-Based Hate Speech Detection

Muhammad Amien Ibrahim, Samsul Arifin, Eko Setyo Purwanto
Journal of Computer Science. Journal Article. Published 2023-10-01.
DOI: 10.3844/jcssp.2023.1222.1230 (https://doi.org/10.3844/jcssp.2023.1222.1230)

Abstract

Social media moderation is crucial for establishing healthy online communities and protecting users from hate speech and offensive language. In many cases, hate speech targets a specific gender and may be expressed in many different languages on social media platforms such as Indonesian Twitter. However, data scarcity and imbalance in the gender-based hate speech dataset of Indonesian tweets have slowed the development and implementation of automatic social media moderation. Obtaining more samples can be costly in terms of the resources required to gather and annotate the data. This study examines the use of data augmentation methods to increase the size of a textual dataset while maintaining the quality of the augmented data. Three augmentation strategies are explored: random insertion, back translation, and a sequential combination of back translation and random insertion. The study also examines whether the augmented data preserve their original labels. The results show that classification models trained on augmented data generated by the random insertion strategy outperform those trained with the other approaches. In terms of label preservation, all three augmentation approaches are shown to preserve labels sufficiently without compromising the meaning of the augmented data. The findings suggest that, by increasing the size of the dataset while preserving the original labels, data augmentation can help address issues such as data scarcity and dataset imbalance.
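As a rough illustration of the three strategies named in the abstract, the following Python sketch shows one way they could be wired together. The abstract does not give implementation details, so everything here is an assumption: the toy `SYNONYMS` table, the `to_pivot` and `from_pivot` translator callables (stand-ins for whatever translation system is used), the synonym-based form of random insertion, and all function names are hypothetical and are not the authors' code.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy synonym table (assumption). A real setup would need a
# proper Indonesian lexical resource instead of this two-entry example.
SYNONYMS: Dict[str, List[str]] = {
    "bagus": ["baik", "indah"],
    "cepat": ["lekas", "kilat"],
}


def random_insertion(text: str, n: int,
                     synonyms: Dict[str, List[str]],
                     rng: random.Random) -> str:
    """Insert n synonyms of words from the original text at random positions.

    If no word in the text has a known synonym, the text is returned unchanged.
    """
    tokens = text.split()
    candidates = [t for t in tokens if t.lower() in synonyms]
    if not candidates:
        return text
    for _ in range(n):
        source = rng.choice(candidates)
        new_word = rng.choice(synonyms[source.lower()])
        # randint is inclusive on both ends, so the word may also be appended.
        tokens.insert(rng.randint(0, len(tokens)), new_word)
    return " ".join(tokens)


def back_translation(text: str,
                     to_pivot: Callable[[str], str],
                     from_pivot: Callable[[str], str]) -> str:
    """Translate to a pivot language and back, using caller-supplied translators."""
    return from_pivot(to_pivot(text))


def back_translation_then_insertion(text: str,
                                    to_pivot: Callable[[str], str],
                                    from_pivot: Callable[[str], str],
                                    n: int,
                                    synonyms: Dict[str, List[str]],
                                    rng: random.Random) -> str:
    """Sequential combination: back translation first, then random insertion."""
    paraphrase = back_translation(text, to_pivot, from_pivot)
    return random_insertion(paraphrase, n, synonyms, rng)


def augment_dataset(samples: List[Tuple[str, int]],
                    augment: Callable[[str], str]) -> List[Tuple[str, int]]:
    """Return the originals plus one augmented copy of each sample.

    Each augmented copy simply inherits the label of its source sample;
    whether that label is still valid is what a label-preservation check
    would need to verify.
    """
    return samples + [(augment(text), label) for text, label in samples]
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, and taking the translators as plain callables keeps the sketch independent of any particular translation service.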
Source journal: Journal of Computer Science (Computer Science: Computer Networks and Communications)
CiteScore: 1.70
Self-citation rate: 0.00%
Articles published: 92
Journal description: Journal of Computer Science aims to publish research articles on the theoretical foundations of information and computation, and on practical techniques for their implementation and application in computer systems. JCS is a peer-reviewed journal, updated twelve times a year, that covers the latest and most compelling research of the time.