Data expansion using back translation and paraphrasing for hate speech detection

Online Social Networks and Media (Q1, Social Sciences) · Pub Date: 2021-07-01 · DOI: 10.1016/j.osnem.2021.100153
Djamila Romaissa Beddiar, Md Saroar Jahan, Mourad Oussalah
Cited by: 43

Abstract

With the proliferation of user-generated content on social media platforms, establishing mechanisms to automatically identify toxic and abusive content has become a prime concern for regulators, researchers, and society at large. Balancing freedom of speech against respect for each other's dignity is a major concern of social media platform regulators. Although automatic detection of offensive content using deep learning approaches yields encouraging results, training deep learning-based models requires large amounts of high-quality labeled data, which are often unavailable. In this regard, we present a new deep learning-based method that fuses a back translation method and a paraphrasing technique for data augmentation. Our pipeline investigates different word-embedding-based architectures for hate speech classification. The back translation technique relies on an encoder–decoder architecture pre-trained on a large corpus and mostly used for machine translation. In addition, paraphrasing exploits the transformer model and a mixture of experts to generate diverse paraphrases. Finally, LSTM and CNN classifiers are compared to seek enhanced classification results. We evaluate our proposal on five publicly available datasets: the AskFm corpus, the Formspring dataset, the Warner and Waseem dataset, OLID, and the Wikipedia toxic comments dataset. The performance of the proposal, together with comparisons to related state-of-the-art results, demonstrates its effectiveness and soundness.
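The back-translation augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy word-substitution tables stand in for the pre-trained encoder–decoder translation model the abstract refers to, so the round-trip structure of the pipeline is runnable without any model downloads. All function names and the pivot-language vocabulary here are hypothetical.

```python
# Minimal sketch of back-translation data augmentation (illustrative only).
# A real system would call a pre-trained encoder-decoder MT model in both
# directions; here toy substitution tables play that role, so the round
# trip (English -> pivot -> English) still produces a lexical variant.

EN_TO_FR = {"hate": "haine", "speech": "discours", "is": "est", "toxic": "toxique"}
FR_TO_EN = {"haine": "hatred", "discours": "speech", "est": "est" == "" or "is", "toxique": "toxic"}

def translate(text, table):
    # Word-by-word substitution; a seq2seq model would be used in practice.
    return " ".join(table.get(w, w) for w in text.split())

def back_translate(text):
    # English -> pivot language -> English; the round trip yields a
    # paraphrase-like variant that keeps the original label.
    pivot = translate(text, EN_TO_FR)
    return translate(pivot, FR_TO_EN)

def augment(dataset):
    # Each (text, label) pair spawns one extra training example.
    return dataset + [(back_translate(t), y) for t, y in dataset]

data = [("hate speech is toxic", 1)]
print(augment(data))
# -> [('hate speech is toxic', 1), ('hatred speech is toxic', 1)]
```

The key design point is that back translation preserves the label while perturbing surface form, which is why the augmented pairs can be added directly to the training set for the downstream LSTM or CNN classifier.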

Source journal: Online Social Networks and Media (Social Sciences – Communication)
CiteScore: 10.60 · Self-citation rate: 0.00% · Articles per year: 32 · Review time: 44 days