Data expansion using back translation and paraphrasing for hate speech detection

Online Social Networks and Media (IF 2.9, Q1, Social Sciences) | Pub Date: 2021-07-01 | Epub Date: 2021-06-23 | DOI: 10.1016/j.osnem.2021.100153

Djamila Romaissa Beddiar, Md Saroar Jahan, Mourad Oussalah

Citations: 43

Abstract

With the proliferation of user-generated content on social media platforms, establishing mechanisms to automatically identify toxic and abusive content has become a prime concern for regulators, researchers, and society. Keeping the balance between freedom of speech and respect for each other's dignity is a major concern of social media platform regulators. Although automatic detection of offensive content using deep learning approaches seems to provide encouraging results, training deep learning-based models requires large amounts of high-quality labeled data, which are often missing. In this regard, we present in this paper a new deep learning-based method that fuses a back translation method and a paraphrasing technique for data augmentation. Our pipeline investigates different word-embedding-based architectures for classification of hate speech. The back translation technique relies on an encoder–decoder architecture pre-trained on a large corpus and mostly used for machine translation. In addition, paraphrasing exploits the transformer model and a mixture of experts to generate diverse paraphrases. Finally, LSTM and CNN are compared to seek enhanced classification results. We evaluate our proposal on five publicly available datasets, namely the AskFm corpus, the Formspring dataset, the Warner and Waseem dataset, OLID, and the Wikipedia toxic comments dataset. The performance of the proposal, together with a comparison to some related state-of-the-art results, demonstrates the effectiveness and soundness of our proposal.
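The data expansion idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give model names, pivot languages, or filtering rules, so everything below is an assumption. The names `back_translator`, `expand`, and the `toy_*` functions are hypothetical, and the dictionary lookups merely stand in for the pre-trained encoder–decoder translator and the transformer-based paraphraser. The sketch only shows the general pattern: each labeled text is rewritten by back translation and by paraphrasing, every rewrite inherits the label of its source, and rewrites identical to a text already emitted are dropped.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Example = Tuple[str, str]        # (text, label)
Rewriter = Callable[[str], str]  # any text-to-text transformation


def back_translator(to_pivot: Rewriter, from_pivot: Rewriter) -> Rewriter:
    """Compose two translation functions into one back-translation step."""
    def rewrite(text: str) -> str:
        return from_pivot(to_pivot(text))
    return rewrite


def expand(dataset: Iterable[Example], rewriters: List[Rewriter]) -> List[Example]:
    """Return the original examples plus label-preserving rewrites.

    A candidate is kept only if its normalized text has not been emitted
    before, so round trips that change nothing do not inflate the dataset.
    """
    expanded: List[Example] = []
    seen = set()
    for text, label in dataset:
        candidates = [text] + [rewrite(text) for rewrite in rewriters]
        for candidate in candidates:
            key = candidate.strip().lower()
            if key and key not in seen:
                seen.add(key)
                expanded.append((candidate, label))
    return expanded


# Toy stand-ins (hypothetical): a real pipeline would call trained models here.
_EN_TO_PIVOT: Dict[str, str] = {"you": "du", "are": "bist", "stupid": "dumm"}
_PIVOT_TO_EN: Dict[str, str] = {"du": "you", "bist": "are", "dumm": "dumb"}
_SYNONYMS: Dict[str, str] = {"stupid": "foolish"}


def _map_words(text: str, table: Dict[str, str]) -> str:
    """Replace each lowercased word found in the table; keep the others."""
    return " ".join(table.get(word, word) for word in text.lower().split())


def toy_to_pivot(text: str) -> str:
    return _map_words(text, _EN_TO_PIVOT)


def toy_from_pivot(text: str) -> str:
    return _map_words(text, _PIVOT_TO_EN)


def toy_paraphrase(text: str) -> str:
    return _map_words(text, _SYNONYMS)


if __name__ == "__main__":
    data = [("you are stupid", "offensive"), ("nice day", "neutral")]
    rewriters = [back_translator(toy_to_pivot, toy_from_pivot), toy_paraphrase]
    for text, label in expand(data, rewriters):
        print(label, "|", text)
```

Passing the rewriters as plain callables keeps the expansion step independent of any particular translation or paraphrasing model, so the expanded list can be fed to whichever classifier (for example an LSTM or a CNN over word embeddings) is being compared.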
