SSL-GAN-RoBERTa: A robust semi-supervised model for detecting Anti-Asian COVID-19 hate speech on social media

IF 2.3 · CAS Tier 3 (Computer Science) · Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Natural Language Engineering · Pub Date: 2023-08-03 · DOI: 10.1017/s1351324923000396
Xuanyu Su, Yansong Li, Paula Branco, D. Inkpen
{"title":"SSL-GAN-RoBERTa: A robust semi-supervised model for detecting Anti-Asian COVID-19 hate speech on social media","authors":"Xuanyu Su, Yansong Li, Paula Branco, D. Inkpen","doi":"10.1017/s1351324923000396","DOIUrl":null,"url":null,"abstract":"\n Anti-Asian speech during the COVID-19 pandemic has been a serious problem with severe consequences. A hate speech wave swept social media platforms. The timely detection of Anti-Asian COVID-19-related hate speech is of utmost importance, not only to allow the application of preventive mechanisms but also to anticipate and possibly prevent other similar discriminatory situations. In this paper, we address the problem of detecting Anti-Asian COVID-19-related hate speech from social media data. Previous approaches that tackled this problem used a transformer-based model, BERT/RoBERTa, trained on the homologous annotated dataset and achieved good performance on this task. However, this requires extensive and annotated datasets with a strong connection to the topic. Both goals are difficult to meet without employing reliable, vast, and costly resources. In this paper, we propose a robust semi-supervised model, SSL-GAN-RoBERTa, that learns from a limited heterogeneous dataset and whose performance is further enhanced by using vast amounts of unlabeled data from another related domain. Compared with the RoBERTa baseline model, the experimental results show that the model has substantial performance gains in terms of Accuracy and Macro-F1 score in different scenarios that use data from different domains. Our proposed model achieves state-of-the-art performance results while efficiently using unlabeled data, showing promising applicability to other complex classification tasks where large amounts of labeled examples are difficult to obtain.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":" ","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2023-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1017/s1351324923000396","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Anti-Asian speech during the COVID-19 pandemic has been a serious problem with severe consequences. A wave of hate speech swept social media platforms. The timely detection of Anti-Asian COVID-19-related hate speech is of utmost importance, not only to allow the application of preventive mechanisms but also to anticipate and possibly prevent other similar discriminatory situations. In this paper, we address the problem of detecting Anti-Asian COVID-19-related hate speech from social media data. Previous approaches to this problem used transformer-based models, BERT/RoBERTa, trained on homologous annotated datasets, and achieved good performance on the task. However, this requires extensive annotated datasets with a strong connection to the topic, both of which are difficult to obtain without reliable, vast, and costly resources. In this paper, we propose a robust semi-supervised model, SSL-GAN-RoBERTa, that learns from a limited heterogeneous dataset and whose performance is further enhanced by using vast amounts of unlabeled data from another related domain. The experimental results show that, compared with the RoBERTa baseline model, our model achieves substantial gains in Accuracy and Macro-F1 score across scenarios that use data from different domains. Our proposed model achieves state-of-the-art performance while efficiently using unlabeled data, showing promising applicability to other complex classification tasks where large amounts of labeled examples are difficult to obtain.
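The abstract describes the approach only at a high level. For intuition, the sketch below shows a GAN-BERT-style semi-supervised architecture of the kind the model name suggests: RoBERTa encodes each post, a generator produces fake representations from noise, and a discriminator classifies representations into the real classes plus an extra "fake" class, which is what allows unlabeled posts to contribute to training. All names, layer sizes, and hyperparameters here are illustrative assumptions, not the authors' released implementation.

```python
# Minimal, hypothetical sketch of an SSL-GAN-RoBERTa-style setup
# (a GAN-BERT-like semi-supervised head on top of RoBERTa).
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class Generator(nn.Module):
    """Maps random noise vectors to fake sentence-level representations."""
    def __init__(self, noise_dim: int = 100, hidden_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, noise: torch.Tensor) -> torch.Tensor:
        return self.net(noise)

class Discriminator(nn.Module):
    """Classifies a representation into K real classes plus one 'fake' class."""
    def __init__(self, hidden_dim: int = 768, num_real_classes: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.1),
        )
        self.classifier = nn.Linear(hidden_dim, num_real_classes + 1)

    def forward(self, reps: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.backbone(reps))  # logits over K+1 classes

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")
generator = Generator()
discriminator = Discriminator(num_real_classes=2)  # e.g. hate vs. non-hate

texts = ["an example labeled post", "an example unlabeled post"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # <s>-token (CLS-style) representation from RoBERTa for each input text
    cls_reps = encoder(**batch).last_hidden_state[:, 0, :]

real_logits = discriminator(cls_reps)                        # real texts -> K+1 logits
fake_logits = discriminator(generator(torch.randn(2, 100)))  # fake reps  -> K+1 logits

# Training (omitted here): labeled examples receive a supervised loss over the
# K real classes, unlabeled examples are only pushed away from the 'fake'
# class, and the generator is updated adversarially to fool the discriminator.
```

Under a setup of this kind, unlabeled posts only need to be recognised as "not fake", so large unlabeled corpora from a related domain can shape the discriminator's decision surface even when the labeled set is small and heterogeneous, which matches the scenario the abstract describes.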
Source journal

Natural Language Engineering (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 5.90
Self-citation rate: 12.00%
Publication volume: 60
Review time: >12 weeks

Journal description: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multi modal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.