SWSR: A Chinese dataset and lexicon for online sexism detection

Online Social Networks and Media (JCR Q1, Social Sciences) | Pub Date: 2022-01-01 | DOI: 10.1016/j.osnem.2021.100182
Aiqi Jiang, Xiaohan Yang, Yang Liu, Arkaitz Zubiaga
Citations: 31

Abstract

Online sexism is a growing concern on social media platforms: it undermines the healthy development of the Internet and can have negative effects on society. While research on sexism detection is growing, most of it focuses on English as the language and on Twitter as the platform. Our objective here is to broaden the scope of this research by considering the Chinese language on Sina Weibo. We propose the first Chinese sexism dataset, the Sina Weibo Sexism Review (SWSR) dataset, as well as a large Chinese lexicon, SexHateLex, made of abusive and gender-related terms. We introduce our data collection and annotation process, and provide an exploratory analysis of the dataset characteristics to validate its quality and to show how sexism is manifested in Chinese. The SWSR dataset provides labels at three levels of granularity: (i) sexism or non-sexism, (ii) sexism category and (iii) target type. These labels can be exploited, among other uses, for building computational methods to identify and investigate finer-grained gender-related abusive language. We conduct experiments for the three sexism classification tasks using state-of-the-art machine learning models. Our results show competitive performance, providing a benchmark for sexism detection in the Chinese language, and our error analysis highlights open challenges that need more research in Chinese NLP. The SWSR dataset and SexHateLex lexicon are publicly available.
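The three label levels and the lexicon described in the abstract can be pictured with a minimal sketch. It is not the authors' schema or code: the class `LabeledPost`, the helpers `task_labels` and `lexicon_hits`, the field names, and the placeholder values such as `CATEGORY_A`, `TARGET_X` and `term_a` are all hypothetical, since the abstract does not list the actual category names, target types or lexicon entries. The sketch also assumes that fine-grained labels apply only to posts marked as sexist, and that lexicon matching is done by plain substring search, which is one simple option for text that is not whitespace-delimited.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class LabeledPost:
    """One annotated post with labels at three levels (hypothetical schema)."""
    text: str
    is_sexist: bool                 # level (i): sexism or non-sexism
    category: Optional[str] = None  # level (ii): sexism category (placeholder values)
    target: Optional[str] = None    # level (iii): target type (placeholder values)

    def __post_init__(self) -> None:
        # Assumption: fine-grained labels are only meaningful for sexist posts.
        if not self.is_sexist and (self.category is not None or self.target is not None):
            raise ValueError("non-sexist posts must not carry category/target labels")


def task_labels(post: LabeledPost) -> Dict[str, Optional[str]]:
    """Return one label per classification task for a single post."""
    return {
        "binary": "sexism" if post.is_sexist else "non-sexism",
        "category": post.category,
        "target": post.target,
    }


def lexicon_hits(text: str, lexicon: Iterable[str]) -> List[str]:
    """Return the lexicon terms that occur in the text, by substring match."""
    return sorted(term for term in lexicon if term in text)


if __name__ == "__main__":
    post = LabeledPost("some text with term_a", True, "CATEGORY_A", "TARGET_X")
    print(task_labels(post))
    print(lexicon_hits(post.text, {"term_a", "term_b"}))
```

Keeping the three labels on one record lets the same data feed each of the three classification tasks. A substring lookup like `lexicon_hits` is only a crude baseline feature, not a substitute for the machine learning models the abstract mentions.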

Source journal: Online Social Networks and Media (Social Sciences, Communication)
CiteScore: 10.60
Self-citation rate: 0.00%
Articles published: 32
Review time: 44 days