Curating Cyberbullying Datasets: a Human-AI Collaborative Approach.

Christopher E Gomez, Marcelo O Sztainberg, Rachel E Trana
Journal: International journal of bullying prevention : an official publication of the International Bullying Prevention Association
DOI: 10.1007/s42380-021-00114-6
Published: 2022-01-01 (Epub 2021-12-22)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8691962/pdf/
Cited by: 2

Abstract


Cyberbullying is the use of digital communication tools and spaces to inflict physical, mental, or emotional distress. This serious form of aggression is frequently targeted at, but not limited to, vulnerable populations. A common problem when creating machine learning models to identify cyberbullying is the availability of accurately annotated, reliable, relevant, and diverse datasets. Datasets intended to train models for cyberbullying detection are typically annotated by human participants, which can introduce the following issues: (1) annotator bias, (2) incorrect annotation due to language and cultural barriers, and (3) multiple valid labels for a given comment arising from the inherent subjectivity of the task. The result can be an inadequate dataset affected by one or more of these overlapping issues. We propose two machine learning approaches to identify and filter unambiguous comments in a cyberbullying dataset of roughly 19,000 comments collected from YouTube that was initially annotated using Amazon Mechanical Turk (AMT). Using consensus filtering methods, comments were classified as unambiguous when the AMT workers' majority label agreed with the unanimous algorithmic filtering label. Comments identified as unambiguous were extracted and used to curate new datasets. We then used an artificial neural network to evaluate performance on these datasets. Compared to the original dataset, the classifier exhibits a large improvement in performance on the modified versions of the dataset and can yield insight into the type of data that is consistently classified as bullying or non-bullying. This annotation approach can be extended from cyberbullying datasets to any classification corpus of similar complexity and scope.
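The consensus-filtering step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout, label names, and the `consensus_filter` helper are all hypothetical. A comment is kept only when the majority vote over the human (AMT) annotations agrees with a unanimous label from the algorithmic filters.

```python
from collections import Counter

def consensus_filter(comments):
    """Keep only comments whose AMT majority label agrees with a
    unanimous label from the algorithmic filters (hypothetical sketch)."""
    unambiguous = []
    for c in comments:
        # Majority vote over the human (AMT) annotations.
        majority_label, _ = Counter(c["amt_labels"]).most_common(1)[0]
        # The algorithmic filters must agree among themselves (unanimity)
        # and with the human majority label.
        algo = set(c["algo_labels"])
        if len(algo) == 1 and algo.pop() == majority_label:
            unambiguous.append({"text": c["text"], "label": majority_label})
    return unambiguous

comments = [
    # Humans lean "bully"; both algorithmic filters agree -> unambiguous.
    {"text": "comment A", "amt_labels": ["bully", "bully", "none"],
     "algo_labels": ["bully", "bully"]},
    # Algorithmic filters disagree with each other -> ambiguous, dropped.
    {"text": "comment B", "amt_labels": ["none", "bully", "none"],
     "algo_labels": ["bully", "none"]},
]

kept = consensus_filter(comments)
```

Under this sketch, only "comment A" survives the filter; the curated dataset would then contain just the comments on which humans and algorithms converge.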

Journal metrics: CiteScore 5.40 · Self-citation rate 0.00% · Articles published: 0
Latest articles in this journal:

- Psychosocial Well-being, Problematic Social Media Use, and Cyberbullying Involvement Among Mongolian Adolescents
- Systematic Review of Intervention and Prevention Programs to Tackle Homophobic Bullying at School: a Socio-emotional Learning Skills Perspective
- Teacher Identity and Bullying—Perspectives from Teachers During Bullying Prevention Professional Development
- Can Job Demands and Job Resources Predict Bystander Behaviour in Workplace Bullying? A Longitudinal Study
- Revisiting the Definition of Bullying in the Context of Higher Education