{"title":"Context in abusive language detection: On the interdependence of context and annotation of user comments","authors":"Holly Lopez, Sandra Kübler","doi":"10.1016/j.dcm.2024.100848","DOIUrl":null,"url":null,"abstract":"<div><div>One of the challenges for automated abusive language detection is combating unintended bias, which can be easily introduced through the annotation process, especially when what is (not) considered abusive is subjective and heavily context dependent. Our study incorporates a fine-grained, socio-pragmatic perspective to data modeling by taking into consideration contextual elements that impact the quality of abusive language corpora. We use a fine-grained annotation scheme that distinguishes between different types of non-abuse along with explicit and implicit abuse. We include the following non-abusive categories: meta, casual profanity, argumentative language, irony, and non-abusive language. Experts and minimally trained annotators use this scheme to manually re-annotate instances originally considered abusive by crowdsourced annotators in a standard corpus. After re-annotation, we investigate discrepancies between experts and minimally trained annotators. Our investigation shows that minimally trained annotators have difficulty interpreting contextual aspects and distinguishing between content performing abuse and content about abuse or instances of casual profanity. It also demonstrates how missing information or contextualization cues are often a source of disagreement across all types of annotators and poses a significant challenge for developing robust, nuanced corpora and annotation guidelines for abusive language detection.</div></div>","PeriodicalId":46649,"journal":{"name":"Discourse Context & Media","volume":"63 ","pages":"Article 100848"},"PeriodicalIF":2.3000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Discourse Context & Media","FirstCategoryId":"98","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2211695824000941","RegionNum":2,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMMUNICATION","Score":null,"Total":0}
引用次数: 0
Abstract
One of the challenges for automated abusive language detection is combating unintended bias, which can easily be introduced through the annotation process, especially when what is (not) considered abusive is subjective and heavily context dependent. Our study incorporates a fine-grained, socio-pragmatic perspective into data modeling by taking into consideration contextual elements that impact the quality of abusive language corpora. We use a fine-grained annotation scheme that distinguishes between different types of non-abuse along with explicit and implicit abuse. We include the following non-abusive categories: meta, casual profanity, argumentative language, irony, and non-abusive language. Experts and minimally trained annotators use this scheme to manually re-annotate instances originally considered abusive by crowdsourced annotators in a standard corpus. After re-annotation, we investigate discrepancies between experts and minimally trained annotators. Our investigation shows that minimally trained annotators have difficulty interpreting contextual aspects and distinguishing between content performing abuse and content about abuse or instances of casual profanity. It also demonstrates how missing information or contextualization cues are often a source of disagreement across all types of annotators and pose a significant challenge for developing robust, nuanced corpora and annotation guidelines for abusive language detection.
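The abstract does not specify how annotator discrepancies are quantified; the following is a minimal, hypothetical sketch of how expert and minimally trained annotations over the fine-grained scheme could be compared, using chance-corrected agreement and a confusion matrix. The category names, label lists, and use of scikit-learn are illustrative assumptions, not the authors' actual setup.

```python
# Illustrative sketch (not the paper's code): compare expert labels with
# labels from minimally trained annotators over a fine-grained scheme.
# Category names and example label sequences are hypothetical.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

CATEGORIES = [
    "explicit_abuse", "implicit_abuse", "meta", "casual_profanity",
    "argumentative", "irony", "non_abusive",
]

# Hypothetical re-annotations of the same five comments, one label each.
expert_labels = ["explicit_abuse", "casual_profanity", "meta", "irony", "non_abusive"]
trained_labels = ["explicit_abuse", "explicit_abuse", "explicit_abuse", "irony", "non_abusive"]

# Chance-corrected agreement between the two annotator groups.
kappa = cohen_kappa_score(expert_labels, trained_labels, labels=CATEGORIES)

# The confusion matrix shows which non-abusive categories (e.g., meta,
# casual profanity) the minimally trained annotators collapse into abuse.
cm = confusion_matrix(expert_labels, trained_labels, labels=CATEGORIES)

print(f"Cohen's kappa: {kappa:.2f}")
for row_label, row in zip(CATEGORIES, cm):
    print(f"{row_label:>18}: {row}")
```

A confusion matrix laid out over the full category set makes the abstract's qualitative finding inspectable: systematic off-diagonal mass from "meta" or "casual_profanity" into the abuse categories would indicate exactly the kind of contextual misreading the study reports.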