Should We Translate? Evaluating Toxicity in Online Comments when Translating from Portuguese to English
Jordan K. Kobellarz, Thiago H. Silva
Proceedings of the Brazilian Symposium on Multimedia and the Web, published 2022-11-07
DOI: 10.1145/3539637.3556892 (https://doi.org/10.1145/3539637.3556892)
Citations: 3
Abstract
Social media and online discussion platforms suffer from the prevalence of uncivil behavior, such as harassment and abuse, and seek to curb toxic comments. There are several approaches to classifying toxic comments automatically. Some of them have more resources and are more advanced in English, which encourages translating text from other languages into English before classification. While researchers have shown evidence that this practice is appropriate for certain tasks, such as sentiment analysis, little is known about it in the context of toxicity identification. In this research, we assess the performance of Perspective API, a freely available model for toxic language detection that has been adopted by several well-known news media sites to identify different toxicity classes in online comments. As a use case, we obtained comments in Portuguese from two Brazilian news media websites during a politically polarized situation. This dataset was then translated to English and compared to four baseline datasets: two composed of highly toxic comments, one in Portuguese and the other in English, and two composed of neutral comments, also one in Portuguese and the other in English, all of them in their original language, not translated. Finally, human-annotated comments from the news comments dataset were analyzed to assess the scores provided by the Perspective API for the original and the translated versions. Results indicate that keeping the texts in their original language is preferable, even when comparing different languages. Nevertheless, if the translated version is strictly necessary, we suggest ways of dealing with the situation that preserve as much information as possible from the original version.
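The comparison described in the abstract, scoring each comment once in Portuguese and once in its English translation, can be illustrated with a minimal Python sketch. This is not the authors' code. The request body and response path follow the publicly documented Perspective API `comments:analyze` format; the helper names (`build_request`, `toxicity_score`, `mean_score_shift`) and the choice of the mean difference as a summary statistic are assumptions made here for illustration. No network call is made, so an API key is not needed to run it.

```python
from statistics import mean

# Documented endpoint of the Perspective API analyze method.
# A real request would POST the body below to this URL with ?key=<API key>.
ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text, language):
    """Build the JSON body that asks for the TOXICITY attribute of one comment.

    `language` is a language code such as "pt" (original) or "en" (translated).
    """
    return {
        "comment": {"text": text},
        "languages": [language],
        "requestedAttributes": {"TOXICITY": {}},
    }


def toxicity_score(response):
    """Extract the summary TOXICITY score (between 0 and 1) from a parsed response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def mean_score_shift(pairs):
    """Mean of (translated score - original score) over paired comments.

    This statistic is an illustrative assumption, not the paper's metric.
    A negative value means translation lowered the scores on average.
    """
    return mean(translated - original for original, translated in pairs)
```

Given a list of `(original_score, translated_score)` pairs collected this way, `mean_score_shift` gives one rough indication of whether translation systematically raises or lowers the reported toxicity. Any scores passed to it in a test or demo are placeholders, not results from the paper.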