使用预训练语言模型进行有毒评论分类的比较研究

Companion Proceedings of the Web Conference 2021 Pub Date : 2021-04-19 DOI:10.1145/3442442.3452313

Zhixue Zhao, Ziqi Zhang, F. Hopfgartner

{"title":"使用预训练语言模型进行有毒评论分类的比较研究","authors":"Zhixue Zhao, Ziqi Zhang, F. Hopfgartner","doi":"10.1145/3442442.3452313","DOIUrl":null,"url":null,"abstract":"As user-generated contents thrive, so does the spread of toxic comment. Therefore, detecting toxic comment becomes an active research area, and it is often handled as a text classification task. As recent popular methods for text classification tasks, pre-trained language model-based methods are at the forefront of natural language processing, achieving state-of-the-art performance on various NLP tasks. However, there is a paucity in studies using such methods on toxic comment classification. In this work, we study how to best make use of pre-trained language model-based methods for toxic comment classification and the performances of different pre-trained language models on these tasks. Our results show that, Out of the three most popular language models, i.e. BERT, RoBERTa, and XLM, BERT and RoBERTa generally outperform XLM on toxic comment classification. We also prove that using a basic linear downstream structure outperforms complex ones such as CNN and BiLSTM. What is more, we find that further fine-tuning a pre-trained language model with light hyper-parameter settings brings improvements to the downstream toxic comment classification task, especially when the task has a relatively small dataset.","PeriodicalId":129420,"journal":{"name":"Companion Proceedings of the Web Conference 2021","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":"{\"title\":\"A Comparative Study of Using Pre-trained Language Models for Toxic Comment Classification\",\"authors\":\"Zhixue Zhao, Ziqi Zhang, F. Hopfgartner\",\"doi\":\"10.1145/3442442.3452313\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As user-generated contents thrive, so does the spread of toxic comment. Therefore, detecting toxic comment becomes an active research area, and it is often handled as a text classification task. As recent popular methods for text classification tasks, pre-trained language model-based methods are at the forefront of natural language processing, achieving state-of-the-art performance on various NLP tasks. However, there is a paucity in studies using such methods on toxic comment classification. In this work, we study how to best make use of pre-trained language model-based methods for toxic comment classification and the performances of different pre-trained language models on these tasks. Our results show that, Out of the three most popular language models, i.e. BERT, RoBERTa, and XLM, BERT and RoBERTa generally outperform XLM on toxic comment classification. We also prove that using a basic linear downstream structure outperforms complex ones such as CNN and BiLSTM. What is more, we find that further fine-tuning a pre-trained language model with light hyper-parameter settings brings improvements to the downstream toxic comment classification task, especially when the task has a relatively small dataset.\",\"PeriodicalId\":129420,\"journal\":{\"name\":\"Companion Proceedings of the Web Conference 2021\",\"volume\":\"70 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"20\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Companion Proceedings of the Web Conference 2021\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3442442.3452313\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Proceedings of the Web Conference 2021","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3442442.3452313","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 20

摘要

随着用户生成内容的蓬勃发展，有毒评论的传播也在迅速发展。因此，有毒评论的检测成为一个活跃的研究领域，通常作为文本分类任务来处理。作为最近流行的文本分类方法，基于预训练语言模型的方法处于自然语言处理的前沿，在各种自然语言处理任务中实现了最先进的性能。然而，将这种方法用于毒性评论分类的研究还很缺乏。在这项工作中，我们研究了如何最好地利用基于预训练语言模型的方法进行有毒评论分类，以及不同预训练语言模型在这些任务上的性能。我们的结果表明，在BERT、RoBERTa和XLM这三种最流行的语言模型中，BERT和RoBERTa在有毒评论分类上的表现普遍优于XLM。我们还证明了使用基本线性下游结构优于CNN和BiLSTM等复杂结构。更重要的是，我们发现进一步微调具有轻超参数设置的预训练语言模型可以改善下游有毒评论分类任务，特别是当任务具有相对较小的数据集时。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

A Comparative Study of Using Pre-trained Language Models for Toxic Comment Classification

As user-generated contents thrive, so does the spread of toxic comment. Therefore, detecting toxic comment becomes an active research area, and it is often handled as a text classification task. As recent popular methods for text classification tasks, pre-trained language model-based methods are at the forefront of natural language processing, achieving state-of-the-art performance on various NLP tasks. However, there is a paucity in studies using such methods on toxic comment classification. In this work, we study how to best make use of pre-trained language model-based methods for toxic comment classification and the performances of different pre-trained language models on these tasks. Our results show that, Out of the three most popular language models, i.e. BERT, RoBERTa, and XLM, BERT and RoBERTa generally outperform XLM on toxic comment classification. We also prove that using a basic linear downstream structure outperforms complex ones such as CNN and BiLSTM. What is more, we find that further fine-tuning a pre-trained language model with light hyper-parameter settings brings improvements to the downstream toxic comment classification task, especially when the task has a relatively small dataset.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Companion Proceedings of the Web Conference 2021

自引率

0.00%

发文量