Detecting offensive tweets via topical feature discovery over a large scale twitter corpus

Proceedings of the 21st ACM International Conference on Information and Knowledge Management. Pub Date: 2012-10-29. DOI: 10.1145/2396761.2398556

Guang Xiang, Bin Fan, Ling Wang, Jason I. Hong, C. Rosé

Citations: 254

Abstract

In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content in Twitter. Our approach exploits linguistic regularities in profane language via statistical topic modeling on a huge Twitter corpus, and detects offensive tweets using these automatically generated features. Our approach performs competitively with a variety of machine learning (ML) algorithms. For instance, our approach achieves a true positive rate (TP) of 75.1% over 4029 testing tweets using Logistic Regression, significantly outperforming the popular keyword matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline at about 3.77%. Our approach provides an alternative to the large-scale hand annotation efforts required by fully supervised learning approaches.
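To make the reported metrics concrete, the following is a minimal sketch of a keyword matching baseline and of how a true positive rate and false positive rate are computed from binary predictions. It is not the authors' implementation: the lexicon, the tokenization, the function names `keyword_baseline` and `tp_fp_rates`, and the toy tweets are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical placeholder lexicon; the actual keyword list used by the
# baseline in the paper is not given in the abstract.
KEYWORDS = {"badword", "slur"}


def keyword_baseline(tweet, keywords=KEYWORDS):
    """Return 1 if any whole token of the tweet is a listed keyword, else 0."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return int(any(tok in keywords for tok in tokens))


def tp_fp_rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Made-up examples (1 = offensive, 0 = not offensive); not data from the paper.
tweets = [
    "you are a badword",
    "what a lovely day",
    "BADWORD!!! again",
    "that slurry pump broke",
    "subtle insult with no listed term",
]
labels = [1, 0, 1, 0, 1]

preds = [keyword_baseline(t) for t in tweets]
tpr, fpr = tp_fp_rates(labels, preds)
print(f"TP rate = {tpr:.3f}, FP rate = {fpr:.3f}")
```

In this toy setup the last tweet is labeled offensive but contains no listed keyword, so the baseline misses it. That kind of miss is what lowers a keyword matcher's true positive rate.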


