An empirical evaluation of text representation schemes to filter the social media stream

IF 1.7, Zone 4 (Computer Science), Q3 (Computer Science, Artificial Intelligence). Journal of Experimental & Theoretical Artificial Intelligence. Pub Date: 2021-04-24. DOI: 10.1080/0952813X.2021.1907792

Sandip J Modha, Prasenjit Majumder, Thomas Mandl

Journal of Experimental & Theoretical Artificial Intelligence, pp. 499-525.

Citations: 8

Abstract

Representing text numerically is a fundamental step for any downstream Natural Language Processing task, such as text classification. This paper studies the effectiveness of text representation schemes on a text classification task: aggressive text detection, a special case of hate speech detection on social media. Aggression levels are categorized into three predefined classes: ‘Non-aggressive’ (NAG), ‘Overtly Aggressive’ (OAG), and ‘Covertly Aggressive’ (CAG). Various text representation schemes based on bag-of-words (BoW) techniques, word embeddings, contextual word embeddings, and sentence embeddings are compared on traditional classifiers and deep neural models. The weighted F1-score is used as the primary evaluation metric. On the English dataset, the results show that text representation using Google's Universal Sentence Encoder (USE) performs better than word embeddings and BoW techniques on traditional classifiers such as SVM, while pre-trained word embedding models perform better on classifiers based on deep neural models. Recent pre-trained transfer learning models such as ELMo, ULMFiT, and BERT are fine-tuned for the aggression classification task; however, their results are not on par with those of the pre-trained word embedding models. Overall, word embedding using pre-trained fastText vectors produces a better weighted F1-score than Word2Vec and GloVe. On the Hindi dataset, BoW techniques perform better than word embeddings on traditional classifiers such as SVM, whereas pre-trained word embedding models perform better on classifiers based on deep neural networks. Statistical significance tests are employed to verify the significance of the classification results. Deep neural models are more robust against the bias induced by the training dataset: on the Twitter test dataset, they perform substantially better than traditional classifiers such as SVM, logistic regression, and Naive Bayes.
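The abstract names a weighted score as its primary evaluation metric. Assuming this refers to the support-weighted F1-score (per-class F1 averaged with weights proportional to each class's share of the true labels), a minimal Python sketch of the computation might look like the following. The function names and the toy labels are hypothetical and invented for illustration; only the three class names come from the abstract, and none of the numbers are results from the paper.

```python
from collections import Counter

LABELS = ("NAG", "OAG", "CAG")  # class names taken from the abstract


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    # Convention assumed here: a class with no true or predicted items scores 0.
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred, labels=LABELS):
    """Average the per-class F1 scores, weighting each class by its support."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labelled example is required")
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    return sum(support[c] / total * per_class_f1(y_true, y_pred, c) for c in labels)


# Toy labels, invented for illustration only (not data from the paper).
gold = ["NAG", "NAG", "NAG", "OAG", "OAG", "CAG"]
pred = ["NAG", "NAG", "OAG", "OAG", "CAG", "CAG"]
score = weighted_f1(gold, pred)
print(f"weighted F1 = {score:.3f}")
```

Weighting by support means that frequent classes dominate the score, which is one reason this average is often reported alongside per-class figures when class sizes are imbalanced.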




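Source journal

Journal of Experimental & Theoretical Artificial Intelligence (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore

6.10

Self-citation rate

4.50%

Articles published

Review time

>12 weeks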

About the journal: Journal of Experimental & Theoretical Artificial Intelligence (JETAI) is a world-leading journal dedicated to publishing high-quality, rigorously reviewed, original papers in artificial intelligence (AI) research. The journal features work in all subfields of AI research and accepts both theoretical and applied research. Topics covered include, but are not limited to, the following:
• cognitive science
• games
• learning
• knowledge representation
• memory and neural system modelling
• perception
• problem-solving