An empirical evaluation of text representation schemes to filter the social media stream

Journal of Experimental & Theoretical Artificial Intelligence | IF 1.7 | CAS Zone 4 (Computer Science) | JCR Q3 (Computer Science, Artificial Intelligence) | Pub Date: 2021-04-24 | DOI: 10.1080/0952813X.2021.1907792
Sandip J Modha, Prasenjit Majumder, Thomas Mandl
{"title":"An empirical evaluation of text representation schemes to filter the social media stream","authors":"Sandip J Modha, Prasenjit Majumder, Thomas Mandl","doi":"10.1080/0952813X.2021.1907792","DOIUrl":null,"url":null,"abstract":"ABSTRACT Modeling text in a numerical representation is a prime task for any Natural Language Processing downstream task such as text classification. This paper attempts to study the effectiveness of text representation schemes on the text classification task, such as aggressive text detection, a special case of Hate speech from social media. Aggression levels are categorized into three predefined classes, namely: ‘Non-aggressive’ (NAG), ‘Overtly Aggressive’ (OAG), and ‘Covertly Aggressive’ (CAG). Various text representation schemes based on BoW techniques, word embedding, contextual word embedding, sentence embedding on traditional classifiers, and deep neural models are compared on a text classification problem. The weighted score is used as a primary evaluation metric. The results show that text representation using Googles’ universal sentence encoder (USE) performs better than word embedding and BoW techniques on traditional classifiers, such as SVM, while pre-trained word embedding models perform better on classifiers based on the deep neural models on the English dataset. Recent pre-trained transfer learning models like Elmo, ULMFi, and BERT are fine-tuned for the aggression classification task. However, results are not at par with the pre-trained word embedding model. Overall, word embedding using pre-trained fastText vectors produces the best weighted -score than Word2Vec and Glove. On the Hindi dataset, BoW techniques perform better than word embeddings on traditional classifiers such as SVM. In contrast, pre-trained word embedding models perform better on classifiers based on the deep neural nets. Statistical significance tests are employed to ensure the significance of the classification results. Deep neural models are more robust against the bias induced by the training dataset. They perform substantially better than traditional classifiers, such as SVM, logistic regression, and Naive Bayes classifiers on the Twitter test dataset.","PeriodicalId":15677,"journal":{"name":"Journal of Experimental & Theoretical Artificial Intelligence","volume":"32 1","pages":"499 - 525"},"PeriodicalIF":1.7000,"publicationDate":"2021-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Experimental & Theoretical Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1080/0952813X.2021.1907792","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 8

Abstract

Modeling text in a numerical representation is a prime task for any Natural Language Processing downstream task such as text classification. This paper studies the effectiveness of text representation schemes on a text classification task, namely aggressive text detection, a special case of hate speech detection on social media. Aggression levels are categorized into three predefined classes: 'Non-aggressive' (NAG), 'Overtly Aggressive' (OAG), and 'Covertly Aggressive' (CAG). Various text representation schemes (BoW techniques, word embeddings, contextual word embeddings, and sentence embeddings) are compared on both traditional classifiers and deep neural models. The weighted F1-score is used as the primary evaluation metric. The results show that, on the English dataset, text representation using Google's Universal Sentence Encoder (USE) performs better than word embedding and BoW techniques on traditional classifiers such as SVM, while pre-trained word embedding models perform better on classifiers based on deep neural models. Recent pre-trained transfer-learning models such as ELMo, ULMFiT, and BERT are fine-tuned for the aggression classification task; however, their results are not on par with those of the pre-trained word embedding models. Overall, word embeddings using pre-trained fastText vectors produce a better weighted F1-score than Word2Vec and GloVe. On the Hindi dataset, BoW techniques perform better than word embeddings on traditional classifiers such as SVM, whereas pre-trained word embedding models perform better on classifiers based on deep neural nets. Statistical significance tests are employed to verify the significance of the classification results. Deep neural models are more robust to the bias induced by the training dataset and perform substantially better than traditional classifiers such as SVM, logistic regression, and Naive Bayes on the Twitter test dataset.
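To make the comparison concrete, the following is a minimal illustrative sketch (not the authors' exact pipeline) of how a BoW (TF-IDF) representation and Universal Sentence Encoder features can each be fed to a linear SVM and scored with a weighted F1 metric. It assumes scikit-learn and tensorflow_hub are installed; the toy posts and labels are hypothetical placeholders for the NAG/OAG/CAG classes.

```python
# Illustrative sketch only: compare TF-IDF (BoW) vs. Universal Sentence Encoder
# features on a linear SVM, evaluated with weighted F1. Toy data is hypothetical.
import tensorflow_hub as hub
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Hypothetical labelled posts for the three aggression classes.
train_texts = ["you are all wonderful", "I will find you and hurt you",
               "people like you never learn"]
train_labels = ["NAG", "OAG", "CAG"]
test_texts = ["have a great day", "watch your back"]
test_labels = ["NAG", "OAG"]

# BoW (TF-IDF) features + linear SVM.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(train_texts)
X_test_bow = vectorizer.transform(test_texts)
bow_clf = LinearSVC().fit(X_train_bow, train_labels)
bow_f1 = f1_score(test_labels, bow_clf.predict(X_test_bow), average="weighted")

# Universal Sentence Encoder features + linear SVM.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
X_train_use = use(train_texts).numpy()
X_test_use = use(test_texts).numpy()
use_clf = LinearSVC().fit(X_train_use, train_labels)
use_f1 = f1_score(test_labels, use_clf.predict(X_test_use), average="weighted")

print(f"weighted F1  BoW+SVM: {bow_f1:.2f}   USE+SVM: {use_f1:.2f}")
```

The same feature matrices could be swapped into other classifiers (logistic regression, Naive Bayes) or replaced by averaged fastText/Word2Vec/GloVe vectors to reproduce the kind of representation-versus-classifier grid the paper evaluates.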
Source journal metrics
CiteScore: 6.10
Self-citation rate: 4.50%
Articles published per year: 89
Review time: >12 weeks
About the journal: Journal of Experimental & Theoretical Artificial Intelligence (JETAI) is a world-leading journal dedicated to publishing high-quality, rigorously reviewed, original papers in artificial intelligence (AI) research. The journal features work in all subfields of AI research and accepts both theoretical and applied research. Topics covered include, but are not limited to: cognitive science, games, learning, knowledge representation, memory and neural system modelling, perception, and problem-solving.