Feature Extraction in Hierarchical Multi-Label Classification for Dangerous Speech Identification on Twitter Texts

D. Purwitasari, D. A. Navastara, Y. Findawati, Kresna Adhi Pramana, Agus Budi Raharjo
{"title":"Feature Extraction in Hierarchical Multi-Label Classification for Dangerous Speech Identification on Twitter Texts","authors":"D. Purwitasari, D. A. Navastara, Y. Findawati, Kresna Adhi Pramana, Agus Budi Raharjo","doi":"10.1109/ICCoSITE57641.2023.10127774","DOIUrl":null,"url":null,"abstract":"Dangerous speech is a strong hate speech that causes negative impacts, such as violence, crime, social pressure, trauma, and despair, and can lead to conflicts between groups. Raw data of Twitter texts need the necessary preprocess to obtain the proper training datasets for hate speech or even dangerous one. One reason is how to express hate speech related to mentions or hashtags. Because of the variants of context messages in raw Twitter posts which could be hate speech or not, the problem here is hierarchical and multi-label classification with three label types of hate speech status, issues, and dangerous levels. The issues in this work are about religion, ethnicity, and others. After handling preprocess, the word embedding includes data under-sampling because the dataset is not balanced. Additional stop-word dictionaries to overcome language-related vocabularies are also incorporated. To observe the preprocess effects in the classification problem, some methods of machine learning and deep learning, such as SVM, Bi-LSTM, and BERT are explored. 
Then we examined after hyper-parameter settings with performance indicators of subset accuracy and Hamming lost for imbalanced, in addition to F1 scores of micro and macro averages.","PeriodicalId":256184,"journal":{"name":"2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCoSITE57641.2023.10127774","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Dangerous speech is a severe form of hate speech that causes negative impacts such as violence, crime, social pressure, trauma, and despair, and can lead to conflicts between groups. Raw Twitter texts require preprocessing to obtain proper training datasets for hate speech, and even more so for dangerous speech. One reason is that hate speech is often expressed through mentions or hashtags. Because the context of messages in raw Twitter posts varies, and a post may or may not be hate speech, the problem here is a hierarchical multi-label classification with three label types: hate speech status, issue, and danger level. The issues considered in this work are religion, ethnicity, and others. After preprocessing, the word-embedding stage includes data under-sampling because the dataset is imbalanced. Additional stop-word dictionaries are also incorporated to handle language-specific vocabulary. To observe the effects of preprocessing on the classification problem, machine learning and deep learning methods such as SVM, Bi-LSTM, and BERT are explored. After hyper-parameter tuning, we evaluated the models using subset accuracy and Hamming loss, which are suited to imbalanced data, in addition to micro- and macro-averaged F1 scores.
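The four evaluation measures named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the label layout (hate speech, religion, ethnicity, other, dangerous) and the toy predictions are assumptions made up for illustration, and the danger level is simplified to a single binary flag.

```python
from typing import List, Sequence, Tuple

# Hypothetical label layout (an assumption, not taken from the paper):
# [hate_speech, religion, ethnicity, other, dangerous]
Labels = Sequence[int]


def subset_accuracy(y_true: List[Labels], y_pred: List[Labels]) -> float:
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def hamming_loss(y_true: List[Labels], y_pred: List[Labels]) -> float:
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def label_counts(y_true: List[Labels], y_pred: List[Labels], j: int) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for label column j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_micro(y_true: List[Labels], y_pred: List[Labels]) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    tp = fp = fn = 0
    for j in range(len(y_true[0])):
        c_tp, c_fp, c_fn = label_counts(y_true, y_pred, j)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return f1_from_counts(tp, fp, fn)


def f1_macro(y_true: List[Labels], y_pred: List[Labels]) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    per_label = [f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(per_label) / n_labels


# Toy data, invented for illustration only.
Y_TRUE = [[1, 1, 0, 0, 1], [1, 0, 1, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 1, 0]]
Y_PRED = [[1, 1, 0, 0, 1], [1, 0, 1, 0, 1], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0]]

if __name__ == "__main__":
    print("subset accuracy:", subset_accuracy(Y_TRUE, Y_PRED))
    print("Hamming loss:   ", hamming_loss(Y_TRUE, Y_PRED))
    print("micro F1:       ", round(f1_micro(Y_TRUE, Y_PRED), 4))
    print("macro F1:       ", round(f1_macro(Y_TRUE, Y_PRED), 4))
```

On this toy data, two of the four tweets are matched exactly (subset accuracy 0.5), while only 2 of the 20 label slots are wrong (Hamming loss 0.1). The macro average is pulled down by the rare "other" label, which is missed entirely, so it falls below the micro average. This gap is why both averages are usually reported for imbalanced label sets.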