Feature Extraction in Hierarchical Multi-Label Classification for Dangerous Speech Identification on Twitter Texts

2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE). Pub Date: 2023-02-16. DOI: 10.1109/ICCoSITE57641.2023.10127774

D. Purwitasari, D. A. Navastara, Y. Findawati, Kresna Adhi Pramana, Agus Budi Raharjo

Abstract

Dangerous speech is a severe form of hate speech that causes negative impacts such as violence, crime, social pressure, trauma, and despair, and can lead to conflicts between groups. Raw Twitter texts need preprocessing to yield proper training datasets for identifying hate speech, or even dangerous speech. One reason is the way hate speech is expressed through mentions or hashtags. Because the context of raw Twitter posts varies, and a post may or may not be hate speech, the problem here is a hierarchical multi-label classification with three label types: hate speech status, issue, and danger level. The issues in this work are religion, ethnicity, and others. After preprocessing, the word embedding step includes data under-sampling because the dataset is imbalanced. Additional stop-word dictionaries are also incorporated to handle language-specific vocabulary. To observe the effects of preprocessing on the classification problem, several machine learning and deep learning methods, such as SVM, Bi-LSTM, and BERT, are explored. After hyper-parameter tuning, we evaluated the models with subset accuracy and Hamming loss, chosen because the data are imbalanced, in addition to micro- and macro-averaged F1 scores.
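The evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the label columns (`hate_speech`, `religion`, `ethnicity`, `dangerous`) and the toy predictions are assumptions made up for illustration, and the functions use only the standard definitions of subset accuracy, Hamming loss, and micro- and macro-averaged F1 for binary multi-label indicators.

```python
from typing import Sequence, Tuple

Rows = Sequence[Sequence[int]]  # one row per tweet, one 0/1 entry per label


def subset_accuracy(y_true: Rows, y_pred: Rows) -> float:
    """Fraction of tweets whose whole label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def hamming_loss(y_true: Rows, y_pred: Rows) -> float:
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def label_counts(y_true: Rows, y_pred: Rows, j: int) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for label column j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true: Rows, y_pred: Rows) -> float:
    """Pool the counts over all labels, then compute a single F1."""
    counts = [label_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp, fp, fn = (sum(c[k] for c in counts) for k in range(3))
    return f1_from_counts(tp, fp, fn)


def macro_f1(y_true: Rows, y_pred: Rows) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    n_labels = len(y_true[0])
    scores = [f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(scores) / n_labels


# Hypothetical columns: [hate_speech, religion, ethnicity, dangerous]
y_true = [[1, 1, 0, 1], [1, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]
y_pred = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0]]

print("subset accuracy:", subset_accuracy(y_true, y_pred))
print("Hamming loss:   ", hamming_loss(y_true, y_pred))
print("micro F1:       ", round(micro_f1(y_true, y_pred), 3))
print("macro F1:       ", round(macro_f1(y_true, y_pred), 3))
```

On this toy data, subset accuracy is 0.5 and Hamming loss is 0.125, while micro F1 (about 0.833) is higher than macro F1 (about 0.667) because the rare `dangerous` label is missed entirely. Macro averaging weights every label equally, which is why it is often reported alongside micro averaging when labels are imbalanced.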
