D. Purwitasari, D. A. Navastara, Y. Findawati, Kresna Adhi Pramana, Agus Budi Raharjo
2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE). Published 2023-02-16. DOI: 10.1109/ICCoSITE57641.2023.10127774 (https://doi.org/10.1109/ICCoSITE57641.2023.10127774)
Feature Extraction in Hierarchical Multi-Label Classification for Dangerous Speech Identification on Twitter Texts
Dangerous speech is a severe form of hate speech that causes negative impacts such as violence, crime, social pressure, trauma, and despair, and it can lead to conflicts between groups. Raw Twitter text must be preprocessed to obtain proper training datasets for identifying hate speech, and dangerous speech in particular. One reason is that hate speech is often expressed through mentions or hashtags. Because the context of a raw Twitter post varies and the post may or may not be hate speech, the problem is framed as hierarchical multi-label classification with three label types: hate speech status, issue, and danger level. The issues considered in this work are religion, ethnicity, and others. After preprocessing, the word-embedding stage includes data under-sampling because the dataset is imbalanced. Additional stop-word dictionaries are also incorporated to handle language-specific vocabulary. To observe the effects of preprocessing on the classification problem, machine learning and deep learning methods such as SVM, Bi-LSTM, and BERT are explored. After hyper-parameter tuning, we evaluated the models with subset accuracy and Hamming loss, chosen because the data are imbalanced, in addition to micro- and macro-averaged F1 scores.