Authors: Lionel Reinhart Halim, A. Suryadibrata
Journal: IJNMT (International Journal of New Media Technology)
Published: 2021-06-27
DOI: 10.31937/ijnmt.v8i1.2047
Citations: 4
Cyberbullying Sentiment Analysis with Word2Vec and One-Against-All Support Vector Machine
Depression and social anxiety are the two main negative impacts of cyberbullying. A survey conducted by UNICEF on 3 September 2019 showed that 1 in 3 young people in 30 countries had been victims of cyberbullying. This study applies sentiment analysis to detect comments that contain cyberbullying. The dataset is the Toxic Comment Classification Challenge dataset, obtained from Kaggle. Pre-processing consists of four stages: comment generalization (converting text to lowercase and removing punctuation), tokenization, stop-word removal, and lemmatization. Word2Vec is used as the word embedding for the sentiment analysis. A One-Against-All (OAA) scheme with Support Vector Machine (SVM) models then produces multi-label predictions. The SVM model is tuned with Randomized Search CV. Evaluation uses the micro-averaged F1 score to assess prediction accuracy and Hamming loss to measure the proportion of sample-label pairs that are incorrectly classified. The Word2Vec and OAA SVM model gives its best result on data that has passed through all four pre-processing stages (comment generalization, tokenization, stop-word removal, and lemmatization) and is represented by 100 features in the Word2Vec model. The tuned model achieves a micro-averaged F1 score of 83.40% and a Hamming loss of 15.13%.
Index Terms— Sentiment Analysis; Word Embedding; Word2Vec; One-Against-All; Support Vector Machine; Toxic Comment Classification Challenge; Multi Labelling
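The two evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes binary indicator matrices (one row per comment, one column per label). The function name `micro_f1_and_hamming` and the toy label matrices are hypothetical and are not taken from the paper's code or data.

```python
from typing import List, Tuple


def micro_f1_and_hamming(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Tuple[float, float]:
    """Return (micro-averaged F1, Hamming loss) for binary multi-label data."""
    tp = fp = fn = wrong = total = 0
    # Pool every (sample, label) pair into one set of counts.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            if t != p:
                wrong += 1
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    # Hamming loss: share of (sample, label) pairs predicted incorrectly.
    loss = wrong / total if total else 0.0
    return f1, loss


# Hypothetical toy data: 3 comments, 6 labels each.
y_true = [[1, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0]]
y_pred = [[1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0], [1, 1, 1, 0, 1, 0]]

micro_f1, hamming = micro_f1_and_hamming(y_true, y_pred)
print(f"micro F1 = {micro_f1:.4f}, Hamming loss = {hamming:.4f}")
```

Micro-averaging pools true positives, false positives, and false negatives across all labels before computing F1, so frequent labels weigh more than rare ones. Hamming loss counts every wrong cell of the label matrix, including false alarms on otherwise clean comments.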