A Sentiment Analysis of COVID-19 Tweets Data Using Different Word Embedding Techniques

2023 International Research Conference on Smart Computing and Systems Engineering (SCSE). Pub Date: 2023-06-29. DOI: 10.1109/SCSE59836.2023.10215046

U.M.M.P.K. Nawarathne, H. Kumari

Abstract

The COVID-19 pandemic, which began in 2019, caused many casualties and considerable mental distress. During the pandemic, people were confined to limit the spread of the virus. In this isolation, many turned to social media platforms such as Twitter to express their views. This study therefore analyzed tweets related to COVID-19. First, text preprocessing techniques were applied and sentiment labels were assigned. Machine learning (ML) models were then trained on the data, including Multinomial Naïve Bayes (MNB), Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), k-Nearest Neighbours (KNN), Logistic Regression (LR), Extreme Gradient Boosting (XGB), and CatBoost (CB). During training, the text was represented with word embedding techniques such as Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, Global Vectors for Word Representation (GloVe), Bidirectional Encoder Representations from Transformers (BERT), and Robustly Optimized BERT-Pretraining Approach (RoBERTa). The models were evaluated on accuracy, macro-averaged precision, macro-averaged recall, and macro-averaged F1-score. According to the results, the CB model with RoBERTa achieved an accuracy of 97%. It can therefore be concluded that CB with RoBERTa provides better results when classifying tweet data.
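As an illustration of the evaluation metrics named in the abstract, the following is a minimal sketch of how accuracy and macro-averaged precision, recall, and F1-score can be computed from true and predicted sentiment labels. It is not the authors' code. The function name `macro_scores`, the three sentiment classes, and the toy labels are assumptions made for this example; the abstract does not state which label set or tooling was used.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging computes each metric per class and then takes the
    unweighted mean, so every class counts equally regardless of size.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[guess] += 1   # predicted class gets a false positive
            fn[truth] += 1   # true class gets a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Hypothetical labels for six tweets (illustrative only, not data from the study).
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
scores = macro_scores(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because macro averaging weights each class equally, it can differ noticeably from accuracy when the sentiment classes are imbalanced, which is one reason to report both.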
