A Sentiment Analysis of COVID-19 Tweets Data Using Different Word Embedding Techniques

U.M.M.P.K. Nawarathne, H. Kumari
Published in: 2023 International Research Conference on Smart Computing and Systems Engineering (SCSE)
Publication date: 2023-06-29
DOI: https://doi.org/10.1109/SCSE59836.2023.10215046
Citations: 0

Abstract

The COVID-19 virus, which spread across the world from 2019, caused many casualties and considerable mental distress. During the pandemic, people were confined to limit the spread of the virus. Because of this isolation, people used social media platforms such as Twitter to express their ideas. This study therefore analyzed tweets related to COVID-19. First, text preprocessing techniques were applied and sentiment labels were assigned. Machine learning (ML) models were then trained on the data, including Multinomial Naïve Bayes (MNB), Random Forest (RF), Support Vector Machine (SVM), Decision Tree (DT), k-Nearest Neighbours (KNN), Logistic Regression (LR), Extreme Gradient Boosting (XGB), and CatBoost (CB). During training, word embedding techniques such as Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, Global Vectors for Word Representation (GloVe), Bidirectional Encoder Representations from Transformers (BERT), and Robustly Optimized BERT Pretraining Approach (RoBERTa) were used to represent the text. Accuracy, macro-averaged precision, macro-averaged recall, and macro-averaged F1-score were calculated to evaluate the models. According to the results, the CB model using RoBERTa achieved an accuracy of 97%. It can therefore be concluded that CB with RoBERTa provides better results when classifying tweet data.
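
The abstract names its evaluation metrics but does not give their formulas. The following is a minimal sketch of accuracy and macro-averaged precision, recall, and F1-score under the usual definition, in which each metric is computed per class and then averaged with equal weight per class. The function name `macro_scores`, the three sentiment labels, and the toy predictions are hypothetical illustrations; they are not the authors' code, label set, or data.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging computes each metric per class and then takes the
    unweighted mean, so rare classes count as much as frequent ones.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class received a wrong hit
            fn[truth] += 1   # true class was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        # Mean of per-class F1, not the F1 of the macro precision and recall.
        "macro_f1": sum(f1s) / n,
    }


# Hypothetical toy labels, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
for name, value in macro_scores(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

Because every class carries equal weight, a macro-averaged score can fall well below accuracy when a classifier does poorly on a minority sentiment class, which is one reason to report both.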