A Robustly Optimized BERT using Random Oversampling for Analyzing Imbalanced Stock News Sentiment Data

2023 International Conference on Computer Science, Information Technology and Engineering (ICCoSITE). Pub Date: 2023-02-16. DOI: 10.1109/ICCoSITE57641.2023.10127725

Salsabila Mazya Permataning Tyas, R. Sarno, Agus Tri Haryono, Kelly Rossa Sungkono


Citations: 1

Abstract


Stock news is one of the information sources that can be used to monitor stock prices. It usually carries positive and negative sentiment that can affect stock prices, so sentiment analysis is needed to process it. The stock news dataset is taken from Kaggle, and its positive and negative sentiment classes are imbalanced. This research proposes handling the imbalance with random oversampling, which works by randomly replicating samples of the minority class. It presents several text pre-processing scenarios with different stages, with the aim of achieving high accuracy. The classification method used in this paper is RoBERTa, a robustly optimized variant of BERT (Bidirectional Encoder Representations from Transformers). The paper also compares against Machine Learning (ML) baselines, namely Multinomial Naïve Bayes, Bernoulli Naïve Bayes, Support Vector Machine, Random Forest Classifier, and Logistic Regression, using two text representations, TF-IDF and Word2Vec. The best result is obtained using RoBERTa with the fourth pre-processing scenario, which only removes hashtags and skips punctuation removal, number removal, number conversion, stop-word removal, and lemmatization. This configuration achieves 0.85 precision, 0.84 recall, 0.84 F1-score, and 86% accuracy.
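The random oversampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `random_oversample`, the fixed seed, and the toy headlines are assumptions made for illustration, and the sketch balances every class up to the size of the largest one, which the abstract does not specify.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_texts, out_labels = list(texts), list(labels)
    for cls, n in counts.items():
        # All existing samples of this class.
        pool = [t for t, y in zip(texts, labels) if y == cls]
        # Draw with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=target - n)
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


# Toy, made-up headlines (not taken from the Kaggle dataset).
texts = ["shares surge", "profit jumps", "stock rallies", "record high", "shares plunge"]
labels = ["positive", "positive", "positive", "positive", "negative"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
print(Counter(balanced_labels))
```

Because the added rows are exact copies, oversampling is commonly applied to the training split only, so that duplicates do not leak into evaluation data. The abstract does not say how the authors split their data.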
