The Effect of Data Types' on the Performance of Machine Learning Algorithms for Financial Prediction
Hulusi Mehmet Tanrikulu, Hakan Pabuccu
arXiv - QuantFin - Computational Finance, 2024-04-30
DOI: arxiv-2404.19324 (https://doi.org/arxiv-2404.19324)
Citations: 0
Abstract
Forecasting cryptocurrency prices is an important financial problem because it offers
investors potential financial benefits. A small improvement in forecasting
performance can lead to increased profitability; therefore, obtaining a
realistic forecast is very important for investors. Successful forecasting
provides traders with effective buy-or-hold strategies, allowing them to earn
greater profits. The most important thing in this process is to produce accurate
forecasts suitable for real-life applications. Bitcoin, frequently mentioned
recently due to its volatility and chaotic behavior, has begun to attract great
attention and has become an investment tool, especially during and after the
COVID-19 pandemic. This study provides a comprehensive methodology that includes
constructing continuous and trend data from one-year and seven-year periods of
data as inputs and applying machine learning (ML) algorithms to forecast
Bitcoin price movement. A binarization procedure was applied to the continuous
data to construct trend data representing the trend of each input feature.
Following the related literature, the input features were chosen as
technical indicators, Google Trends, and the number of tweets. Random Forest
(RF), K-Nearest Neighbors (KNN), Extreme Gradient Boosting (XGBoost, XGB),
Support Vector Machine (SVM), Naive Bayes (NB), Artificial Neural Networks
(ANN), and Long Short-Term Memory (LSTM) networks were applied to the selected
features for prediction purposes. This work investigates two main research
questions: i. How does the sample size affect the prediction performance of ML
algorithms? ii. How does the data type affect the prediction performance of ML
algorithms? Accuracy and area under the ROC curve (AUC) values were used to
compare the model performance. A t-test was performed to test the statistical
significance of the prediction results.
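
The binarization and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact binarization rule or the form of the t-test, so the choices here are assumptions: a feature is coded +1 when it rises relative to the previous observation and -1 otherwise, and the t-test is taken to be a paired test over matched accuracy scores. All function names and sample values are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def binarize_trend(values):
    """Assumed rule: +1 if the value rose versus the previous one, else -1."""
    return [1 if cur > prev else -1 for prev, cur in zip(values, values[1:])]


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_t_statistic(a, b):
    """t statistic for paired samples (assumed test form)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical continuous indicator values, converted to trend data.
indicator = [48.0, 52.5, 51.0, 55.2, 55.2, 60.1]
trend = binarize_trend(indicator)

# Hypothetical labels, predictions, and scores for the two metrics.
acc = accuracy([1, -1, 1, 1], [1, 1, 1, -1])
area = auc([1, -1, 1, -1], [0.9, 0.4, 0.6, 0.7])

# Hypothetical accuracies of one model on trend vs. continuous inputs.
t_stat = paired_t_statistic([0.60, 0.62, 0.58, 0.65], [0.55, 0.60, 0.57, 0.59])
```

Under this assumed rule an unchanged value is coded as -1; a study could equally treat ties as a separate class or apply indicator-specific thresholds.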