Intelligent Data Processing Methods for the Atypical Values Correction of Stock Quotes

IF 7.6 | Zone 1 (Economics) | Q1 Economics | Review of Economics and Statistics | Pub Date: 2022-04-06 | DOI: 10.21686/2500-3925-2022-2-

T. Zolotova, D. A. Volkova



Abstract

Purpose of the study. The study compares methods for correcting atypical values in statistical data on the stock market and develops recommendations for their use.

Materials and methods. The article reviews Russian and foreign literature on the research problem and proposes machine learning methods for detecting and correcting outliers in time series. The mathematical basis comprises the Z-score method, the isolation forest method, and the support vector method for outlier detection, and the winsorization and multiple imputation methods for outlier correction. The models were built in Jupyter Notebook using the Python programming language, and the methods were applied to stock quote data from the Moscow Exchange.

Results. The machine learning algorithms are demonstrated on real data: the closing prices of shares of three Russian companies, “Sberbank”, “Aeroflot”, and “Gazprom”, from 01.12.2019 to 30.11.2020, obtained from the website of the investment company “FINAM”. The methods for detecting and correcting outliers are compared by standard deviation. The Z-score statistical method measures precisely how far a suspicious observation lies from the center of the distribution, which is its advantage. Its disadvantage is that outliers themselves influence the mean and standard deviation, which can mask outliers or cause them to be detected incorrectly. The isolation forest method recognizes outliers of various types and has no parameters that require selection, but it detects outliers more slowly than the other methods. The support vector method is very fast and reduces to a quadratic programming problem, which always has a unique solution. Winsorization reduces the effect of outliers on the mean and variance, which is an advantage, but it may introduce bias through the choice of thresholds that separate observations in the sample. Multiple imputation creates not one but many imputations for each missing value, which avoids systematic error at the cost of high computational expense. For the data used in this work, the best result came from multiple imputation applied to the outliers detected by the support vector method.

Conclusion. Data analysis theory offers no universal method for detecting or eliminating outliers. In general, identifying outliers is subjective, and the decision is made individually for each dataset, taking into account its characteristics or existing experience in the field. The practical implementation of the detection and correction methods used in this work can serve as a tool for calculating more accurate indicators in any area, for example to improve stock price forecasting. Further work may consider optimizing the parameters of the detection and correction methods in order to study their effect on the model results.
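The Z-score rule and the masking effect named in the abstract as its drawback can be illustrated with a minimal sketch. The price list, the threshold of 3, and the helper names `z_scores` and `z_outliers` are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / std for every observation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]


def z_outliers(values, threshold=3.0):
    """Indices whose absolute Z-score exceeds the (assumed) threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]


# Hypothetical closing prices: 20 ordinary days followed by one spike.
prices = [100.0, 101.0, 99.0, 100.5, 99.5] * 4 + [150.0]
single = z_outliers(prices)

# Two more identical spikes pull the mean up and inflate the standard
# deviation, so none of the three spikes crosses the threshold any more.
masked = z_outliers(prices + [150.0, 150.0])

print(single, masked)
```

With one spike the rule flags it, but with three identical spikes the Z-score of each falls below 3 and nothing is flagged, which is the masking behavior the abstract describes.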
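The two model-based detectors named in the abstract might be applied as in the following sketch, which assumes scikit-learn's `IsolationForest` and `OneClassSVM` as the implementations (the abstract does not say which library was used). The synthetic series and the values of `contamination`, `nu`, and `gamma` are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic stand-in for a closing-price series (not the FINAM data):
# 100 values near 100 with two injected atypical values.
rng = np.random.default_rng(0)
prices = 100.0 + rng.normal(0.0, 1.0, size=100)
prices[30] = 150.0
prices[70] = 50.0
X = prices.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# Isolation forest: points that random splits isolate quickly are outliers.
# contamination is an assumed expected share of outliers.
iso = IsolationForest(contamination=0.02, random_state=0)
iso_flags = np.flatnonzero(iso.fit_predict(X) == -1)

# One-class SVM: learns a boundary around the bulk of the data.
# nu is an assumed upper bound on the share of flagged points.
svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
svm_flags = np.flatnonzero(svm.fit_predict(X) == -1)

print("isolation forest:", iso_flags)
print("one-class SVM:  ", svm_flags)
```

Both estimators return -1 for suspected outliers and 1 for ordinary points. The one-class SVM may also flag a few points at the edge of the main cluster, since `nu` controls the share of points allowed outside the boundary.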
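The two correction methods can be sketched as follows. The winsorization here clamps the `k` most extreme values in each tail to their nearest retained neighbors. The multiple imputation is a deliberately simplified stand-in: it treats flagged points as missing, draws `m` replacements from a normal distribution fitted to the remaining points, and pools the draws by averaging. It ignores time dependence and parameter uncertainty, which a full multiple imputation procedure would model. All names, data, and parameter values are assumptions for illustration.

```python
import random
from statistics import mean, pstdev


def winsorize(values, k=1):
    """Clamp the k smallest and k largest values to the nearest kept value."""
    ordered = sorted(values)
    lo, hi = ordered[k], ordered[-k - 1]
    return [min(max(x, lo), hi) for x in values]


def multiple_impute(values, outlier_idx, m=20, seed=0):
    """Build m completed series with random draws at the flagged positions,
    then pool them position by position with the mean."""
    rng = random.Random(seed)
    flagged = set(outlier_idx)
    clean = [x for i, x in enumerate(values) if i not in flagged]
    mu, sigma = mean(clean), pstdev(clean)
    completed = [
        [rng.gauss(mu, sigma) if i in flagged else x
         for i, x in enumerate(values)]
        for _ in range(m)
    ]
    pooled = [mean(column) for column in zip(*completed)]
    return pooled, completed


# Hypothetical prices with atypical values at positions 5 and 7.
prices = [100.0, 101.0, 99.0, 100.5, 99.5, 150.0, 100.0, 50.0, 100.2]

winsorized = winsorize(prices, k=1)
pooled, completed = multiple_impute(prices, outlier_idx=[5, 7])

print(winsorized)
print([round(x, 2) for x in pooled])
```

The choice of `k` plays the role of the threshold selection that the abstract names as a possible source of bias. The loop over `m` completed series shows where the extra computational cost of multiple imputation comes from.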


