Handling missing values in mixed panel data: a comparison of different techniques

Pressacademia · Pub Date: 2024-02-01 · DOI: 10.17261/pressacademia.2023.1869

Cumhur Ekinci, Mustafa Abdullah Hakkoz, Unsal Kiran, Sirma Seker

Abstract

Purpose- This study compares the performance of alternative imputation techniques for missing data. It distinguishes itself from the rest of the literature by proposing an appropriate technique for mixed data on the financial performance and the environmental, social and governance (ESG) metrics of companies. In addition to simple imputation techniques, we use machine learning techniques, which can handle more complex data.

Methodology- We first employ ad-hoc methods such as mean, median, mode (most frequent value), constant and regression imputation. We then turn to multivariate imputation techniques such as multiple imputation by chained equations (MICE). Finally, we run imputation methods based on machine learning (ML) models such as K-nearest neighbors (KNN), Ridge and Random Forest. To account for the assumptions about missing data, we first check the normality of the variables with the Kolmogorov-Smirnov test and then apply Rubin’s classification, which relates the probability that a value is missing to the variables in the data. The success of an imputation technique changes depending on how random the missing data are under Rubin’s classification. Consequently, we apply listwise deletion at various levels alongside the alternative imputation techniques and compare their performance. The raw data contain parametric as well as categorical variables (binary and others). They include yearly time series of financial items such as sales and total assets obtained from financial statements, ESG scores, and float ratios for firms from several countries and industries. Imputation is done randomly on a sample varying from 5% to 30% of the dataset, and the results are compared to the true data based on accuracy or on other measures such as root mean square error (RMSE) or mean absolute percentage error (MAPE). Several robustness checks supplement the analysis.

Findings- The results show that ML methods such as KNN outperform the others. Moreover, prediction performance improves when the multidimensional nature of the data is taken into account. Hence, an optimal setup can be identified depending on the chosen parameters.

Conclusion- Based on the analysis, we conclude that both the choice of imputation technique and the way it is applied matter for achieving higher accuracy and better prediction of the missing values in the selected mixed panel data in finance.

Keywords: Imputation techniques, Panel data, Machine learning, Financial performance, ESG

JEL Codes: C55, C81, M14, Q51
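
The evaluation procedure described in the methodology (hide a random share of known values, impute them, then score the imputations against the true values) can be illustrated with a small sketch. This is not the authors' code or data. It assumes a synthetic table of four correlated numeric columns, uses only the Python standard library, and compares a simplified mean imputation with a simplified KNN imputation by RMSE and MAPE. All function names, the value of k, the random seed and the three masking fractions are hypothetical choices made for illustration.

```python
import math
import random


def mask_at_random(rows, fraction, rng):
    """Copy `rows`, hide `fraction` of all cells (set to None), return copy and hidden cells."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = rng.sample(cells, int(round(fraction * len(cells))))
    masked = [list(r) for r in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def impute_mean(masked):
    """Fill each missing cell with the mean of the observed values in its column."""
    means = []
    for j in range(len(masked[0])):
        observed = [r[j] for r in masked if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in masked]


def impute_knn(masked, k=5):
    """Fill each missing cell with the average of that column over the k nearest rows.

    Distance is the mean squared difference over the columns both rows have observed.
    Falls back to the column mean when no comparable row exists.
    """
    fallback = impute_mean(masked)
    filled = [list(r) for r in masked]
    for i, row in enumerate(masked):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other in masked:
                if other is row or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    candidates.append((sum(shared) / len(shared), other[j]))
            nearest = sorted(candidates)[:k]
            if nearest:
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
            else:
                filled[i][j] = fallback[i][j]
    return filled


def rmse(truth, filled, hidden):
    """Root mean square error over the hidden cells only."""
    return math.sqrt(sum((truth[i][j] - filled[i][j]) ** 2 for i, j in hidden) / len(hidden))


def mape(truth, filled, hidden):
    """Mean absolute percentage error over the hidden cells (true values must be nonzero)."""
    return 100 * sum(abs((truth[i][j] - filled[i][j]) / truth[i][j])
                     for i, j in hidden) / len(hidden)


def run_experiment(seed=0, n_rows=200, fractions=(0.05, 0.15, 0.30)):
    """Mask synthetic data at several rates and score both imputers against the truth."""
    rng = random.Random(seed)
    truth = []
    for _ in range(n_rows):
        base = rng.uniform(10, 20)  # shared factor that makes the columns correlated
        truth.append([base + rng.gauss(0, 0.5) for _ in range(4)])
    results = {}
    for fraction in fractions:
        masked, hidden = mask_at_random(truth, fraction, rng)
        results[fraction] = {
            name: (rmse(truth, filled, hidden), mape(truth, filled, hidden))
            for name, filled in (("mean", impute_mean(masked)), ("knn", impute_knn(masked)))
        }
    return results


results = run_experiment()
for fraction, scores in results.items():
    for name, (r, m) in scores.items():
        print(f"{fraction:.0%} masked  {name:<4}  RMSE={r:.3f}  MAPE={m:.2f}%")
```

Because the synthetic columns share a common factor, a neighbor-based method can exploit the other columns of a row, which a column mean cannot. That is only an illustration of the mechanism, not a reproduction of the study's results. The sketch covers numeric columns only. Categorical variables would need a most-frequent-value or classifier-based imputer scored by accuracy, and real panel data would also call for scaling the features and respecting the firm and year structure.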
