Handling missing values in mixed panel data: a comparison of different techniques

Cumhur Ekinci, Mustafa Abdullah Hakkoz, Unsal Kiran, Sirma Seker
Pressacademia, vol. 42 3. Published 2024-02-01. DOI: 10.17261/pressacademia.2023.1869 (https://doi.org/10.17261/pressacademia.2023.1869)

Abstract

Purpose- This study compares the performance of alternative imputation techniques for missing data. It differs from the existing literature by proposing an appropriate technique for mixed data on companies' financial performance and environmental, social and governance (ESG) metrics. In addition to simple imputation techniques, we use machine learning techniques that can handle more complex data.

Methodology- We first employ ad-hoc methods such as mean, median, mode, constant, most-frequent and regression imputation. We then turn to multivariate imputation techniques such as multiple imputation by chained equations (MICE). Finally, we run imputation methods based on machine learning (ML), such as K-nearest neighbor (KNN), Ridge and Random Forest. To account for the assumptions about missing data, we first check the normality of the variables with the Kolmogorov-Smirnov test and apply Rubin's classification, which relates the probability of missingness to the variables. The success of an imputation technique changes depending on how the missing data are classified by randomness under Rubin's scheme. Consequently, we apply listwise deletion at various levels as well as alternative imputation techniques, and then compare their performance. The raw data contain parametric as well as categorical variables (binary and others). They include yearly financial time series such as sales and total assets obtained from financial statements, ESG scores, and float ratios for firms from several countries and industries. Imputation is performed on randomly selected samples covering 5% to 30% of the dataset, and the results are compared with the true data using accuracy or other measures such as root mean square error (RMSE) and mean absolute percentage error (MAPE). Several robustness checks supplement the analysis.

Findings- The results show that ML methods such as KNN outperform the others. Moreover, prediction performance improves when the multidimensional nature of the data is taken into account. Hence, an optimum can be reached depending on the parameters.

Conclusion- Based on the analysis, we conclude that the choice of imputation technique and the way it is employed both matter for attaining higher accuracy and better prediction of the missing values in the selected mixed panel data in finance.

Keywords: Imputation techniques, Panel data, Machine learning, Financial performance, ESG
JEL Codes: C55, C81, M14, Q51
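
The evaluation protocol outlined in the abstract (hide a share of known values at random, impute them, then score the imputations against the true values with RMSE and MAPE) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the data are synthetic, the helper names (`mask_at_random`, `mean_impute`, `knn_impute`) are hypothetical, only numeric columns are handled, and the KNN step is a simplified pure-Python variant, not a library routine.

```python
import math
import random

def mask_at_random(rows, fraction, rng):
    """Hide `fraction` of all cells completely at random (assumed mechanism)."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = rng.sample(cells, round(fraction * len(cells)))
    masked = [list(r) for r in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden

def mean_impute(masked):
    """Fill each missing cell with the mean of its column's observed values."""
    out = [list(r) for r in masked]
    for j in range(len(masked[0])):
        observed = [r[j] for r in masked if r[j] is not None]
        col_mean = sum(observed) / len(observed)
        for r in out:
            if r[j] is None:
                r[j] = col_mean
    return out

def knn_impute(masked, k=3):
    """Fill each missing cell with the average over the k nearest rows that
    observe that column; distance uses only columns observed in both rows."""
    fallback = mean_impute(masked)  # used when a row has no usable neighbour
    out = [list(r) for r in masked]
    for i, row in enumerate(masked):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_idx, other in enumerate(masked):
                if other_idx == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            out[i][j] = sum(nearest) / len(nearest) if nearest else fallback[i][j]
    return out

def rmse(true_rows, imputed, hidden):
    """Root mean square error over the hidden cells only."""
    return math.sqrt(
        sum((true_rows[i][j] - imputed[i][j]) ** 2 for i, j in hidden) / len(hidden))

def mape(true_rows, imputed, hidden):
    """Mean absolute percentage error over the hidden cells (true values != 0)."""
    return 100 * sum(abs((true_rows[i][j] - imputed[i][j]) / true_rows[i][j])
                     for i, j in hidden) / len(hidden)

# Synthetic stand-in for firm-level data: three positive, correlated columns.
data_rng = random.Random(0)
true_rows = []
for _ in range(40):
    x = data_rng.uniform(10, 100)
    true_rows.append([x, 2 * x + data_rng.gauss(0, 2), 0.5 * x + data_rng.gauss(0, 1)])

# Repeat the experiment for missingness rates between 5% and 30%.
results = {}
for fraction in (0.05, 0.10, 0.20, 0.30):
    masked, hidden = mask_at_random(true_rows, fraction, random.Random(1))
    by_mean, by_knn = mean_impute(masked), knn_impute(masked, k=3)
    results[fraction] = {
        "mean_rmse": rmse(true_rows, by_mean, hidden),
        "knn_rmse": rmse(true_rows, by_knn, hidden),
        "mean_mape": mape(true_rows, by_mean, hidden),
        "knn_mape": mape(true_rows, by_knn, hidden),
    }
    print(fraction, {name: round(score, 2) for name, score in results[fraction].items()})
```

The synthetic columns are strongly correlated by construction, so the neighbour-based fill has an advantage over the column mean here; the sketch says nothing about the results reported in the paper. With real mixed panel data, columns would typically need to be standardized before computing distances, and categorical variables would need a separate treatment scored by accuracy.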