Evaluating Imputation Methods for rainfall data under high variability in Johor River Basin, Malaysia

Applied Computing and Geosciences, Volume 20, Article 100145 · IF 2.6 · JCR Q2 (Computer Science, Interdisciplinary Applications) · Pub Date: 2023-12-01 · DOI: 10.1016/j.acags.2023.100145

Zulfaqar Sa’adi, Zulkifli Yusop, Nor Eliza Alias, Ming Fai Chow, Mohd Khairul Idlan Muhammad, Muhammad Wafiy Adli Ramli, Zafar Iqbal, Mohammed Sanusi Shiru, Faizal Immaddudin Wira Rohmat, Nur Athirah Mohamad, Mohamad Faizal Ahmad



Abstract

Missing values in rainfall records can lead to erroneous predictions and inefficient management practices, with significant economic, environmental, and social consequences. This is particularly important for rainfall datasets in Peninsular Malaysia (PM), where the high level of missingness can distort the inherent pattern of the highly variable time series. In this work, 21 target rainfall stations in the Johor River Basin (JRB) with daily data between 1970 and 2015 were used to examine 19 different multiple imputation methods, carried out using the Multivariate Imputation by Chained Equations (MICE) package in R. For each station, artificial missing data were added at rates of up to 5%, 10%, 20%, and 30% for three types of missingness, namely Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), leaving the original missing data intact. Imputation quality was evaluated using several statistical performance metrics: mean absolute error (MAE), root mean square error (RMSE), normalized root mean square error (NRMSE), Nash-Sutcliffe efficiency (NSE), modified degree of agreement (MD), coefficient of determination (R²), Kling-Gupta efficiency (KGE), and volumetric efficiency (VE). These were then ranked and aggregated using the compromise programming index (CPI) to select the best method. The results showed that linear regression predicted values (norm.predict) consistently ranked the highest under all types and levels of missingness. For example, under MAR, MNAR, and MCAR, this method showed the lowest MAE values, ranging over 0.78–2.25, 0.93–2.57, and 0.87–2.43, respectively. It also consistently showed higher NSE values (0.71–0.92, 0.6–0.92, and 0.66–0.91) and R² values (0.77–0.92, 0.71–0.93, and 0.75–0.92) under MAR, MCAR, and MNAR, respectively. The mean, rf, and cart methods also appear to be efficient.
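The evaluation loop described above (hide values that are actually known, impute them, then score the imputations against the withheld truth) can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the study's code, which used the MICE package in R: the helper names (`mask_mcar`, `fit_line`, `impute_by_regression`) are hypothetical, only the MCAR mechanism is shown, the regression uses a single neighbouring gauge as predictor, and the rainfall values are made up for the demonstration.

```python
import math
import random


def mask_mcar(values, rate, seed=0):
    """Return a copy of values with round(rate * n) entries set to None (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(rate * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(values)]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute_by_regression(target, neighbour):
    """Fill None entries of target with the regression prediction from neighbour."""
    pairs = [(n, t) for n, t in zip(neighbour, target) if t is not None]
    intercept, slope = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [intercept + slope * n if t is None else t
            for n, t in zip(neighbour, target)]


def mae(obs, sim):
    """Mean absolute error."""
    return sum(abs(o - s) for o, s in zip(obs, sim)) / len(obs)


def rmse(obs, sim):
    """Root mean square error."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect match, 0 matches the mean of obs."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


if __name__ == "__main__":
    # Made-up daily totals (mm) for two nearby gauges; not data from the study.
    neighbour = [0.0, 2.0, 5.0, 1.0, 12.0, 0.0, 7.0, 3.0, 20.0, 4.0]
    truth = [0.5, 2.4, 4.1, 1.2, 10.9, 0.0, 6.6, 2.1, 17.8, 4.4]
    masked = mask_mcar(truth, rate=0.3, seed=1)
    filled = impute_by_regression(masked, neighbour)
    # Score only the positions that were artificially hidden.
    hidden = [i for i, v in enumerate(masked) if v is None]
    obs = [truth[i] for i in hidden]
    sim = [filled[i] for i in hidden]
    print("MAE:", mae(obs, sim), "RMSE:", rmse(obs, sim), "NSE:", nse(obs, sim))
```

Scoring only the artificially hidden positions keeps the comparison fair, since entries that were never masked are reproduced exactly by any method. MAR and MNAR masks would replace the uniform sampling in `mask_mcar` with a probability that depends on another variable or on the rainfall value itself.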
Using the compromise programming index (CPI) as a decision-support tool enabled an objective assessment of the output from the multiple performance metrics for ranking and selecting the top-performing method. During validation, the Probability Density Function (PDF) demonstrated that even with up to 30% missingness, the shape of the distribution after imputation was retained relative to the actual data. The methodology proposed in this study can help in choosing suitable imputation methods for other tropical rainfall datasets, leading to improved accuracy in rainfall estimation and prediction.
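To illustrate how a compromise programming index can collapse several metrics into one ranking, the sketch below measures each method's distance from an ideal point across all metrics and ranks by the smallest distance. The abstract does not give the exact formulation, so the min-max scaling, the exponent `p = 1`, and the function names here are assumptions; the method labels and scores are placeholders, not results from the study.

```python
def compromise_programming_index(scores, ideals, p=1):
    """Distance of each method from the ideal point; smaller is better.

    scores: {method: {metric: value}}
    ideals: {metric: ideal value}, e.g. 0 for MAE and 1 for NSE
    """
    # Range of each metric across methods, used to put metrics on a common scale.
    ranges = {}
    for metric in ideals:
        values = [row[metric] for row in scores.values()]
        ranges[metric] = max(values) - min(values)

    cpi = {}
    for method, row in scores.items():
        total = 0.0
        for metric, ideal in ideals.items():
            if ranges[metric] == 0:
                continue  # this metric does not separate the methods
            total += (abs(ideal - row[metric]) / ranges[metric]) ** p
        cpi[method] = total ** (1.0 / p)
    return cpi


def rank_methods(cpi):
    """Method names ordered from best (lowest CPI) to worst."""
    return sorted(cpi, key=cpi.get)


if __name__ == "__main__":
    # Placeholder scores for three hypothetical methods.
    scores = {
        "method_a": {"MAE": 1.0, "NSE": 0.9},
        "method_b": {"MAE": 2.0, "NSE": 0.7},
        "method_c": {"MAE": 3.0, "NSE": 0.5},
    }
    ideals = {"MAE": 0.0, "NSE": 1.0}
    cpi = compromise_programming_index(scores, ideals)
    print(rank_methods(cpi))
```

Because every metric is expressed as a distance from its own ideal, error metrics (ideal 0) and efficiency metrics (ideal 1) can be combined without flipping signs by hand.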


