Revitalizing temperature records: A novel framework towards continuous data reconstruction using univariate and multivariate imputation techniques

Atmospheric Research (IF 4.4; Zone 2, Earth Sciences; Q1, METEOROLOGY & ATMOSPHERIC SCIENCES). Pub Date: 2024-12-15. Epub Date: 2024-11-02. DOI: 10.1016/j.atmosres.2024.107754

Hanumapura Kumaraswamy Yashas Kumar, Kumble Varija

Atmospheric Research, Volume 312, Article 107754. Journal Article. https://www.sciencedirect.com/science/article/pii/S0169809524005362

Citations: 0

Abstract

Data gaps are a recurring challenge in climate research, hindering effective time series analysis and modeling. This study proposes a novel two-step data imputation framework for temperature time series in which a long continuous gap at the target station is surrounded by predictor stations with sporadic missingness. The method first applies iterative gap-filling Singular Spectrum Analysis (SSA) to the small sporadic gaps. Once these are filled, it applies multivariate techniques such as Inverse Distance Weightage (IDW), Kriging, Spatial Regression Test (SRT), Point Estimation method of Biased Sentinel Hospital-based Area Disease Estimation (P-BSHADE), Random Forest (RF), Support Vector Machines (SVM), and MissForest (MF) to impute the long continuous gap. To prioritize accuracy, comprehensive cross-validation with class-based statistical indicators is employed to minimize any potential biases introduced by the imputation process. The study shows the effectiveness of SSA in filling small sporadic gaps using an optimal window length (M ≈ 365 days) and eigentriple grouping (ET = 30). Notably, for maximum temperature, P-BSHADE and SVM achieve strong accuracy (e.g., Legates's Coefficient of Efficiency (LCE) of 0.75∼0.44 and Combined Performance Index (CPI) of 6.3%∼19.1%), attributed to their ability to capture spatial and/or temporal heterogeneity. While SRT and P-BSHADE offer acceptable performance for minimum temperature (e.g., LCE of 0.51∼0.27 and CPI of 0.7%∼23.7%), the study also uncovers a complex interplay among missing data, predictor stations, and autocorrelation that affects imputation accuracy. This suggests that the reduced performance of certain techniques likely stems from a decline in spatial and spatiotemporal autocorrelation between the target station and its predictors. Overall, this study presents a promising framework for handling complex missing data scenarios often encountered in climate time series analysis, paving the way for more robust and reliable analysis and modeling.
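The first step of the framework, iterative SSA gap filling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (ssa_reconstruct, ssa_gap_fill, legates_ce) are hypothetical, ET is assumed here to mean the number of leading eigentriples retained, and LCE is assumed to be the absolute-value form of the coefficient of efficiency, 1 - sum|O - P| / sum|O - mean(O)|. For daily data, the values reported in the abstract would correspond to window=365 and n_components=30.

```python
import numpy as np


def ssa_reconstruct(x, window, n_components):
    """Rank-truncated SSA reconstruction of a complete 1-D series."""
    n = len(x)
    k = n - window + 1
    # Trajectory (Hankel) matrix: column j holds x[j : j + window].
    traj = np.column_stack([x[j:j + window] for j in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    r = min(n_components, len(s))
    approx = (u[:, :r] * s[:r]) @ vt[:r]
    # Diagonal averaging maps the low-rank matrix back to a series.
    recon = np.zeros(n)
    counts = np.zeros(n)
    for j in range(k):
        recon[j:j + window] += approx[:, j]
        counts[j:j + window] += 1
    return recon / counts


def ssa_gap_fill(x, window, n_components, max_iter=50, tol=1e-4):
    """Fill NaN gaps by repeatedly replacing them with the SSA reconstruction."""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    # Initial guess (an assumption of this sketch): the mean of observed values.
    filled = np.where(missing, np.nanmean(x), x)
    if not missing.any():
        return filled
    for _ in range(max_iter):
        recon = ssa_reconstruct(filled, window, n_components)
        change = np.max(np.abs(recon[missing] - filled[missing]))
        # Only gap positions are updated; observed values are never altered.
        filled[missing] = recon[missing]
        if change < tol:
            break
    return filled


def legates_ce(obs, pred):
    """Coefficient of efficiency in absolute-value form (assumed LCE definition)."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return 1.0 - np.sum(np.abs(obs - pred)) / np.sum(np.abs(obs - obs.mean()))
```

A typical check would mask known values in a complete series, call ssa_gap_fill on the masked copy, and score the filled values against the withheld ones with legates_ce. The multivariate second step (IDW, Kriging, P-BSHADE, and the others) is not covered by this sketch.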




Source journal

Atmospheric Research (Earth Sciences: Meteorology & Atmospheric Sciences)

CiteScore: 9.40
Self-citation rate: 10.90%
Articles published: 460
Review time: 47 days

Journal description: The journal publishes scientific papers (research papers, review articles, letters and notes) dealing with the part of the atmosphere where meteorological events occur. Attention is given to all processes extending from the earth surface to the tropopause, but special emphasis continues to be devoted to the physics of clouds, mesoscale meteorology and air pollution, i.e. atmospheric aerosols; microphysical processes; cloud dynamics and thermodynamics; numerical simulation, climatology, climate change and weather modification.