Guidance on the construction and selection of relatively simple to complex data-driven models for multi-task streamflow forecasting

IF 3.9 | Zone 3, Environmental Science and Ecology | Q1 ENGINEERING, CIVIL | Stochastic Environmental Research and Risk Assessment | Pub Date: 2024-07-17 | DOI: 10.1007/s00477-024-02776-2

Trung Duc Tran, Jongho Kim


Citations: 0

Abstract




With the goal of forecasting streamflow time series with sufficient lead time, we evaluate the efficiency and accuracy of data-driven models ranging from relatively simple to complex. On this basis, we systematically explain the model construction and selection process according to lead time, type and amount of data, and optimization method. The analysis involved optimizing the inputs and hyperparameters of four distinct data-driven models: Autoregressive Integrated Moving Average (ARIMA), Artificial Neural Network (ANN), Long Short-Term Memory (LSTM), and Transformer (TRANS), which were applied to the Soyang watershed, South Korea. The type and amount of model inputs are determined through a fine-tuning process that samples candidate inputs based on a correlation threshold, their correlation with the predictand, and their autocorrelation with historical data, and then evaluates the simulated objective function. Hyperparameters are simultaneously optimized using three conventional optimization methods: Bayesian optimization (BO), particle swarm optimization (PSO), and gray wolf optimization (GWO). The experimental results provide insight into the role of input predictors, data preparation (e.g., wavelet transform), hyperparameter optimization, and model structures. From this, we can provide guidelines for model selection. Relatively simple models can be used when the dataset is small or there are few input variables, when only the near future is predicted, or when the selection of optimization methods is limited. However, a more complex model should be selected if the type and amount of data are sufficient, various optimization methods can be applied, or it is necessary to secure more lead time. More parameters, more complex model structures, and more training data make this possible.
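The input-selection step described in the abstract (sampling candidate inputs by a correlation threshold, correlation with the predictand, and autocorrelation with historical data) can be illustrated with a minimal sketch. This is not the authors' procedure: the function names, the synthetic rainfall and streamflow series, the maximum lag of 5, and the threshold of 0.3 are all assumptions chosen only for illustration.

```python
import numpy as np


def lagged_correlation(x, y, lag):
    """Pearson correlation between x observed `lag` steps earlier and y (lag >= 1)."""
    return float(np.corrcoef(x[:-lag], y[lag:])[0, 1])


def select_inputs(candidates, target, max_lag, threshold):
    """Return (name, lag, r) for every lagged candidate with |r| >= threshold.

    `candidates` maps a variable name to a 1-D array aligned with `target`.
    Passing the target itself as a candidate screens its own autocorrelation.
    """
    selected = []
    for name, series in candidates.items():
        for lag in range(1, max_lag + 1):
            r = lagged_correlation(series, target, lag)
            if abs(r) >= threshold:
                selected.append((name, lag, round(r, 3)))
    return selected


# Synthetic data (hypothetical, not from the Soyang watershed).
rng = np.random.default_rng(0)
n = 2000
rain = rng.gamma(shape=2.0, scale=1.0, size=n)
noise = rng.normal(0.0, 0.1, size=n)
unrelated = rng.normal(size=n)

# Flow depends on its own previous value and on rain two steps earlier.
flow = np.zeros(n)
for t in range(2, n):
    flow[t] = 0.5 * flow[t - 1] + 0.8 * rain[t - 2] + noise[t]

candidates = {"rain": rain, "unrelated": unrelated, "flow": flow}
selected = select_inputs(candidates, flow, max_lag=5, threshold=0.3)

for name, lag, r in selected:
    print(f"{name} at lag {lag}: r = {r}")
```

In a fuller workflow, each candidate input set produced this way would be used to train a model and scored with the objective function, and the threshold would itself be tuned alongside the hyperparameters; that loop is omitted here.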


Source journal

Stochastic Environmental Research and Risk Assessment (Environmental Science - Engineering: Environmental)

CiteScore: 7.10
Self-citation rate: 9.50%
Articles published: 189
Review time: 3.8 months

Journal description: Stochastic Environmental Research and Risk Assessment (SERRA) publishes research papers, reviews and technical notes on stochastic and probabilistic approaches to environmental sciences and engineering, including interactions of earth and atmospheric environments with people and ecosystems. The basic idea is to bring together research papers on stochastic modelling in various fields of environmental sciences and to provide an interdisciplinary forum for the exchange of ideas, for communicating on issues that cut across disciplinary barriers, and for the dissemination of stochastic techniques used in different fields to the community of interested researchers. Original contributions will be considered that deal with modelling (theoretical and computational), measurements and instrumentation in one or more of the following topical areas:

- Spatiotemporal analysis and mapping of natural processes.
- Enviroinformatics.
- Environmental risk assessment, reliability analysis and decision making.
- Surface and subsurface hydrology and hydraulics.
- Multiphase porous media domains and contaminant transport modelling.
- Hazardous waste site characterization.
- Stochastic turbulence and random hydrodynamic fields.
- Chaotic and fractal systems.
- Random waves and seafloor morphology.
- Stochastic atmospheric and climate processes.
- Air pollution and quality assessment research.
- Modern geostatistics.
- Mechanisms of pollutant formation, emission, exposure and absorption.
- Physical, chemical and biological analysis of human exposure from single and multiple media and routes; control and protection.
- Bioinformatics.
- Probabilistic methods in ecology and population biology.
- Epidemiological investigations.
- Models using stochastic differential equations or stochastic partial differential equations.
