Xiaoying Pan , Hao Wang , Mingzhu Lei , Tong Ju , Lin Bai
Journal of Computational Science, Volume 83, Article 102472. Published 2024-11-08. DOI: 10.1016/j.jocs.2024.102472. Publication type: Journal Article. Impact factor: 3.1. JCR: Q2 (Computer Science, Interdisciplinary Applications). Source: https://www.sciencedirect.com/science/article/pii/S1877750324002655
A method for filling missing values in multivariate sequence bidirectional recurrent neural networks based on feature correlations
Real-world multivariate time series often contain missing values, which can degrade subsequent prediction tasks. Traditional imputation methods generally consider only some of the characteristics of multivariate time series, which can easily lead to inaccurate imputation. This paper proposes a feature correlation-based bidirectional recurrent network (BRNN-FR) for imputing missing values in multivariate sequence data. First, a bidirectional prediction network based on time intervals is designed; it uses forward and reverse temporal information between data points to capture, as far as possible, how the data change over time. Second, to account for correlations between features, a combined feature selection strategy based on the Pearson correlation coefficient and mutual information is proposed, and a multiple regression model is built to predict features from one another. Finally, a bidirectional network ensemble imputation algorithm based on the relationships between features is established to predict the missing values. Comprehensive experiments on four public datasets show that, in the direct imputation test, BRNN-FR achieves better mean absolute error (MAE), root mean square error (RMSE), and R² score (R2_score) than the comparison methods in most cases. In an indirect comparison, BRNN-FR also achieves a better area under the curve (AUC) for binary classification of in-hospital death on the imputed medical dataset. In regression experiments on the imputed AIR air quality dataset and the ETTH1 power transformer temperature dataset, which predict the average values for the next 3 hours and 6 hours, BRNN-FR obtains most of the best results.
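The feature-correlation step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the Pearson and mutual-information scores are combined, so the rank-sum rule, the histogram-based mutual information estimate, the function names, and the synthetic data below are all assumptions made for illustration. The bidirectional recurrent network itself is omitted; only feature selection, regression-based filling, and the MAE/RMSE check on masked entries are shown.

```python
import numpy as np

def pearson_abs(x, y):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(x, y)[0, 1])

def mutual_info_hist(x, y, bins=8):
    """Histogram-based mutual information estimate in nats (assumed estimator)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def select_features(X, target, k=2, bins=8):
    """Pick k columns most related to `target` by summing the two rankings.

    The rank-sum combination is a hypothetical rule, not taken from the paper.
    """
    obs = ~np.isnan(X).any(axis=1)        # use fully observed rows only
    cands = [j for j in range(X.shape[1]) if j != target]
    r = np.array([pearson_abs(X[obs, j], X[obs, target]) for j in cands])
    mi = np.array([mutual_info_hist(X[obs, j], X[obs, target], bins) for j in cands])
    score = r.argsort().argsort() + mi.argsort().argsort()  # higher is better
    order = np.argsort(-score, kind="stable")
    return [cands[i] for i in order[:k]]

def regress_impute(X, target, features):
    """Fill NaNs in column `target` using a least-squares fit on `features`."""
    X = X.copy()
    miss = np.isnan(X[:, target])
    feats_ok = ~np.isnan(X[:, features]).any(axis=1)
    train = ~miss & feats_ok
    A = np.column_stack([X[train][:, features], np.ones(train.sum())])
    coef, *_ = np.linalg.lstsq(A, X[train, target], rcond=None)
    fill = miss & feats_ok
    B = np.column_stack([X[fill][:, features], np.ones(fill.sum())])
    X[fill, target] = B @ coef
    return X

# Synthetic data (assumption): column 0 depends on columns 1 and 2; column 3 is noise.
rng = np.random.default_rng(0)
n = 500
a, b, noise_col = rng.normal(size=(3, n))
y = 2.0 * a + 1.5 * b + 0.1 * rng.normal(size=n)
truth = np.column_stack([y, a, b, noise_col])

# Mask 20% of column 0 to simulate missing values.
data = truth.copy()
mask = rng.random(n) < 0.2
data[mask, 0] = np.nan

selected = select_features(data, target=0, k=2)
filled = regress_impute(data, target=0, features=selected)

# Direct imputation error on the masked entries only.
err = filled[mask, 0] - truth[mask, 0]
mae = float(np.mean(np.abs(err)))
rmse = float(np.sqrt(np.mean(err ** 2)))
print("selected features:", selected, "MAE:", mae, "RMSE:", rmse)
```

Scoring only the masked entries mirrors the idea of a direct imputation test: values are hidden, filled in, and compared with the held-out ground truth.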
About the journal:
Computational Science is a rapidly growing multi- and interdisciplinary field that uses advanced computing and data analysis to understand and solve complex problems. It has reached a level of predictive capability that now firmly complements the traditional pillars of experimentation and theory.
Recent advances in experimental techniques, such as detectors, on-line sensor networks and high-resolution imaging, have opened up new windows into physical and biological processes at many levels of detail. The resulting data explosion allows for detailed data-driven modeling and simulation.
This new discipline in science combines computational thinking, modern computational methods, devices and collateral technologies to address problems far beyond the scope of traditional numerical methods.
Computational science typically unifies three distinct elements:
• Modeling, Algorithms and Simulations (e.g. numerical and non-numerical, discrete and continuous);
• Software developed to solve science (e.g., biological, physical, and social), engineering, medicine, and humanities problems;
• Computer and information science that develops and optimizes the advanced system hardware, software, networking, and data management components (e.g. problem solving environments).