The prediction of new Covid-19 cases in Poland with machine learning models

Statistics in Transition (Q4, Mathematics) | Pub Date: 2023-03-15 | DOI: 10.59170/stattrans-2023-020

Adam Chwila



Abstract

The COVID-19 pandemic has had a huge impact on both the global economy and everyday life in countries all over the world. In this paper, we propose several machine learning approaches to forecasting new confirmed COVID-19 cases: LASSO regression, Gradient Boosted (GB) regression trees, Support Vector Regression (SVR), and a Long Short-Term Memory (LSTM) neural network. The methods are applied in two variants: to data prepared for the whole of Poland and to data prepared separately for each of the 16 voivodeships (NUTS 2 regions). All models are trained in two variants: with 5-fold time-series cross-validation and with a single split into train and test subsets. The computations use official statistics from government reports covering the period from April 2020 to March 2022. We propose a setup of 16 model selection scenarios to detect the model with the best ex-post prediction accuracy. The scenarios differ in the machine learning model, the hyperparameter selection method, and the data setup. The most accurate scenario for the LASSO and SVR approaches is the single train/test split with data for the whole of Poland, while for the LSTM and GB trees it is cross-validation with data for the whole of Poland. Among the best scenarios for each model, the lowest ex-post RMSE is obtained for the SVR. For the model performing best in terms of ex-post RMSE, the outcome is interpreted using Shapley values, which make it possible to present the impact of auxiliary variables in the machine learning model on the actual predicted value. Knowledge of the factors that have the strongest impact on the number of new infections can help companies plan their economic activity during the turbulent times of a pandemic. We propose to identify and compare the most important variables for both the train and test datasets of the model.
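The scenario setup and the evaluation scheme described in the abstract can be sketched in a few lines of Python. This is an illustration only, not the author's code. It assumes that the 16 scenarios are the full grid of 4 models, 2 hyperparameter selection methods and 2 data setups, which is one reading of the abstract. It also assumes expanding-window folds with equal-sized test blocks, the convention used by scikit-learn's TimeSeriesSplit. All identifiers (scenario_grid, time_series_folds, rmse and the label strings) are hypothetical.

```python
from itertools import product
from math import sqrt

# Labels follow the abstract; the identifier strings are illustrative.
MODELS = ["LASSO", "GB", "SVR", "LSTM"]
TUNING = ["time_series_cv_5_fold", "single_train_test_split"]
DATA_SETUPS = ["whole_poland", "per_voivodeship"]


def scenario_grid():
    """Enumerate every (model, tuning method, data setup) combination."""
    return list(product(MODELS, TUNING, DATA_SETUPS))


def time_series_folds(n_samples, n_splits=5):
    """Return expanding-window (train, test) index lists.

    Each test block lies strictly after its training block in time, so
    no future observation leaks into training. Assumption: equal-sized
    test blocks taken from the end of the series.
    """
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough samples for the requested splits")
    folds = []
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        folds.append((train_idx, test_idx))
    return folds


def rmse(actual, predicted):
    """Root mean squared error between realized and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(squared / len(actual))


if __name__ == "__main__":
    print(len(scenario_grid()), "scenarios")
    for train_idx, test_idx in time_series_folds(12):
        print("train up to", train_idx[-1], "-> test", test_idx)
```

In an actual study, each scenario would fit its model on the training indices of every fold, compute the RMSE on the matching test indices, and compare the scenarios by their ex-post RMSE. The model fitting step is left out here on purpose.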




Source journal

Statistics in Transition (Decision Sciences: Statistics, Probability and Uncertainty)

CiteScore: 1.00
Self-citation rate: 0.00%
Review time: 9 weeks

About the journal: Statistics in Transition (SiT) is an international journal published jointly by the Polish Statistical Association (PTS) and the Central Statistical Office of Poland (CSO/GUS), which sponsors the publication. Launched in 1993, it was issued twice a year until 2006; since then it has appeared, under a slightly changed title, Statistics in Transition new series, three times a year, and after 2013 as a regular quarterly journal. The journal provides a forum for the exchange of ideas and experience among members of the international community of statisticians, data producers and users, including researchers, teachers, policy makers and the general public. Its initially dominant focus on statistical issues pertinent to the transition from a centrally planned to a market-oriented economy has gradually been extended to embrace statistical problems related to the development and modernization of the system of public (official) statistics in general.