The prediction of new Covid-19 cases in Poland with machine learning models

Statistics in Transition (Q4, Mathematics). Pub Date: 2023-03-15. DOI: 10.59170/stattrans-2023-020
Adam Chwila
Citations: 0

Abstract

The COVID-19 pandemic has had a huge impact both on the global economy and on everyday life in countries all over the world. In this paper, we propose several machine learning approaches to forecasting new confirmed COVID-19 cases, including LASSO regression, gradient boosted (GB) regression trees, support vector regression (SVR), and a long short-term memory (LSTM) neural network. These methods are applied in two variants: to data prepared for the whole of Poland and to data prepared separately for each of the 16 voivodeships (NUTS 2 regions). All models were trained in two variants: with 5-fold time-series cross-validation and with a split into a single train subset and a single test subset. The computations used official statistics from government reports covering April 2020 to March 2022. We propose a setup of 16 model selection scenarios to identify the model with the best ex-post prediction accuracy. The scenarios differ in the following features: the machine learning model, the method of hyperparameter selection, and the data setup. The most accurate scenario for the LASSO and SVR approaches is the single train/test split with data for the whole of Poland, while for the LSTM and GB trees it is cross-validation with data for the whole of Poland. Among the best scenarios for each model, the lowest ex-post RMSE is obtained for the SVR. For the model that performs best in terms of ex-post RMSE, the outcome is interpreted with Shapley values. Shapley values make it possible to present the impact of the auxiliary variables in the machine learning model on the actual predicted value. Knowledge of the factors that have the strongest impact on the number of new infections can help companies plan their economic activity during the turbulent times of a pandemic. We propose to identify and compare the most important variables for both the train and test datasets of the model.
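The scenario setup described in the abstract can be sketched as follows. This is a minimal illustration and not the author's implementation: the function names are hypothetical, and the expanding-window fold scheme is an assumption, since the abstract says only "5-fold time-series cross-validation" without naming the exact splitting rule. It uses only the Python standard library.

```python
from itertools import product
from math import sqrt

# Labels taken from the abstract; the constant names are illustrative.
MODELS = ["LASSO", "GB trees", "SVR", "LSTM"]
TUNING = ["5-fold time-series CV", "single train/test split"]
DATA_SETUPS = ["whole of Poland", "16 voivodeships"]


def scenario_grid():
    """Enumerate every (model, tuning method, data setup) combination."""
    return list(product(MODELS, TUNING, DATA_SETUPS))


def expanding_window_folds(n_obs, n_folds=5):
    """Yield (train_indices, validation_indices) pairs in time order.

    Assumption: an expanding-window scheme, in which each fold trains on
    all observations that precede its validation block, so no future
    observation leaks into training.
    """
    block = n_obs // (n_folds + 1)
    if block == 0:
        raise ValueError("need at least n_folds + 1 observations")
    for k in range(1, n_folds + 1):
        train_end = n_obs - (n_folds - k + 1) * block
        yield list(range(train_end)), list(range(train_end, train_end + block))


def rmse(actual, predicted):
    """Root mean squared error of predictions against observed values."""
    if len(actual) != len(predicted):
        raise ValueError("length mismatch")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return sqrt(squared / len(actual))
```

Here `len(scenario_grid())` is 16, matching the 4 models, 2 hyperparameter selection methods, and 2 data setups named in the abstract. Under the assumed scheme, each fold's validation block lies strictly after its training window, which is what distinguishes time-series cross-validation from ordinary shuffled k-fold.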
Source journal
Statistics in Transition
Subject area: Decision Sciences (Statistics, Probability and Uncertainty)
CiteScore: 1.00
Self-citation rate: 0.00%
Articles published: 0
Review time: 9 weeks
Journal description: Statistics in Transition (SiT) is an international journal published jointly by the Polish Statistical Association (PTS) and the Central Statistical Office of Poland (CSO/GUS), which sponsors the publication. Launched in 1993, it was issued twice a year until 2006; since then it has appeared, under a slightly changed title, Statistics in Transition new series, three times a year, and after 2013 as a regular quarterly journal. The journal provides a forum for the exchange of ideas and experience among members of the international community of statisticians, data producers and users, including researchers, teachers, policy makers and the general public. Its initially dominant focus on statistical issues pertinent to the transition from a centrally planned to a market-oriented economy has gradually been extended to embrace statistical problems related to the development and modernization of the system of public (official) statistics in general.