Addressing missing values in routine health information system data: an evaluation of imputation methods using data from the Democratic Republic of the Congo during the COVID-19 pandemic.

Population Health Metrics · IF 3.2 · Zone 2 (Medicine) · JCR Q2 (Public, Environmental & Occupational Health) · Pub Date: 2021-11-04 · DOI: 10.1186/s12963-021-00274-z
Shuo Feng, Celestin Hategeka, Karen Ann Grépin
Citations: 8

Abstract

Background: Poor data quality is limiting the use of data sourced from routine health information systems (RHIS), especially in low- and middle-income countries. An important component of this data quality issue comes from missing values, where health facilities, for a variety of reasons, fail to report to the central system.

Methods: Using data from the health management information system in the Democratic Republic of the Congo and the advent of the COVID-19 pandemic as an illustrative case study, we implemented seven commonly used imputation methods and evaluated their performance in terms of minimizing bias in imputed values and in parameter estimates generated through subsequent analytical techniques, namely segmented regression, which is widely used in interrupted time series studies, and pre-post-comparisons through paired Wilcoxon rank-sum tests. We also examined the performance of these imputation methods under different missingness mechanisms and tested their stability to changes in the data.
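
The segmented regression mentioned above is commonly written as y_t = b0 + b1*t + b2*post_t + b3*(t - T0)*post_t, where T0 is the first post-interruption period. The following is a minimal sketch of that model, assuming a single interruption, ordinary least squares, and no autocorrelation or seasonal terms; the abstract does not give the authors' exact specification. The function names and the toy series are hypothetical and exist only for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_segmented_regression(y, interruption):
    """Fit y_t = b0 + b1*t + b2*post_t + b3*(t - interruption)*post_t by OLS.

    `interruption` is the index of the first post-interruption period.
    Returns [b0, b1, b2, b3]: baseline level, baseline trend,
    level change, and trend change.
    """
    rows = []
    for t in range(len(y)):
        post = 1.0 if t >= interruption else 0.0
        rows.append([1.0, float(t), post, post * (t - interruption)])
    k = 4
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yt for r, yt in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noiseless monthly series: level 100, trend +2 per month,
# then a level drop of 30 and a trend change of -1 from month 12 onwards.
series = [100 + 2 * t + (-30 - 1 * (t - 12) if t >= 12 else 0) for t in range(24)]
coefs = fit_segmented_regression(series, interruption=12)
print([round(c, 6) for c in coefs])
```

In this setup, bias from an imputation method would show up as a shift in the level-change and trend-change coefficients (b2 and b3) relative to the fit on complete data.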

Results: For regression analyses, when the data contained less than 20% missing values, the coefficient estimates did not differ substantially across methods, with the exception of mean imputation and of exclusion and interpolation. However, as the missing proportion grew, k-NN started to produce biased estimates. Machine learning algorithms, i.e. missForest and k-NN, were also found to lack robustness to small changes in the data or consecutive missingness. On the other hand, multiple imputation methods generated the least biased estimates overall and were the most robust to all changes in data. They also produced smaller standard errors than single imputations. For pre-post-comparisons, all methods produced p values less than 0.01, regardless of the amount of missingness introduced, suggesting low sensitivity of Wilcoxon rank-sum tests to the imputation method used.
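
With multiple imputation, the same model is fitted to each of the m completed datasets and the m results are combined into one estimate and one standard error. The abstract does not say how the authors pooled their results; the sketch below assumes the conventional Rubin's rules, and the function name `pool_rubin` and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = mean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the estimates (m - 1).
    between = variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical level-change estimates from five imputed datasets.
level_changes = [-29.0, -31.0, -30.0, -32.0, -28.0]
standard_errors = [4.0, 4.0, 4.0, 4.0, 4.0]
pooled_estimate, pooled_se = pool_rubin(level_changes, standard_errors)
print(pooled_estimate, round(pooled_se, 4))
```

The between-imputation term is what carries the uncertainty due to the missing values into the pooled standard error.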

Conclusions: We recommend the use of multiple imputation in addressing missing values in RHIS datasets and appropriate handling of data structure to minimize imputation standard errors. In cases where necessary computing resources are unavailable for multiple imputation, one may consider seasonal decomposition as the next best method. Mean imputation and exclusion and interpolation, however, always produced biased and misleading results in the subsequent analyses, and thus, their use in the handling of missing values should be discouraged.
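
To illustrate why a seasonal approach can behave differently from mean imputation, here is a deliberately simplified sketch: it estimates a seasonal offset per position in the cycle, linearly interpolates the deseasonalized series, and adds the offset back. This is a stand-in for the idea, not the seasonal decomposition procedure used in the paper, and all function names and the toy series are hypothetical.

```python
def interpolate_linear(values):
    """Fill None gaps by linear interpolation; edge gaps copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def impute_seasonal(values, period=12):
    """Remove a per-phase seasonal offset, interpolate, then add it back."""
    observed = [v for v in values if v is not None]
    overall = sum(observed) / len(observed)
    seasonal = []
    for phase in range(period):
        same_phase = [v for v in values[phase::period] if v is not None]
        phase_mean = sum(same_phase) / len(same_phase) if same_phase else overall
        seasonal.append(phase_mean - overall)
    adjusted = [
        None if v is None else v - seasonal[i % period]
        for i, v in enumerate(values)
    ]
    smooth = interpolate_linear(adjusted)
    return [s + seasonal[i % period] for i, s in enumerate(smooth)]


def impute_mean(values):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    overall = sum(observed) / len(observed)
    return [overall if v is None else v for v in values]


# Hypothetical series with a 4-period cycle; the true value at index 6 is 30.
reports = [10, 20, 30, 20, 10, 20, None, 20, 10, 20, 30, 20]
seasonal_filled = impute_seasonal(reports, period=4)
mean_filled = impute_mean(reports)
print(round(seasonal_filled[6], 2), round(mean_filled[6], 2))
```

On this toy series the seasonal sketch recovers the peak value, while mean imputation pulls it toward the overall average; real RHIS data are noisier, so the gap will not be this clean.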

Source journal: Population Health Metrics (Public, Environmental & Occupational Health)
CiteScore: 6.50
Self-citation rate: 0.00%
Articles published: 21
Review time: 29 weeks
About the journal: Population Health Metrics aims to advance the science of population health assessment, and welcomes papers relating to concepts, methods, ethics, applications, and summary measures of population health. The journal provides a unique platform for population health researchers to share their findings with the global community. We seek research that addresses the communication of population health measures and policy implications to stakeholders; this includes papers related to burden estimation and risk assessment, and research addressing population health across the full range of development. Population Health Metrics covers a broad range of topics encompassing health state measurement and valuation, summary measures of population health, descriptive epidemiology at the population level, burden of disease and injury analysis, disease and risk factor modeling for populations, and comparative assessment of risks to health at the population level. The journal is also interested in how to use and communicate indicators of population health to reduce disease burden, and the approaches for translating from indicators of population health to health-advancing actions. As a cross-cutting topic of importance, we are particularly interested in inequalities in population health and their measurement.