A generative model for evaluating missing data methods in large epidemiological cohorts.

BMC Medical Research Methodology · IF 3.4 · CAS Tier 3 (Medicine) · JCR Q1, Health Care Sciences & Services · Pub Date: 2025-02-08 · DOI: 10.1186/s12874-025-02487-4
Lav Radosavljević, Stephen M Smith, Thomas E Nichols
{"title":"A generative model for evaluating missing data methods in large epidemiological cohorts.","authors":"Lav Radosavljević, Stephen M Smith, Thomas E Nichols","doi":"10.1186/s12874-025-02487-4","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The potential value of large scale datasets is constrained by the ubiquitous problem of missing data, arising in either a structured or unstructured fashion. When imputation methods are proposed for large scale data, one limitation is the simplicity of existing evaluation methods. Specifically, most evaluations create synthetic data with only a simple, unstructured missing data mechanism which does not resemble the missing data patterns found in real data. For example, in the UK Biobank missing data tends to appear in blocks, because non-participation in one of the sub-studies leads to missingness for all sub-study variables.</p><p><strong>Methods: </strong>We propose a tool for generating mixed type missing data mimicking key properties of a given real large scale epidemiological data set with both structured and unstructured missingness while accounting for informative missingness. The process involves identifying sub-studies using hierarchical clustering of missingness patterns and modelling the dependence of inter-variable correlation and co-missingness patterns.</p><p><strong>Results: </strong>On the UK Biobank brain imaging cohort, we identify several large blocks of missing data. We demonstrate the use of our tool for evaluating several imputation methods, showing modest accuracy of imputation overall, with iterative imputation having the best performance. 
We compare our evaluations based on synthetic data to an exemplar study which includes variable selection on a single real imputed dataset, finding only small differences between the imputation methods though with iterative imputation leading to the most informative selection of variables.</p><p><strong>Conclusions: </strong>We have created a framework for simulating large scale data with that captures the complexities of the inter-variable dependence as well as structured and unstructured informative missingness. Evaluations using this framework highlight the immense challenge of data imputation in this setting and the need for improved missing data methods.</p>","PeriodicalId":9114,"journal":{"name":"BMC Medical Research Methodology","volume":"25 1","pages":"34"},"PeriodicalIF":3.4000,"publicationDate":"2025-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11806830/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Research Methodology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12874-025-02487-4","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Background: The potential value of large scale datasets is constrained by the ubiquitous problem of missing data, arising in either a structured or unstructured fashion. When imputation methods are proposed for large scale data, one limitation is the simplicity of existing evaluation methods. Specifically, most evaluations create synthetic data with only a simple, unstructured missing data mechanism which does not resemble the missing data patterns found in real data. For example, in the UK Biobank, missing data tends to appear in blocks, because non-participation in one of the sub-studies leads to missingness for all sub-study variables.

Methods: We propose a tool for generating mixed type missing data mimicking key properties of a given real large scale epidemiological data set with both structured and unstructured missingness while accounting for informative missingness. The process involves identifying sub-studies using hierarchical clustering of missingness patterns and modelling the dependence of inter-variable correlation and co-missingness patterns.
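The clustering step described above can be illustrated with a minimal sketch (this is not the authors' code; the toy data, threshold, and linkage choice are assumptions): variables are grouped by the similarity of their binary missingness indicators, so that block-missing "sub-studies" emerge as clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Toy data: 200 participants, 6 variables. Variables 0-2 form one
# hypothetical "sub-study" (missing together for non-participants),
# variables 3-5 another.
X = rng.normal(size=(200, 6))
substudy_a = rng.random(200) < 0.3   # 30% skipped sub-study A
substudy_b = rng.random(200) < 0.5   # 50% skipped sub-study B
X[substudy_a, 0:3] = np.nan
X[substudy_b, 3:6] = np.nan

# Missingness indicator matrix: rows = variables, columns = participants.
M = np.isnan(X).T

# Cluster variables by the Jaccard distance between their missingness
# patterns; complete linkage keeps clusters tight. Cutting the dendrogram
# at 0.5 recovers the two sub-study blocks.
Z = linkage(pdist(M, metric="jaccard"), method="complete")
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # variables 0-2 share one label, variables 3-5 another
```

The Jaccard distance is a natural choice here because co-missing variables have near-identical indicator vectors (distance near 0), while unrelated blocks overlap only by chance.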

Results: On the UK Biobank brain imaging cohort, we identify several large blocks of missing data. We demonstrate the use of our tool for evaluating several imputation methods, showing modest accuracy of imputation overall, with iterative imputation having the best performance. We compare our evaluations based on synthetic data to an exemplar study which includes variable selection on a single real imputed dataset, finding only small differences between the imputation methods though with iterative imputation leading to the most informative selection of variables.

Conclusions: We have created a framework for simulating large scale data that captures the complexities of the inter-variable dependence as well as structured and unstructured informative missingness. Evaluations using this framework highlight the immense challenge of data imputation in this setting and the need for improved missing data methods.
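The kind of evaluation loop the abstract describes can be sketched as follows (a generic illustration, not the paper's pipeline; the data model, block-missingness mechanism, and RMSE metric are assumptions): simulate correlated data, mask a block of values, impute with competing methods, and score against the held-out truth.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(1)
n, p = 300, 5

# Correlated Gaussian data, so model-based imputation has signal to exploit.
cov = 0.6 * np.ones((p, p)) + 0.4 * np.eye(p)
truth = rng.multivariate_normal(np.zeros(p), cov, size=n)

# Block missingness: a hypothetical "sub-study" comprising the last two
# variables, skipped by 40% of participants.
X = truth.copy()
skip = rng.random(n) < 0.4
X[skip, 3:] = np.nan

def rmse(imputed):
    # Score only the masked block, where the true values are known.
    return float(np.sqrt(np.mean((imputed[skip, 3:] - truth[skip, 3:]) ** 2)))

scores = {
    "mean": rmse(SimpleImputer(strategy="mean").fit_transform(X)),
    "iterative": rmse(IterativeImputer(random_state=0).fit_transform(X)),
}
print(scores)
```

With inter-variable correlation present, iterative (regression-based) imputation should score below mean imputation, mirroring the abstract's finding that iterative imputation performed best.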

Source journal: BMC Medical Research Methodology (Medicine · Health Care Sciences & Services)
CiteScore: 6.50
Self-citation rate: 2.50%
Articles per year: 298
Review time: 3-8 weeks
Journal description: BMC Medical Research Methodology is an open access journal publishing original peer-reviewed research articles in methodological approaches to healthcare research. Articles on the methodology of epidemiological research, clinical trials and meta-analysis/systematic review are particularly encouraged, as are empirical studies of the associations between choice of methodology and study outcomes. BMC Medical Research Methodology does not aim to publish articles describing scientific methods or techniques: these should be directed to the BMC journal covering the relevant biomedical subject area.
Latest articles from this journal:
- Participant recruitment and retention in a longitudinal study: experience from SARS-CoV-2 cohort in Ethiopia.
- From raw clinical data to robust prediction: an AI framework for early lymphedema detection.
- A feature selection-based oblique hyperplane for oblique random survival forests.
- Validation of algorithms for identifying people living with HIV in French medico-administrative databases: implications for HIV surveillance.
- A scoping review of statistical methods for the analysis of method comparison studies with repeated measurements of clinical data.