High-dimensional, outcome-dependent missing data problems: Models for the human KIR loci.

Statistical Methods in Medical Research. IF 1.9; Zone 3 (Medicine); JCR Q3 (Health Care Sciences & Services). Pub Date: 2025-03-01. Epub Date: 2025-01-31. DOI: 10.1177/09622802241304112
Lars Leonardus Joannes van der Burg, Hein Putter, Henning Baldauf, Jürgen Sauter, Johannes Schetelig, Liesbeth C de Wreede, Stefan Böhringer
Statistical Methods in Medical Research, pages 440-456. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951372/pdf/

Abstract

Missing data problems are common in biological, high-dimensional data, where data can be partially or completely missing. Algorithms have been developed to reconstruct the missing values by means of imputation or expectation-maximization algorithms. For missing data problems, it has been suggested that the regression model of interest should be incorporated into the imputation procedure to reduce bias of the regression coefficients. Here we consider a challenging missing data problem, where diplotypes of the KIR loci are to be reconstructed. These loci are difficult to genotype, resulting in ambiguous genotype calls. We extend a previously proposed expectation-maximization algorithm by incorporating a potentially high-dimensional regression model to model the outcome. Three strategies are evaluated: (1) only allelic predictors, (2) allelic predictors and forward-backward selection on haplotype predictors, and (3) penalized regression on a saturated model. In a simulation study, we compared these strategies with a baseline expectation-maximization algorithm without an outcome model. For extreme choices of effect sizes and missingness levels, the outcome-based expectation-maximization algorithms outperformed the no-outcome expectation-maximization algorithm. However, in all other cases, the no-outcome expectation-maximization algorithm performed better than or comparably to the three strategies, suggesting that the outcome model can have a harmful effect. In a data analysis concerning death after allogeneic hematopoietic stem cell transplantation as a function of donor KIR genes, expectation-maximization algorithms with and without an outcome model showed very similar results. In conclusion, outcome-based missing data models in the high-dimensional setting have to be used with care and are likely to lead to biased results.
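To make the contrast between a no-outcome and an outcome-based expectation-maximization algorithm concrete, the following is a minimal toy sketch in Python. It is not the authors' algorithm: it assumes a handful of abstract diplotype categories, a binary outcome, and a saturated per-category Bernoulli outcome model in place of the regression strategies named in the abstract. The function name `em_ambiguous`, the compatibility matrix `compat`, and the toy data are all hypothetical.

```python
import numpy as np

def em_ambiguous(compat, y=None, n_iter=200, eps=1e-9):
    """Toy EM for latent categories known only up to a compatibility set.

    compat : (n, K) boolean array; compat[i, k] is True when category k
             (e.g. a candidate diplotype) is consistent with subject i's
             ambiguous genotype call. Every row needs at least one True.
    y      : optional (n,) array of 0/1 outcomes. When given, the E-step
             also weights each candidate by a per-category Bernoulli
             likelihood (an assumed stand-in for a regression model).
    Returns (pi, theta, w): category frequencies, per-category outcome
    probabilities (None when y is None), and posterior weights.
    """
    compat = np.asarray(compat, dtype=float)
    n, K = compat.shape
    pi = np.full(K, 1.0 / K)
    if y is not None:
        y = np.asarray(y, dtype=float)
        theta = np.full(K, y.mean())
    else:
        theta = None

    for _ in range(n_iter):
        # E-step: posterior weight of each compatible category.
        w = compat * pi
        if y is not None:
            # Likelihood of the observed outcome under each category.
            lik = np.where(y[:, None] == 1, theta, 1.0 - theta)
            w = w * lik
        w = w / w.sum(axis=1, keepdims=True)

        # M-step: update frequencies and, if used, outcome probabilities.
        pi = w.mean(axis=0)
        if y is not None:
            theta = (w * y[:, None]).sum(axis=0) / (w.sum(axis=0) + eps)
            theta = np.clip(theta, eps, 1.0 - eps)
    return pi, theta, w

# Hypothetical toy data: 6 subjects, 3 categories, 2 ambiguous calls.
compat = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],  # ambiguous between categories 0 and 1
    [0, 1, 1],  # ambiguous between categories 1 and 2
], dtype=bool)
y = np.array([1, 1, 0, 0, 1, 0])

pi_no, theta_no, w_no = em_ambiguous(compat)      # no-outcome EM
pi_out, theta_out, w_out = em_ambiguous(compat, y)  # outcome-based EM

print("no-outcome weights for subject 4:", w_no[4])
print("outcome-based weights for subject 4:", w_out[4])
```

In this toy setting the outcome pulls the ambiguous subject toward the category whose unambiguous carriers share its outcome. The same mechanism illustrates why the abstract urges care: the reconstruction now depends on the outcome model, so a misspecified or overfitted outcome model can distort the imputed diplotypes.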

Source journal
Statistical Methods in Medical Research (Medicine: Mathematical & Computational Biology)
CiteScore: 4.10
Self-citation rate: 4.30%
Articles published: 127
Review time: >12 weeks
Journal description: Statistical Methods in Medical Research is a peer-reviewed scholarly journal and is the leading vehicle for articles in all the main areas of medical statistics and an essential reference for all medical statisticians. This unique journal is devoted solely to statistics and medicine and aims to keep professionals abreast of the many powerful statistical techniques now available to the medical profession. This journal is a member of the Committee on Publication Ethics (COPE).