High-dimensional, outcome-dependent missing data problems: Models for the human KIR loci.

IF 1.9 | Zone 3, Medicine | Q3 HEALTH CARE SCIENCES & SERVICES | Statistical Methods in Medical Research | Pub Date: 2025-03-01 | Epub Date: 2025-01-31 | DOI: 10.1177/09622802241304112

Lars Leonardus Joannes van der Burg, Hein Putter, Henning Baldauf, Jürgen Sauter, Johannes Schetelig, Liesbeth C de Wreede, Stefan Böhringer

Pages: 440-456. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11951372/pdf/


Abstract

Missing data problems are common in high-dimensional biological data, where values can be partially or completely missing. Imputation and expectation-maximization algorithms have been developed to reconstruct the missing values. For missing data problems, it has been suggested that the regression model of interest should be incorporated into the imputation procedure to reduce bias in the regression coefficients. Here we consider a challenging missing data problem in which diplotypes of the KIR loci are to be reconstructed. These loci are difficult to genotype, resulting in ambiguous genotype calls. We extend a previously proposed expectation-maximization algorithm by incorporating a potentially high-dimensional regression model for the outcome. Three strategies are evaluated: (1) only allelic predictors, (2) allelic predictors and forward-backward selection on haplotype predictors, and (3) penalized regression on a saturated model. In a simulation study, we compared these strategies with a baseline expectation-maximization algorithm without an outcome model. For extreme choices of effect sizes and missingness levels, the outcome-based expectation-maximization algorithms outperformed the no-outcome expectation-maximization algorithm. However, in all other cases, the no-outcome expectation-maximization algorithm performed better than or comparably to the three strategies, suggesting that the outcome model can have a harmful effect. In a data analysis concerning death after allogeneic hematopoietic stem cell transplantation as a function of donor KIR genes, expectation-maximization algorithms with and without an outcome model showed very similar results. In conclusion, outcome-based missing data models in the high-dimensional setting have to be used with care and are likely to lead to biased results.
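To make the general idea concrete, the following is a minimal sketch of an expectation-maximization loop over ambiguous diplotype calls. It is not the authors' algorithm. It assumes Hardy-Weinberg proportions, assumes each individual comes with a list of compatible unordered diplotypes (each listed once), and takes an optional, hypothetical `outcome_lik` hook that returns a strictly positive P(outcome | diplotype) from an outcome model held fixed. The paper's methods refit a regression model inside the algorithm, which this sketch does not attempt. All names (`em_haplotype_frequencies`, `candidates`, `outcome_lik`) are illustrative.

```python
from collections import defaultdict


def em_haplotype_frequencies(candidates, n_iter=200, tol=1e-10, outcome_lik=None):
    """Estimate haplotype frequencies from ambiguous diplotype calls.

    candidates: one list per individual of compatible diplotypes (h1, h2).
    outcome_lik: optional hypothetical hook (index, diplotype) -> positive
        likelihood of that individual's outcome given the diplotype.
    """
    haplotypes = sorted({h for cands in candidates for d in cands for h in d})
    # Start from uniform haplotype frequencies.
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: posterior weight of each compatible diplotype.
        for i, cands in enumerate(candidates):
            weights = []
            for h1, h2 in cands:
                # Hardy-Weinberg prior; heterozygotes count twice.
                w = freq[h1] * freq[h2] * (1.0 if h1 == h2 else 2.0)
                if outcome_lik is not None:
                    w *= outcome_lik(i, (h1, h2))
                weights.append(w)
            total = sum(weights)
            for (h1, h2), w in zip(cands, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts over 2n chromosomes.
        new_freq = {h: counts[h] / (2 * len(candidates)) for h in haplotypes}
        delta = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if delta < tol:
            break
    return freq


# Toy data: the fourth individual's call is ambiguous between A/A and B/B.
calls = [[("A", "A")], [("A", "A")], [("B", "B")], [("A", "A"), ("B", "B")]]
baseline = em_haplotype_frequencies(calls)
# Hypothetical fixed outcome model that favors the B/B diplotype.
with_outcome = em_haplotype_frequencies(
    calls, outcome_lik=lambda i, d: 0.9 if d == ("B", "B") else 0.1
)
```

In this toy example the outcome term pulls the ambiguous individual toward B/B, so the estimated frequency of haplotype A is lower than in the baseline run. This is the mechanism by which an outcome model can either help or, if it is misspecified or overfitted, distort the reconstruction.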


Source journal

Statistical Methods in Medical Research (Medicine: Mathematical & Computational Biology)

CiteScore: 4.10

Self-citation rate: 4.30%

Articles published: 127

Review time: >12 weeks

About the journal: Statistical Methods in Medical Research is a peer-reviewed scholarly journal and is the leading vehicle for articles in all the main areas of medical statistics and an essential reference for all medical statisticians. This unique journal is devoted solely to statistics and medicine and aims to keep professionals abreast of the many powerful statistical techniques now available to the medical profession. This journal is a member of the Committee on Publication Ethics (COPE).