Haplotype reconstruction for genetically complex regions with ambiguous genotype calls: Illustration by the KIR gene region

Genetic Epidemiology (IF 1.7; Zone 4, Medicine; JCR Q3, Genetics & Heredity) Pub Date: 2023-10-13 DOI: 10.1002/gepi.22538
Lars L. J. van der Burg, Liesbeth C. de Wreede, Henning Baldauf, Jürgen Sauter, Johannes Schetelig, Hein Putter, Stefan Böhringer
Genetic Epidemiology, volume 48, issue 1, pages 3-26. Journal Article. Full text: https://onlinelibrary.wiley.com/doi/10.1002/gepi.22538

Abstract

Advances in DNA sequencing technologies have enabled genotyping of complex genetic regions exhibiting copy number variation and high allelic diversity, yet it is impossible to derive exact genotypes in all cases, often resulting in ambiguous genotype calls, that is, partially missing data. An example of such a gene region is the killer-cell immunoglobulin-like receptor (KIR) genes. These genes are of special interest in the context of allogeneic hematopoietic stem cell transplantation. For such complex gene regions, current haplotype reconstruction methods are not feasible as they cannot cope with the complexity of the data. We present an expectation–maximization (EM) algorithm to estimate haplotype frequencies (HTFs) that handles the missing data components and takes into account linkage disequilibrium (LD) between genes. To cope with the exponential increase in the number of haplotypes as genes are added, we add three components to a standard EM-algorithm implementation. First, reconstruction is performed iteratively, adding one gene at a time. Second, after each step, haplotypes with frequencies below a threshold are collapsed into a rare haplotype group. Third, the HTF of the rare haplotype group is profiled in subsequent iterations to improve estimates. A simulation study evaluates the effect of combining information from multiple genes on the estimates of these frequencies. We show that estimated HTFs are approximately unbiased. Our simulation study shows that the EM algorithm is able to combine information from multiple genes when LD is high, whereas increased ambiguity levels increase bias. Linear regression models based on this EM algorithm show that a large number of haplotypes can be problematic for unbiased effect size estimation and that models need to be sparse. In a real data analysis of KIR genotypes, we compare HTFs to those obtained in an independent study.
Our new EM-algorithm-based method is the first to account for the full genetic architecture of complex gene regions, such as the KIR gene region. This algorithm can handle the numerous observed ambiguities, and allows for the collapsing of haplotypes to perform implicit dimension reduction. Combining information from multiple genes improves haplotype reconstruction.
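
To make the estimation idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes Hardy-Weinberg equilibrium and assumes that each ambiguous genotype call is given as a list of distinct, unordered haplotype pairs compatible with that call. The names `em_haplotype_frequencies`, `collapse_rare`, and `RARE` are hypothetical. The gene-by-gene reconstruction and the profiling of the rare-group frequency described above are not shown.

```python
from collections import defaultdict

RARE = "RARE"  # hypothetical label for the collapsed rare-haplotype group


def em_haplotype_frequencies(candidates, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from ambiguous genotype calls.

    candidates: one entry per individual; each entry is a list of distinct,
    unordered haplotype pairs (h1, h2) compatible with that individual's call.
    Assumes Hardy-Weinberg equilibrium (an assumption of this sketch).
    """
    haplotypes = sorted({h for pairs in candidates for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    n_chromosomes = 2 * len(candidates)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: posterior weight of each compatible pair.
            # Heterozygous unordered pairs get a factor 2 under HWE.
            weights = [freq[a] * freq[b] * (1.0 if a == b else 2.0) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by number of chromosomes.
        new_freq = {h: counts[h] / n_chromosomes for h in haplotypes}
        delta = max(abs(new_freq[h] - freq[h]) for h in haplotypes)
        freq = new_freq
        if delta < tol:
            break
    return freq


def collapse_rare(candidates, freq, threshold):
    """Relabel haplotypes with frequency below `threshold` as RARE.

    Pairs that become identical after relabeling are merged, which is a
    simplification made for this sketch.
    """
    keep = {h for h, f in freq.items() if f >= threshold}

    def relabel(h):
        return h if h in keep else RARE

    collapsed = []
    for pairs in candidates:
        merged = {tuple(sorted((relabel(a), relabel(b)))) for a, b in pairs}
        collapsed.append(sorted(merged))
    return collapsed


# Toy data: three unambiguous calls and one phase-ambiguous call.
calls = [
    [("AB", "AB")],
    [("AB", "AB")],
    [("ab", "ab")],
    [("AB", "ab"), ("Ab", "aB")],  # two compatible phasings
]
freq = em_haplotype_frequencies(calls)
collapsed_calls = collapse_rare(calls, freq, threshold=0.01)
freq_collapsed = em_haplotype_frequencies(collapsed_calls)
```

In this toy data set the unambiguous individuals carry only `AB` and `ab`, so the EM iterations shift the weight of the ambiguous call toward the `AB`/`ab` phasing. This is the sense in which information from other individuals resolves ambiguity. Rerunning the EM on the collapsed calls mirrors the "collapse, then re-estimate" step only loosely.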

Source journal: Genetic Epidemiology (Medicine: Public, Environmental & Occupational Health)
CiteScore: 4.40
Self-citation rate: 9.50%
Articles published: 49
Review time: 6-12 weeks
Journal description: Genetic Epidemiology is a peer-reviewed journal for discussion of research on the genetic causes of the distribution of human traits in families and populations. Emphasis is placed on the relative contribution of genetic and environmental factors to human disease as revealed by genetic, epidemiological, and biologic investigations. Genetic Epidemiology primarily publishes papers in statistical genetics, a research field concerned chiefly with the development of statistical, bioinformatic, and computational models for analyzing genetic data. Incorporation of underlying biology and population genetics into conceptual models is favored. The journal seeks original articles comprising either applied research or innovative statistical, mathematical, computational, or genomic methodologies that advance studies in genetic epidemiology. Other types of reports are encouraged, such as letters to the editor, topic reviews, and perspectives from other fields of research that will likely enrich the field of genetic epidemiology.