Two-stage strategy using denoising autoencoders for robust reference-free genotype imputation with missing input genotypes

IF 2.6 | Zone 3 (Biology) | JCR Q2, Genetics & Heredity | Journal of Human Genetics | Pub Date: 2024-06-25 | DOI: 10.1038/s10038-024-01261-6
Kaname Kojima, Shu Tadaka, Yasunobu Okamura, Kengo Kinoshita

Abstract

Widely used genotype imputation methods are based on the Li and Stephens model, which assumes that new haplotypes can be represented by modifying existing haplotypes in a reference panel through mutations and recombinations. These methods take genotypes from SNP arrays as input, estimate haplotypes consistent with those genotypes by analyzing recombination patterns within a reference panel, and then infer unobserved variants. Because these methods require reference panels in an identifiable form, public use of the panels is limited by privacy and consent concerns. One strategy to overcome this limitation is to use de-identified haplotype information, such as summary statistics or model parameters. Advances in deep learning (DL) offer the potential to develop imputation methods that use haplotype information in a reference-free manner by handling it as model parameters, while maintaining imputation accuracy comparable to that of methods based on the Li and Stephens model. Here, we provide a brief introduction to DL-based reference-free genotype imputation methods, including RNN-IMP, developed by our research group. We then evaluate the performance of RNN-IMP against widely used Li and Stephens model-based imputation methods in terms of accuracy (R²), using the 1000 Genomes Project Phase 3 dataset and corresponding simulated Omni2.5 SNP genotype data. Because RNN-IMP is sensitive to missing values in input genotypes, we propose a two-stage imputation strategy: missing genotypes are first imputed using denoising autoencoders, and RNN-IMP then processes these imputed genotypes. This approach restores the imputation accuracy that is degraded by missing values, enhancing the practical use of RNN-IMP.
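The data flow of the two-stage strategy and the accuracy metric can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `MISSING` code, the function names, and the genotype encoding (0, 1, 2 alternate-allele counts in a samples-by-sites array) are all assumptions made for illustration. Stage 1 is stood in for by a simple per-site mean fill, whereas the paper uses trained denoising autoencoders. R² is computed here as the squared Pearson correlation between true genotypes and imputed dosages, a common convention in imputation studies that the abstract does not spell out.

```python
import numpy as np

MISSING = -1  # hypothetical code marking a missing genotype call


def mask_genotypes(genotypes, missing_rate, rng):
    """Randomly replace a fraction of calls with MISSING (masking noise)."""
    masked = genotypes.copy()
    drop = rng.random(genotypes.shape) < missing_rate
    masked[drop] = MISSING
    return masked


def stage1_fill(masked):
    """Placeholder for stage 1: fill each missing call with the per-site
    mean of the observed calls. A trained denoising autoencoder would
    take this role in the strategy described above."""
    observed = masked != MISSING
    counts = observed.sum(axis=0)
    sums = np.where(observed, masked, 0).sum(axis=0)
    # Sites with no observed calls fall back to 0.0.
    site_mean = np.divide(
        sums, counts, out=np.zeros(masked.shape[1]), where=counts > 0
    )
    filled = masked.astype(float)
    rows, cols = np.where(~observed)
    filled[rows, cols] = site_mean[cols]
    return filled


def r_squared(true_genotypes, imputed_dosages):
    """Squared Pearson correlation for one variant across samples."""
    t = np.asarray(true_genotypes, dtype=float)
    d = np.asarray(imputed_dosages, dtype=float)
    if t.std() == 0 or d.std() == 0:
        return float("nan")  # undefined when either vector is constant
    return float(np.corrcoef(t, d)[0, 1] ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    genotypes = rng.integers(0, 3, size=(8, 5))  # samples x sites
    masked = mask_genotypes(genotypes, 0.2, rng)
    filled = stage1_fill(masked)
    # `filled` would then be handed to the stage-2 imputation model.
    print(filled)
```

In this sketch the second stage never sees the `MISSING` code, only the completed array, which is the point of the two-stage design: the downstream model receives input in the same form it was trained on.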

Source journal
Journal of Human Genetics (Biology: Genetics)
CiteScore: 7.20
Self-citation rate: 0.00%
Articles published: 101
Review time: 4-8 weeks
Journal description: The Journal of Human Genetics is an international journal publishing articles on human genetics, including medical genetics and human genome analysis. It covers all aspects of human genetics, including molecular genetics, clinical genetics, behavioral genetics, immunogenetics, pharmacogenomics, population genetics, functional genomics, epigenetics, genetic counseling and gene therapy. Articles on the following areas are especially welcome: genetic factors of monogenic and complex disorders, genome-wide association studies, genetic epidemiology, cancer genetics, personal genomics, genotype-phenotype relationships and genome diversity.