FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313

Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida
Source: arXiv - QuanBio - Genomics
Published: 2024-07-12
DOI: arxiv-2407.09355 (https://doi.org/arxiv-2407.09355)

Abstract

Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, which require substantial computational resources to train and are inefficient to deploy on the client side. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-side imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. By working with consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights, allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores calculated using SNPs imputed from consumer gene panels such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and lightweight alternative to reference-based imputation.
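The comparison described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' pipeline: the toy LD model, the helper names (simulate, r2), the SNP counts, the score weights, and the use of squared Pearson correlation as R^2 are all assumptions made for illustration. The sketch fits an ordinary least-squares model from typed panel SNPs to untyped SNP dosages, then compares the resulting risk score against one computed after filling untyped SNPs with their mean training dosage.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, n_panel=30, n_untyped=15, ld=0.9):
    """Toy genotypes (hypothetical LD model): each untyped SNP allele copies
    one panel SNP allele with probability `ld`, else is drawn at random."""
    freqs = rng.uniform(0.1, 0.5, size=n_panel)
    hap_panel = rng.random((2, n, n_panel)) < freqs        # two haplotypes per person
    tag = rng.integers(0, n_panel, size=n_untyped)         # panel SNP tracked by each untyped SNP
    copy = rng.random((2, n, n_untyped)) < ld
    noise = rng.random((2, n, n_untyped)) < freqs[tag]
    hap_untyped = np.where(copy, hap_panel[:, :, tag], noise)
    # Genotype dosage (0, 1 or 2) is the sum over the two haplotypes.
    return hap_panel.sum(axis=0).astype(float), hap_untyped.sum(axis=0).astype(float)

def r2(truth, estimate):
    """Squared Pearson correlation between two score vectors."""
    return np.corrcoef(truth, estimate)[0, 1] ** 2

X, Y = simulate(3000)                      # X: typed panel SNPs, Y: untyped SNPs
X_tr, X_te = X[:2000], X[2000:]
Y_tr, Y_te = Y[:2000], Y[2000:]

# Linear regression imputation: one least-squares fit with an intercept,
# predicting all untyped dosages from the typed panel at once.
A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
A_te = np.column_stack([np.ones(len(X_te)), X_te])
coef, *_ = np.linalg.lstsq(A_tr, Y_tr, rcond=None)
Y_lr = np.clip(A_te @ coef, 0.0, 2.0)      # keep dosages in the valid range

# Simple imputation: fill every untyped SNP with its mean training dosage.
Y_mean = np.broadcast_to(Y_tr.mean(axis=0), Y_te.shape)

# Toy risk score: 5 directly typed SNPs plus the 15 untyped SNPs,
# with made-up positive per-allele weights.
w_typed = rng.uniform(0.05, 0.3, size=5)
w_untyped = rng.uniform(0.05, 0.3, size=Y.shape[1])

def score(untyped_dosages):
    return X_te[:, :5] @ w_typed + untyped_dosages @ w_untyped

prs_true = score(Y_te)
r2_lr = r2(prs_true, score(Y_lr))
r2_mean = r2(prs_true, score(Y_mean))
print(f"R^2 with linear regression imputation: {r2_lr:.2f}")
print(f"R^2 with mean-dosage imputation:       {r2_mean:.2f}")
```

Because the fitted model is just a coefficient matrix, applying it to new data is a single matrix product, which is the kind of computation that can run on a client device. The numbers printed here come from simulated data and say nothing about the PRS313 results reported in the abstract.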