FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313

arXiv - QuanBio - Genomics | Pub Date: 2024-07-12 | DOI: arxiv-2407.09355

Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida

Abstract

Genotype imputation enhances genetic data by predicting missing SNPs using reference haplotype information. Traditional methods leverage linkage disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity of LD structures between genotyped target sets and fully sequenced reference panels. Recently, reference-free deep learning-based methods have emerged, offering a promising alternative by predicting missing genotypes without external databases, thereby enhancing privacy and accessibility. However, these methods often produce models with tens of millions of parameters, which require substantial computational resources to train and are inefficient to deploy on the client side. Our study addresses these limitations by introducing a baseline for a novel genotype imputation pipeline that supports client-side imputation models generalizable across any genotyping chip and genomic region. This approach enhances patient privacy by performing imputation directly on edge devices. As a case study, we focus on PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk prediction. Utilizing consumer genetic panels such as 23andMe, our model democratizes access to personalized genetic insights by allowing 23andMe users to obtain their PRS313 score. We demonstrate that simple linear regression can significantly improve the accuracy of PRS313 scores when calculated using SNPs imputed from consumer gene panels, such as 23andMe. Our linear regression model achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with simple imputation (substituting missing SNPs with the minor allele frequency). These findings suggest that popular SNP analysis libraries could benefit from integrating linear regression models for genotype imputation, providing a viable and lightweight alternative to reference-based imputation.
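The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the helper names (`simulate_genotypes`, `fit_linear_imputer`, `impute`, `r2`) are hypothetical, the SNP weights are random rather than the PRS313 weights, and R^2 is assumed here to mean the coefficient of determination between the score computed from full genotypes and the score computed from partially observed genotypes. The resulting numbers are therefore not comparable to the 0.86, 0.33, and 0.28 reported above.

```python
import numpy as np

def simulate_genotypes(n, n_typed, n_untyped, rng):
    """Toy dosages (0/1/2). Each untyped SNP copies the alleles of a random
    typed SNP with 10% flip noise, a crude stand-in for LD (assumption)."""
    maf = rng.uniform(0.1, 0.5, size=n_typed)
    hap = rng.random((2, n, n_typed)) < maf           # two haplotypes per person
    parent = rng.integers(0, n_typed, size=n_untyped)  # typed SNP each untyped SNP tracks
    flip = rng.random((2, n, n_untyped)) < 0.1
    untyped_hap = np.logical_xor(hap[:, :, parent], flip)
    return hap.sum(axis=0).astype(float), untyped_hap.sum(axis=0).astype(float)

def fit_linear_imputer(typed, untyped):
    """Ordinary least squares with an intercept, one output column per untyped SNP."""
    design = np.column_stack([np.ones(len(typed)), typed])
    coef, *_ = np.linalg.lstsq(design, untyped, rcond=None)
    return coef

def impute(coef, typed):
    """Predict untyped dosages and clip them to the valid range [0, 2]."""
    design = np.column_stack([np.ones(len(typed)), typed])
    return np.clip(design @ coef, 0.0, 2.0)

def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
typed, untyped = simulate_genotypes(n=2000, n_typed=40, n_untyped=20, rng=rng)
train, test = slice(0, 1500), slice(1500, None)

# Hypothetical per-SNP risk weights (random, NOT the PRS313 weights).
w_typed = rng.normal(size=typed.shape[1])
w_untyped = rng.normal(size=untyped.shape[1])

# Reference score from complete genotypes on the held-out individuals.
true_prs = typed[test] @ w_typed + untyped[test] @ w_untyped

# 1) No imputation: untyped SNPs contribute nothing.
prs_none = typed[test] @ w_typed
# 2) Simple imputation: every untyped SNP gets its mean training dosage.
prs_mean = prs_none + untyped[train].mean(axis=0) @ w_untyped
# 3) Linear regression imputation from the typed SNPs.
coef = fit_linear_imputer(typed[train], untyped[train])
prs_lr = prs_none + impute(coef, typed[test]) @ w_untyped

scores = {
    "none": r2(true_prs, prs_none),
    "mean": r2(true_prs, prs_mean),
    "linear": r2(true_prs, prs_lr),
}
for name, value in scores.items():
    print(f"{name:>6}: R^2 = {value:.3f}")
```

The fitted model is a single coefficient matrix of shape (typed SNPs + 1) × (untyped SNPs), which is the property that makes a linear baseline attractive for client-side deployment compared with a network of tens of millions of parameters.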
