FastImpute: A Baseline for Open-source, Reference-Free Genotype Imputation Methods -- A Case Study in PRS313
Aaron Ge, Jeya Balasubramanian, Xueyao Wu, Peter Kraft, Jonas S. Almeida
arXiv - QuanBio - Genomics, published 2024-07-12. DOI: arxiv-2407.09355
Abstract
Genotype imputation enhances genetic data by predicting missing SNPs using
reference haplotype information. Traditional methods leverage linkage
disequilibrium (LD) to infer untyped SNP genotypes, relying on the similarity
of LD structures between genotyped target sets and fully sequenced reference
panels. Recently, reference-free deep learning-based methods have emerged,
offering a promising alternative by predicting missing genotypes without
external databases, thereby enhancing privacy and accessibility. However, these
methods often produce models with tens of millions of parameters, which demand
substantial computational resources to train and are inefficient to deploy on
the client side. Our study addresses these
limitations by introducing a baseline for a novel genotype imputation pipeline
that supports client-side imputation models generalizable across any
genotyping chip and genomic region. This approach enhances patient privacy by
performing imputation directly on edge devices. As a case study, we focus on
PRS313, a polygenic risk score comprising 313 SNPs used for breast cancer risk
prediction. Utilizing consumer genetic panels such as 23andMe, our model
democratizes access to personalized genetic insights by allowing 23andMe users
to obtain their PRS313 score. We demonstrate that simple linear regression can
significantly improve the accuracy of PRS313 scores when calculated using SNPs
imputed from consumer gene panels, such as 23andMe. Our linear regression model
achieved an R^2 of 0.86, compared to 0.33 without imputation and 0.28 with
simple imputation (substituting missing SNPs with the minor allele frequency).
These findings suggest that popular SNP analysis libraries could benefit from
integrating linear regression models for genotype imputation, providing a
viable and lightweight alternative to reference-based imputation.
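
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic, the SNP counts, effect sizes, and noise rate are hypothetical, "simple imputation" is assumed to mean filling each missing SNP with its mean training dosage (twice the allele frequency), and R^2 is assumed to be the coefficient of determination between the true and the estimated score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT real genotypes): dosages coded 0/1/2.
n_samples, n_typed, n_untyped = 1000, 20, 10
maf = rng.uniform(0.1, 0.5, size=n_typed)
typed = rng.binomial(2, maf, size=(n_samples, n_typed)).astype(float)

# Each "untyped" SNP copies one typed SNP, with 10% of entries redrawn,
# as a crude proxy for linkage disequilibrium (an assumption of this sketch).
parent = rng.integers(0, n_typed, size=n_untyped)
untyped = typed[:, parent].copy()
redraw = rng.random(untyped.shape) < 0.1
noise = rng.binomial(2, maf[parent], size=untyped.shape)
untyped[redraw] = noise[redraw]

# Hypothetical effect sizes for the risk score.
weights_typed = rng.normal(size=n_typed)
weights_untyped = rng.normal(size=n_untyped)

train, test = slice(0, 800), slice(800, None)


def add_intercept(x):
    """Prepend a column of ones so the regression can learn an offset."""
    return np.hstack([np.ones((x.shape[0], 1)), x])


# One least-squares fit predicts every untyped SNP from the typed SNPs.
coef, *_ = np.linalg.lstsq(add_intercept(typed[train]), untyped[train], rcond=None)
imputed_lr = np.clip(add_intercept(typed[test]) @ coef, 0.0, 2.0)

# Baseline: fill each untyped SNP with its mean training dosage.
imputed_mean = np.tile(untyped[train].mean(axis=0), (imputed_lr.shape[0], 1))


def prs(untyped_dosage):
    """Weighted sum of dosages over typed and (imputed) untyped SNPs."""
    return typed[test] @ weights_typed + untyped_dosage @ weights_untyped


def r_squared(y_true, y_pred):
    """Coefficient of determination of y_pred against y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


prs_true = prs(untyped[test])
r2_lr = r_squared(prs_true, prs(imputed_lr))
r2_mean = r_squared(prs_true, prs(imputed_mean))
r2_none = r_squared(prs_true, prs(np.zeros_like(imputed_lr)))  # missing SNPs dropped

print(f"linear regression: {r2_lr:.3f}")
print(f"mean dosage fill:  {r2_mean:.3f}")
print(f"no imputation:     {r2_none:.3f}")
```

The printed values describe only this synthetic setup and should not be read against the figures reported above. The fitted model is a single coefficient matrix of shape (typed SNPs + 1) by (untyped SNPs), which is what makes this kind of imputation small enough to ship to a browser or another edge device.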