AoUPRS: A Cost-Effective and Versatile PRS Calculator for the All of Us Program
Ahmed Khattab, Shang-Fu Chen, Nathan Wineinger, Ali Torkamani
bioRxiv, published 2024-07-16. DOI: 10.1101/2024.07.11.603165 (https://doi.org/10.1101/2024.07.11.603165)
Abstract
Background: The All of Us (AoU) Research Program provides a comprehensive genomic dataset to accelerate health research and medical breakthroughs. Despite its potential, researchers face significant challenges, including high costs and inefficiencies associated with data extraction and analysis. AoUPRS addresses these challenges by offering a versatile and cost-effective tool for calculating polygenic risk scores (PRS), enabling both experienced and novice researchers to leverage the AoU dataset for significant genomic discoveries.

Results: AoUPRS is implemented in Python and utilizes the Hail framework for genomic data analysis. It offers two distinct approaches for PRS calculation: the Hail MatrixTable (MT) and the Hail Variant Dataset (VDS). The MT approach provides a dense representation of genotype data, while the VDS approach offers a sparse representation, significantly reducing computational costs. In performance evaluations, the VDS approach demonstrated a cost reduction of up to 99.51% for smaller scores and 85% for larger scores compared to the MT approach. Both approaches yielded similar predictive power, as shown by logistic regression analyses of PRS for coronary artery disease, atrial fibrillation, and type 2 diabetes. The empirical cumulative distribution functions (ECDFs) for PRS values further confirmed the consistency between the two methods.

Conclusions: AoUPRS is a versatile and cost-effective tool that addresses the high costs and inefficiencies associated with PRS calculations using the AoU dataset. By offering both dense and sparse data processing approaches, AoUPRS allows researchers to choose the approach best suited to their needs, facilitating genomic discoveries. The tool’s open-source availability on GitHub, coupled with detailed documentation and tutorials, ensures accessibility and ease of use for the scientific community.
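
The contrast between the dense and sparse representations can be illustrated with a small, self-contained Python sketch. This is not the AoUPRS code and does not use Hail; the variant IDs, weights, and function names are hypothetical, and it assumes the effect allele is always the alternate allele, so that any call missing from the sparse layout is homozygous reference and contributes nothing to the score. Under that assumption, summing weight times dosage over only the stored non-reference calls gives the same PRS as summing over the full dense matrix.

```python
from typing import Dict, List, Tuple

# Hypothetical effect weights (assumption: effect allele == alternate allele).
WEIGHTS: Dict[str, float] = {"rs1": 0.12, "rs2": -0.05, "rs3": 0.30}

# Dense layout: one alternate-allele dosage (0, 1, or 2) per variant per sample.
DENSE: Dict[str, List[int]] = {
    "rs1": [0, 1, 2, 0],
    "rs2": [0, 0, 1, 0],
    "rs3": [2, 0, 0, 0],
    "rs9": [1, 1, 0, 0],  # not part of the score, so both functions skip it
}
N_SAMPLES = 4


def to_sparse(dense: Dict[str, List[int]]) -> List[Tuple[str, int, int]]:
    """Keep only non-reference calls as (variant, sample_index, dosage)."""
    return [
        (variant, sample, dosage)
        for variant, dosages in dense.items()
        for sample, dosage in enumerate(dosages)
        if dosage > 0
    ]


def prs_dense(dense: Dict[str, List[int]], weights: Dict[str, float],
              n_samples: int) -> List[float]:
    """Sum weight * dosage over every stored genotype, including zeros."""
    scores = [0.0] * n_samples
    for variant, dosages in dense.items():
        weight = weights.get(variant)
        if weight is None:
            continue  # variant is not in the score file
        for sample, dosage in enumerate(dosages):
            scores[sample] += weight * dosage
    return scores


def prs_sparse(calls: List[Tuple[str, int, int]], weights: Dict[str, float],
               n_samples: int) -> List[float]:
    """Sum weight * dosage over non-reference calls only."""
    scores = [0.0] * n_samples
    for variant, sample, dosage in calls:
        weight = weights.get(variant)
        if weight is not None:
            scores[sample] += weight * dosage
    return scores


SPARSE = to_sparse(DENSE)
dense_scores = prs_dense(DENSE, WEIGHTS, N_SAMPLES)
sparse_scores = prs_sparse(SPARSE, WEIGHTS, N_SAMPLES)
print(f"dense entries: {sum(len(d) for d in DENSE.values())}, sparse entries: {len(SPARSE)}")
print("dense PRS: ", dense_scores)
print("sparse PRS:", sparse_scores)
```

In this toy example the sparse layout stores fewer entries than the dense one while producing the same scores. If a score's effect allele were the reference allele instead, the missing calls would carry two copies of it and would need explicit handling, which this sketch deliberately leaves out.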