SparkBeagle:云中的分布式全基因组参考面板的可扩展基因型插入

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics Pub Date : 2020-09-21 DOI:10.1145/3388440.3414860

Altti Ilari Maarala, K. Pärn, J. Nuñez-Fontarnau, Keijo Heljanko

{"title":"SparkBeagle:云中的分布式全基因组参考面板的可扩展基因型插入","authors":"Altti Ilari Maarala, K. Pärn, J. Nuñez-Fontarnau, Keijo Heljanko","doi":"10.1145/3388440.3414860","DOIUrl":null,"url":null,"abstract":"Massive whole-genome genotype reference panels now provide accurate and fast genotyping by imputation for high-resolution genome-wide association (GWA) studies. Imputation-assisted genotyping can increase the genomic coverage of genotypes and thus satisfy the resolution required in comprehensive GWA studies in a cost-effective manner. However, the imputation of missing genotypes from large reference panels is a compute-intensive process that requires high-performance computing (HPC). Although HPC uses extremely distributed and parallel computing, current imputation tools, and existing algorithms have not been developed to fully exploit the power of distributed computing. To this end, we have developed SparkBeagle, a scalable, fast, and accurate distributed genotype imputation tool based on popular Beagle software. SparkBeagle is designed for HPC and cloud computing environments and it is implemented on top of the Apache Spark distributed computing framework. We have carried out scalability experiments by imputing 64,976,316 variants of 2504 samples from the 1000 Genomes reference panel in the cloud. SparkBeagle shows near-linear scalability while increasing the number of computing nodes. A speedup of 30x was achieved with 40 nodes. The imputation time of the whole data set decreased from 565 minutes to 18 minutes compared to a single node parallel execution. Near identical imputation accuracy was measured in the concordance analysis between the original Beagle and the distributed SparkBeagle tool.","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"SparkBeagle: Scalable Genotype Imputation from Distributed Whole-Genome Reference Panels in the Cloud\",\"authors\":\"Altti Ilari Maarala, K. Pärn, J. Nuñez-Fontarnau, Keijo Heljanko\",\"doi\":\"10.1145/3388440.3414860\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Massive whole-genome genotype reference panels now provide accurate and fast genotyping by imputation for high-resolution genome-wide association (GWA) studies. Imputation-assisted genotyping can increase the genomic coverage of genotypes and thus satisfy the resolution required in comprehensive GWA studies in a cost-effective manner. However, the imputation of missing genotypes from large reference panels is a compute-intensive process that requires high-performance computing (HPC). Although HPC uses extremely distributed and parallel computing, current imputation tools, and existing algorithms have not been developed to fully exploit the power of distributed computing. To this end, we have developed SparkBeagle, a scalable, fast, and accurate distributed genotype imputation tool based on popular Beagle software. SparkBeagle is designed for HPC and cloud computing environments and it is implemented on top of the Apache Spark distributed computing framework. We have carried out scalability experiments by imputing 64,976,316 variants of 2504 samples from the 1000 Genomes reference panel in the cloud. SparkBeagle shows near-linear scalability while increasing the number of computing nodes. A speedup of 30x was achieved with 40 nodes. The imputation time of the whole data set decreased from 565 minutes to 18 minutes compared to a single node parallel execution. Near identical imputation accuracy was measured in the concordance analysis between the original Beagle and the distributed SparkBeagle tool.\",\"PeriodicalId\":411338,\"journal\":{\"name\":\"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3388440.3414860\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3388440.3414860","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

大量的全基因组基因型参考面板现在为高分辨率全基因组关联(GWA)研究提供了准确和快速的基因分型。植入辅助基因分型可以增加基因型的基因组覆盖率，从而以经济有效的方式满足GWA综合研究所需的分辨率。然而，从大型参考面板中插入缺失基因型是一个需要高性能计算(HPC)的计算密集型过程。虽然HPC使用了非常分布式和并行的计算，但目前的计算工具和现有的算法还没有发展到充分利用分布式计算的能力。为此，我们开发了SparkBeagle，一个基于流行的Beagle软件的可扩展，快速，准确的分布式基因型插入工具。SparkBeagle是为HPC和云计算环境设计的，它是在Apache Spark分布式计算框架之上实现的。我们通过在云中输入来自1000个基因组参考面板的2504个样本的64,976,316个变体，进行了可扩展性实验。SparkBeagle在增加计算节点数量的同时显示出近似线性的可扩展性。使用40个节点实现了30倍的加速。与单节点并行执行相比，整个数据集的插入时间从565分钟减少到18分钟。在原始Beagle和分布式SparkBeagle工具之间的一致性分析中，测量了几乎相同的插入精度。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

SparkBeagle: Scalable Genotype Imputation from Distributed Whole-Genome Reference Panels in the Cloud

Massive whole-genome genotype reference panels now provide accurate and fast genotyping by imputation for high-resolution genome-wide association (GWA) studies. Imputation-assisted genotyping can increase the genomic coverage of genotypes and thus satisfy the resolution required in comprehensive GWA studies in a cost-effective manner. However, the imputation of missing genotypes from large reference panels is a compute-intensive process that requires high-performance computing (HPC). Although HPC uses extremely distributed and parallel computing, current imputation tools, and existing algorithms have not been developed to fully exploit the power of distributed computing. To this end, we have developed SparkBeagle, a scalable, fast, and accurate distributed genotype imputation tool based on popular Beagle software. SparkBeagle is designed for HPC and cloud computing environments and it is implemented on top of the Apache Spark distributed computing framework. We have carried out scalability experiments by imputing 64,976,316 variants of 2504 samples from the 1000 Genomes reference panel in the cloud. SparkBeagle shows near-linear scalability while increasing the number of computing nodes. A speedup of 30x was achieved with 40 nodes. The imputation time of the whole data set decreased from 565 minutes to 18 minutes compared to a single node parallel execution. Near identical imputation accuracy was measured in the concordance analysis between the original Beagle and the distributed SparkBeagle tool.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

自引率

0.00%

发文量

期刊最新文献

RA2Vec CanMod From Interatomic Distances to Protein Tertiary Structures with a Deep Convolutional Neural Network Prediction of Large for Gestational Age Infants in Overweight and Obese Women at Approximately 20 Gestational Weeks Using Patient Information for the Prediction of Caregiver Burden in Amyotrophic Lateral Sclerosis