Bayesian Phylogenetic Analysis on Multi-Core Compute Architectures: Implementation and Evaluation of BEAGLE in RevBayes With MPI.

IF 5.7 · Region 1, Biology · Q1 Evolutionary Biology · Systematic Biology · Pub Date: 2024-07-27 · DOI: 10.1093/sysbio/syae005

Killian Smith, Daniel Ayres, René Neumaier, Gert Wörheide, Sebastian Höhna

Systematic Biology, pages 455-469. Published 2024-07-27. https://doi.org/10.1093/sysbio/syae005

Citations: 0

Abstract

Phylogenies are central to many research areas in biology and are commonly estimated using likelihood-based methods. Unfortunately, any likelihood-based method, including Bayesian inference, can be prohibitively slow for large datasets (with many taxa and/or many sites in the sequence alignment) or complex substitution models. The primary limiting factor when using large datasets and/or complex models in probabilistic phylogenetic analyses is the likelihood calculation, which dominates the total computation time. To address this bottleneck, we incorporated the high-performance phylogenetic library BEAGLE into RevBayes, which enables multi-threading on multi-core CPUs and GPUs, as well as hardware-specific vectorized instructions for faster likelihood calculations. Our new implementation of RevBayes+BEAGLE retains the flexibility and dynamic nature that users expect from vanilla RevBayes. In addition, we implemented native parallelization within RevBayes, without an external library, using the message passing interface (MPI): RevBayes+MPI. We evaluated our new implementation of RevBayes+BEAGLE using multi-threading on CPUs and two different powerful GPUs (NVIDIA Titan V and NVIDIA A100) against our native implementation of RevBayes+MPI. We found good speedups when multiple cores were used, with up to 20-fold speedup when using multiple CPU cores and over 90-fold speedup when using multiple GPU cores. The improvement depended on the data type used (DNA or amino acids) and the size of the alignment, but less on the size of the tree. We additionally investigated the cost of rescaling partial likelihoods to avoid numerical underflow and showed that unnecessarily frequent and inefficient rescaling can increase runtimes up to 4-fold. Finally, we presented and compared a new approach that stores partial likelihoods on branches instead of nodes, which can speed up computations by up to 1.7 times but doubles the memory requirements.
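To illustrate why partial likelihoods need rescaling at all, the following is a minimal sketch in plain Python. It is not the RevBayes or BEAGLE implementation: the function names, the Jukes-Cantor model, the single-site setup, and the caterpillar (ladder) tree with equal branch lengths are all assumptions chosen for brevity. It rescales at every internal node, which is the simplest scheme and, as the abstract notes, not necessarily the most efficient one.

```python
import math

STATES = "ACGT"

def one_hot(base):
    """Tip partial likelihoods: 1.0 for the observed base, 0.0 otherwise."""
    return [1.0 if s == base else 0.0 for s in STATES]

def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def combine(left, right, t_left, t_right):
    """Partial likelihoods at a parent node from its two children (one site)."""
    pl, pr = jc69_matrix(t_left), jc69_matrix(t_right)
    return [
        sum(pl[s][x] * left[x] for x in range(4))
        * sum(pr[s][y] * right[y] for y in range(4))
        for s in range(4)
    ]

def site_log_likelihood(tips, t, rescale):
    """Log-likelihood of one site on a caterpillar tree, all branches of length t.

    With rescale=True, the partial vector is divided by its largest entry at
    every internal node and the logarithm of that factor is accumulated.
    """
    partial = tips[0]
    log_scale = 0.0
    for tip in tips[1:]:
        partial = combine(partial, tip, t, t)
        if rescale:
            m = max(partial)          # strictly positive under JC69
            partial = [p / m for p in partial]
            log_scale += math.log(m)
    root = sum(0.25 * p for p in partial)  # uniform stationary frequencies
    log_root = math.log(root) if root > 0.0 else float("-inf")
    return log_root + log_scale

if __name__ == "__main__":
    many_tips = [one_hot(STATES[i % 4]) for i in range(2000)]
    print("without rescaling:", site_log_likelihood(many_tips, 1.0, rescale=False))
    print("with rescaling:   ", site_log_likelihood(many_tips, 1.0, rescale=True))
```

On a small tree both variants agree, whereas on a tree with thousands of tips the unscaled product falls below the smallest representable double and the log-likelihood is lost. The extra `max`, division, and `log` per node are the overhead that grows when rescaling is done more often than necessary.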




Source journal: Systematic Biology (Biology: Evolutionary Biology)
CiteScore: 13.00
Self-citation rate: 7.70%
Review time: 6-12 weeks

Journal description: Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.