Evaluation of a Minimally Synchronous Algorithm for 2:1 Octree Balance

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. Pub Date: 2020-11-01. DOI: 10.1109/SC41405.2020.00027

Hansol Suh, T. Isaac


Citations: 0

Abstract

The p4est library implements octree-based adaptive mesh refinement (AMR) and has demonstrated parallel scalability beyond 100,000 MPI processes in previous weak scaling studies. This work focuses on the strong scalability of mesh adaptivity in p4est, where the communication pattern of the existing 2:1-balance is a latency bottleneck. The sorting-based algorithm of Malhotra and Biros has balanced communication, but synchronizes all processes. We propose an algorithm that combines sorting and neighbor-to-neighbor exchange to minimize the number of processes each process synchronizes with. We measure the performance of these algorithms on several test problems on Stampede2 at TACC. Both the parallel-sorting and minimally-synchronous algorithms significantly outperform the existing algorithm and have nearly identical performance out to 1,024 Xeon Phi KNL nodes, meaning the asymptotic advantage of the minimally-synchronous algorithm does not translate to improved performance at this scale. We conclude by showing that global metadata communication will limit future strong scaling.
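For readers unfamiliar with the term, 2:1 balance requires that adjacent leaves of the refinement tree differ by at most one level. The sketch below illustrates that condition on a serial, one-dimensional analogue (a binary tree over the unit interval). It is not the paper's algorithm and not the p4est API: the leaf encoding `(level, index)` and the helper names `start`, `is_balanced`, and `balance` are assumptions made for illustration only.

```python
# Toy 1D analogue of 2:1 balance (illustrative assumption, not p4est code).
# A leaf (level, index) covers [index / 2**level, (index + 1) / 2**level).

def start(leaf):
    """Left endpoint of the interval covered by a leaf."""
    level, index = leaf
    return index / (1 << level)

def is_balanced(leaves):
    """True if every pair of adjacent leaves differs by at most one level."""
    ordered = sorted(leaves, key=start)
    return all(abs(a[0] - b[0]) <= 1 for a, b in zip(ordered, ordered[1:]))

def balance(leaves):
    """Split overly coarse leaves until the 2:1 condition holds."""
    ordered = sorted(leaves, key=start)
    i = 0
    while i + 1 < len(ordered):
        (la, ia), (lb, ib) = ordered[i], ordered[i + 1]
        if la < lb - 1:
            # Left leaf is too coarse: replace it with its two children.
            ordered[i:i + 1] = [(la + 1, 2 * ia), (la + 1, 2 * ia + 1)]
            # The split may unbalance the previous pair, so step back.
            i = max(i - 1, 0)
        elif lb < la - 1:
            # Right leaf is too coarse: split it and recheck the same pair.
            ordered[i + 1:i + 2] = [(lb + 1, 2 * ib), (lb + 1, 2 * ib + 1)]
        else:
            i += 1
    return ordered

if __name__ == "__main__":
    mesh = [(1, 0), (3, 4), (3, 5), (2, 3)]  # level 1 next to level 3
    print(is_balanced(mesh))
    print(balance(mesh))
```

In this toy setting the only work is local splitting. The algorithms compared in the abstract address the distributed case, where leaves are partitioned across MPI processes and the cost is dominated by how many processes must communicate and synchronize.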
