Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units

2015 44th International Conference on Parallel Processing
Pub Date: 2015-09-01
DOI: 10.1109/ICPP.2015.15

D. Beckingsale, W. Gaudin, Andy Herdman, S. Jarvis


Citations: 19

Abstract

Block-structured adaptive mesh refinement (AMR) is a technique that can be used when solving partial differential equations to reduce the number of cells necessary to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Despite the potential for large savings in computational requirements and memory usage without a corresponding reduction in accuracy, AMR adds overhead in managing the mesh hierarchy, adding complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including: the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8-node cluster and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation, and is scalable on 4,196 K20x GPUs using a combination of MPI and CUDA.
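
The abstract does not show the coarsen and refine operators themselves, so the following is only a minimal sketch of what such operators commonly look like, not the authors' implementation. It assumes a 2D cell-centred field on a uniform patch, a refinement ratio of 2, simple averaging for coarsening, and piecewise-constant injection for refinement. The function names `coarsen` and `refine` are hypothetical, and the code is sequential Python rather than CUDA. Each output cell depends only on a fixed set of input cells, which is the property that would let a GPU version assign one thread per output cell.

```python
def coarsen(fine, ratio=2):
    """Average each ratio x ratio block of fine cells into one coarse cell.

    Assumes uniform cell volumes, so a plain mean conserves the field total
    once it is weighted by cell volume. Illustrative only.
    """
    ny, nx = len(fine), len(fine[0])
    assert ny % ratio == 0 and nx % ratio == 0, "patch must divide evenly"
    cells_per_block = ratio * ratio
    return [
        [
            sum(
                fine[j * ratio + dj][i * ratio + di]
                for dj in range(ratio)
                for di in range(ratio)
            ) / cells_per_block
            for i in range(nx // ratio)
        ]
        for j in range(ny // ratio)
    ]


def refine(coarse, ratio=2):
    """Piecewise-constant injection: copy each coarse value to its fine cells."""
    ny, nx = len(coarse) * ratio, len(coarse[0]) * ratio
    return [
        [coarse[j // ratio][i // ratio] for i in range(nx)]
        for j in range(ny)
    ]


if __name__ == "__main__":
    fine_patch = [
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0],
    ]
    coarse_patch = coarsen(fine_patch)
    print(coarse_patch)
    print(refine(coarse_patch))
```

Under these assumptions, coarsening a refined patch returns the original coarse values, which is a useful sanity check for any pair of transfer operators. Production AMR codes typically use higher-order, conservative interpolation for refinement; injection is used here only to keep the sketch short.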



