Resident Block-Structured Adaptive Mesh Refinement on Thousands of Graphics Processing Units

D. Beckingsale, W. Gaudin, Andy Herdman, S. Jarvis
Published in: 2015 44th International Conference on Parallel Processing, 2015-09-01. DOI: https://doi.org/10.1109/ICPP.2015.15
Citations: 19

Abstract

Block-structured adaptive mesh refinement (AMR) is a technique used when solving partial differential equations to reduce the number of cells needed to achieve the required accuracy in areas of interest. These areas (shock fronts, material interfaces, etc.) are recursively covered with finer mesh patches that are grouped into a hierarchy of refinement levels. Although AMR can greatly reduce computational requirements and memory usage without a corresponding reduction in accuracy, managing the mesh hierarchy adds overhead and introduces complex communication and data movement requirements to a simulation. In this paper, we describe the design and implementation of a resident GPU-based AMR library, including the classes used to manage data on a mesh patch, the routines used for transferring data between GPUs on different nodes, and the data-parallel operators developed to coarsen and refine mesh data. We validate the performance and accuracy of our implementation using three test problems and two architectures: an 8-node cluster and 4,196 nodes of Oak Ridge National Laboratory's Titan supercomputer. Our GPU-based AMR hydrodynamics code performs up to 4.87× faster than the CPU-based implementation and scales on 4,196 K20x GPUs using a combination of MPI and CUDA.
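To illustrate what the coarsen and refine operators mentioned in the abstract do, the following is a minimal sketch in plain Python, not the paper's GPU implementation. It assumes a 2D cell-centred field stored as a list of rows, a refinement ratio of 2, simple averaging for coarsening, and piecewise-constant injection for refinement. The function names `coarsen` and `refine` and all of these choices are illustrative assumptions; the abstract does not specify the actual operators.

```python
def coarsen(fine, ratio=2):
    """Average each ratio x ratio block of fine cells into one coarse cell.

    Assumption: equal-volume, cell-centred data, so a plain mean
    conserves the integral of the field over the patch.
    """
    ny, nx = len(fine), len(fine[0])
    if ny % ratio or nx % ratio:
        raise ValueError("fine patch size must be a multiple of the ratio")
    coarse = []
    for j in range(ny // ratio):
        row = []
        for i in range(nx // ratio):
            # Each coarse cell depends only on its own block of fine cells,
            # so every (j, i) could be handled by an independent GPU thread.
            total = sum(
                fine[j * ratio + dj][i * ratio + di]
                for dj in range(ratio)
                for di in range(ratio)
            )
            row.append(total / (ratio * ratio))
        coarse.append(row)
    return coarse


def refine(coarse, ratio=2):
    """Copy each coarse value into the ratio x ratio fine cells it covers."""
    ny, nx = len(coarse) * ratio, len(coarse[0]) * ratio
    return [[coarse[j // ratio][i // ratio] for i in range(nx)] for j in range(ny)]


fine_patch = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
]
coarse_patch = coarsen(fine_patch)   # [[2.0, 6.0]]
refined_patch = refine(coarse_patch)  # [[2.0, 2.0, 6.0, 6.0], [2.0, 2.0, 6.0, 6.0]]
```

Because every output cell is computed independently of the others, both loops are data-parallel in the sense the abstract describes; in a resident GPU code they would run as kernels on data already held in device memory, avoiding copies back to the host.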