Extreme-Scale Realistic Stencil Computations on Sunway TaihuLight with Ten Million Cores

Ying Cai, Chao Yang, Wenjing Ma, Yulong Ao
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), published 2018-05-01
DOI: 10.1109/CCGRID.2018.00086 (https://doi.org/10.1109/CCGRID.2018.00086)
Citations: 2

Abstract

Stencil computations arise in a wide variety of scientific and engineering applications and often play a critical role in the performance of extreme-scale simulations. Because stencil kernels are memory-bound, optimizing them is challenging on many leadership supercomputers such as Sunway TaihuLight, which has relatively high computing throughput but relatively low data-movement capability. In this white paper, we present our efforts over the past two years to develop end-to-end implementation and optimization techniques for extreme-scale stencil computations on Sunway TaihuLight. We started by optimizing the 3-D 2nd-order 13-point stencil for nonhydrostatic atmospheric dynamics simulation, an important part of the work that won the 2016 ACM Gordon Bell Prize, and extended it to handle a broader range of realistic and challenging problems, such as the HPGMG benchmark, which consists of memory-hungry stencils, and gaseous wave detonation simulation, which relies on complex high-order stencils. The resulting stencil computation paradigm on Sunway TaihuLight includes not only multilevel parallelization that exploits parallelism at different hardware levels, but also systematic performance optimization techniques for communication, memory access, and computation. Extreme-scale tests show that the paradigm delivers remarkable performance on Sunway TaihuLight with ten million heterogeneous cores. In particular, we achieve an aggregate performance of 23.12 Pflops for the 3-D 5th-order WENO stencil computation in gaseous wave detonation simulation, which is, as far as we know, the highest performance reported for high-order stencil computations, and an aggregate performance of over one trillion unknowns solved per second in the HPGMG benchmark, which ranks first in the HPGMG list of November 2017.
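To make the "memory-bound" remark concrete, the following is a minimal sketch of a 13-point stencil sweep and a rough flops-per-byte estimate. The abstract does not give the stencil's shape or coefficients, so everything here is an assumption: the star layout (center plus neighbors at distance 1 and 2 along each axis), the weights, and the names `make_grid`, `apply_stencil13`, and `arithmetic_intensity` are all hypothetical, and the code is not the authors' kernel.

```python
# Sketch of a 13-point star stencil on a small 3-D grid, in pure Python.
# ASSUMPTION: the stencil shape (center plus neighbors at distance 1 and 2
# along each axis) and the weights below are illustrative, not the paper's.

# Hypothetical weights: they form a standard finite-difference Laplacian
# (unit grid spacing) and sum to zero over the 13 points.
W_CENTER = -7.5
W_NEAR = 4.0 / 3.0
W_FAR = -1.0 / 12.0
HALO = 2  # stencil radius: points this close to the boundary are skipped


def make_grid(n, value=0.0):
    """Return an n x n x n nested list filled with `value`."""
    return [[[value] * n for _ in range(n)] for _ in range(n)]


def apply_stencil13(u):
    """Apply the stencil to every interior point of the cubic grid `u`."""
    n = len(u)
    out = make_grid(n)
    for i in range(HALO, n - HALO):
        for j in range(HALO, n - HALO):
            for k in range(HALO, n - HALO):
                # Six neighbors at distance 1 along the three axes.
                near = (u[i - 1][j][k] + u[i + 1][j][k]
                        + u[i][j - 1][k] + u[i][j + 1][k]
                        + u[i][j][k - 1] + u[i][j][k + 1])
                # Six neighbors at distance 2 along the three axes.
                far = (u[i - 2][j][k] + u[i + 2][j][k]
                       + u[i][j - 2][k] + u[i][j + 2][k]
                       + u[i][j][k - 2] + u[i][j][k + 2])
                out[i][j][k] = (W_CENTER * u[i][j][k]
                                + W_NEAR * near + W_FAR * far)
    return out


def arithmetic_intensity(flops_per_point=15, bytes_per_point=16):
    """Flops per byte for one sweep.

    Defaults assume the grouping above (12 adds + 3 multiplies) and ideal
    reuse: each 8-byte double is read once and written once per point.
    """
    return flops_per_point / bytes_per_point
```

Under these assumptions the kernel performs fewer than one floating-point operation per byte moved, even with ideal data reuse. That is the sense in which a machine with high compute throughput but limited data-movement capability spends most of a stencil sweep waiting on memory.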