Scaling and Optimizing the Gysela Code on a Cluster of Many-Core Processors

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI:10.1109/CAHPC.2018.8645933

G. Latu, Y. Asahi, Julien Bigot, Tamas B. Fehér, V. Grandgirard


Citations: 5

Abstract



The current generation of the Xeon Phi Knights Landing (KNL) processor provides a highly multi-threaded environment on which standard programming models such as MPI/OpenMP can be used. Many factors affect the performance that applications achieve on these devices: one key point is the efficient exploitation of the SIMD vector units, and another is the memory access pattern. Work has been carried out to adapt a plasma turbulence application, Gysela, to this architecture. Several techniques were used: standard vectorization techniques, auto-tuning of one computation kernel, and switching to a high-order scheme. As a result, KNL execution times have been reduced by up to a factor of 3. The same effort also yielded a speedup of 2x on the Broadwell architecture and 3x on Skylake. Good scalability up to a few thousand cores was obtained in a strong-scaling experiment. This incremental work brought a large payoff without resorting to low-level intrinsics.
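The abstract names auto-tuning of one computation kernel as one of the techniques but does not describe how it was done. As a rough illustration of the general idea only, the sketch below times a toy kernel for several candidate block sizes and keeps the fastest. The kernel, the function names (`blocked_sum_of_squares`, `autotune`) and the candidate values are all hypothetical and are not taken from Gysela.

```python
import time


def blocked_sum_of_squares(data, block_size):
    """Toy stand-in kernel (hypothetical): sum of squares, processed block by block."""
    total = 0.0
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        total += sum(x * x for x in block)
    return total


def autotune(kernel, data, candidates, repeats=3):
    """Time the kernel once per candidate parameter and return the fastest one.

    For each candidate, the best of `repeats` runs is kept to reduce timing noise.
    """
    timings = {}
    for block_size in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(data, block_size)
            best = min(best, time.perf_counter() - t0)
        timings[block_size] = best
    fastest = min(timings, key=timings.get)
    return fastest, timings


# Assumed problem size and candidate block sizes, chosen only for illustration.
data = [float(i) for i in range(20_000)]
candidates = [64, 256, 1024, 4096]
best_block, timings = autotune(blocked_sum_of_squares, data, candidates)
print("fastest block size:", best_block)
```

The selected value depends on the machine and on the run, which is why such a search is typically repeated on each target architecture rather than fixed once in the source.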
