在多核和多核架构下扩展和分析模板性能

2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS) Pub Date : 2014-12-01 DOI:10.1109/PADSW.2014.7097797

L. Gan, H. Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou

{"title":"在多核和多核架构下扩展和分析模板性能","authors":"L. Gan, H. Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou","doi":"10.1109/PADSW.2014.7097797","DOIUrl":null,"url":null,"abstract":"Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization on cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchy and the different mechanism for issuing and executing parallel instructions lead to the different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck from getting a high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.","PeriodicalId":421740,"journal":{"name":"2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)","volume":"596 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Scaling and analyzing the stencil performance on multi-core and many-core architectures\",\"authors\":\"L. Gan, H. Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou\",\"doi\":\"10.1109/PADSW.2014.7097797\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization on cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchy and the different mechanism for issuing and executing parallel instructions lead to the different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck from getting a high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.\",\"PeriodicalId\":421740,\"journal\":{\"name\":\"2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)\",\"volume\":\"596 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PADSW.2014.7097797\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PADSW.2014.7097797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 13

摘要

模板是许多应用程序中最重要和最耗时的内核之一。虽然模板优化已经成为CPU平台上一个被广泛研究的话题，但在最近的多核和多核架构上，为不断发展的数值模板实现更高的性能和效率仍然是一个重要的问题。在本文中，我们探索了许多不同的模板，从基本的7点雅可比模板到更复杂的高阶模板，用于更精细的数值模拟。通过在最新的多核和多核架构(英特尔Sandy Bridge处理器、英特尔Xeon Phi协处理器、NVIDIA Fermi C2070和Kepler K20x gpu)上优化和分析这些模板，我们研究了决定最终设计性能和效率的算法和架构因素。虽然多线程、向量化以及缓存和其他快速缓冲区上的优化仍然是提供性能的最重要技术，但我们观察到，不同的内存层次结构以及发出和执行并行指令的不同机制导致CPU、MIC和GPU上的不同性能行为。随着类矢量处理单元成为几乎所有体系结构上计算能力的主要提供者，编译器无法将所有计算和内存操作对齐将成为当前和未来平台上获得高效率的主要瓶颈。我们对GPU上复杂的WNAD模板的具体优化提供了一个很好的例子，说明编译器可以做些什么来提供帮助。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Scaling and analyzing the stencil performance on multi-core and many-core architectures

Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization on cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchy and the different mechanism for issuing and executing parallel instructions lead to the different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck from getting a high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)

自引率

0.00%

发文量

期刊最新文献

Optimal bandwidth allocation with dynamic multi-path routing for non-critical traffic in AFDX networks Sensor-free corner shape detection by wireless networks Accelerated variance reduction methods on GPU Fault-Tolerant bi-directional communications in web-based applications Performance analysis of HPC applications with irregular tree data structures