L. Gan, H. Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou
{"title":"在多核和多核架构下扩展和分析模板性能","authors":"L. Gan, H. Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou","doi":"10.1109/PADSW.2014.7097797","DOIUrl":null,"url":null,"abstract":"Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization on cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchy and the different mechanism for issuing and executing parallel instructions lead to the different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck from getting a high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.","PeriodicalId":421740,"journal":{"name":"2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)","volume":"596 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":"{\"title\":\"Scaling and analyzing the stencil performance on multi-core and many-core architectures\",\"authors\":\"L. Gan, H. Fu, Wei Xue, Yangtong Xu, Chao Yang, Xinliang Wang, Zihong Lv, Yang You, Guangwen Yang, Kaijian Ou\",\"doi\":\"10.1109/PADSW.2014.7097797\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization on cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchy and the different mechanism for issuing and executing parallel instructions lead to the different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck from getting a high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.\",\"PeriodicalId\":421740,\"journal\":{\"name\":\"2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)\",\"volume\":\"596 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"13\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PADSW.2014.7097797\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PADSW.2014.7097797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Scaling and analyzing the stencil performance on multi-core and many-core architectures
Stencils are among the most important and time-consuming kernels in many applications. While stencil optimization has been a well-studied topic on CPU platforms, achieving higher performance and efficiency for the evolving numerical stencils on the more recent multi-core and many-core architectures is still an important issue. In this paper, we explore a number of different stencils, ranging from a basic 7-point Jacobi stencil to more complex high-order stencils used in finer numerical simulations. By optimizing and analyzing those stencils on the latest multi-core and many-core architectures (the Intel Sandy Bridge processor, the Intel Xeon Phi coprocessor, and the NVIDIA Fermi C2070 and Kepler K20x GPUs), we investigate the algorithmic and architectural factors that determine the performance and efficiency of the resulting designs. While multi-threading, vectorization, and optimization on cache and other fast buffers are still the most important techniques that provide performance, we observe that the different memory hierarchy and the different mechanism for issuing and executing parallel instructions lead to the different performance behaviors on CPU, MIC and GPU. With vector-like processing units becoming the major provider of computing power on almost all architectures, the compiler's inability to align all the computing and memory operations would become the major bottleneck from getting a high efficiency on current and future platforms. Our specific optimization of the complex WNAD stencil on GPU provides a good example of what the compiler could do to help.