{"title":"On Generating Out-Of-Core GPU Code for Multi-Dimensional Array Operations","authors":"P. van Beurden, S. Scholz","doi":"10.1145/3587216.3587223","DOIUrl":null,"url":null,"abstract":"This paper presents the first results of our experiments for generating CUDA code that streams array operations over the elements of its array arguments from high-level specifications. We look at two classes of memory-bound array operations: map-like operations and iterative stencil computations. We investigate code patterns that stream the arguments of these operations from the host through the GPU and back taking the iterative nature of our experiments into account. We show that this setup does not only enable computations on arrays that are so big that they do not fit into the device memory of a single GPU (hence “out-of-core“), but we also demonstrate that the proposed streaming code outperforms non-streaming code versions even for smaller array sizes. For both application patterns, we observe memory throughputs that are beyond 80% of the hardware capability, irrespective of the problem sizes.","PeriodicalId":318613,"journal":{"name":"Proceedings of the 34th Symposium on Implementation and Application of Functional Languages","volume":"824 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th Symposium on Implementation and Application of Functional Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3587216.3587223","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
This paper presents the first results of our experiments for generating CUDA code that streams array operations over the elements of its array arguments from high-level specifications. We look at two classes of memory-bound array operations: map-like operations and iterative stencil computations. We investigate code patterns that stream the arguments of these operations from the host through the GPU and back taking the iterative nature of our experiments into account. We show that this setup does not only enable computations on arrays that are so big that they do not fit into the device memory of a single GPU (hence “out-of-core“), but we also demonstrate that the proposed streaming code outperforms non-streaming code versions even for smaller array sizes. For both application patterns, we observe memory throughputs that are beyond 80% of the hardware capability, irrespective of the problem sizes.