F. Madsen, Robert Clifton-Everest, M. Chakravarty, G. Keller
Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures. In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two level hierarchy which allows the programmer to specify a partitioning strategy which facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters. In this paper, we present the language design for the sequence operations, as well as the compilation and runtime support, and demonstrate with a set of benchmarks the feasibility of this approach.
{"title":"Functional array streams","authors":"F. Madsen, Robert Clifton-Everest, M. Chakravarty, G. Keller","doi":"10.1145/2808091.2808094","DOIUrl":"https://doi.org/10.1145/2808091.2808094","url":null,"abstract":"Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures. In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two level hierarchy which allows the programmer to specify a partitioning strategy which facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters. In this paper, we present the language design for the sequence operations, as well as the compilation and runtime support, and demonstrate with a set of benchmarks the feasibility of this approach.","PeriodicalId":440468,"journal":{"name":"Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117132607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While high-level abstractions greatly simplify program development, they ultimately need to be eliminated to produce high-performance code. This can be done using generative programming; one particularly usable approach is Lightweight Modular Staging. We present Scalan, a framework which enables compilation of high-level object-oriented-functional code into high-performance low-level code. It extends the basic LMS approach by making rewrite rules and compilation stages first-class and extending the graph IR with object-oriented features. Rewrite rules are represented as graph IR nodes with edges pointing to a pattern graph and a replacement graph; whenever new nodes are constructed, they are compared with the pattern graphs of all active rules and in case a match is found, the corresponding replacement graph is generated instead. Compilation stages are represented as graph transformers and together with the final output generation stage assembled into a compilation pipeline. This allows using multiple backends together, for example generating C/C++ code with JNI wrappers for the most performance-critical parts and Spark code which calls into it for the rest. We will show how object-oriented programming is supported by staging class constructors and method calls (including "factory" methods on companion objects) as part of the IR, thus exposing them to rewrite rules like all other operations. JVM mechanisms allow treating symbols as typed proxies for their corresponding nodes. Now it becomes necessary to eliminate such nodes at some compilation stage to avoid virtual dispatch in the output code (or at least minimize it for object-oriented target languages). In the simple case when the receiver node of a method is a class constructor, we can simply delegate the call to the subject at that stage. The more interesting case when the receiver node is the result of a calculation is handled by isomorphic specialization. This effectively enables virtual dispatch to be carried out at staging time, as described in our previous work. We will demonstrate how we use a Scala compiler plugin to further simplify development by avoiding the explicit use of the Rep type constructor and how our framework can handle effects using free monads. We will finish by discussing future plans for Scalan development.
{"title":"Scalan: a framework for domain-specific hotspot optimization (invited tutorial)","authors":"A. Slesarenko, Alexey Romanov","doi":"10.1145/2808091.2814203","DOIUrl":"https://doi.org/10.1145/2808091.2814203","url":null,"abstract":"While high-level abstractions greatly simplify program development, they ultimately need to be eliminated to produce high-performance code. This can be done using generative programming; one particularly usable approach is Lightweight Modular Staging. We present Scalan, a framework which enables compilation of high-level object-oriented-functional code into high-performance low-level code. It extends the basic LMS approach by making rewrite rules and compilation stages first-class and extending the graph IR with object-oriented features. Rewrite rules are represented as graph IR nodes with edges pointing to a pattern graph and a replacement graph; whenever new nodes are constructed, they are compared with the pattern graphs of all active rules and in case a match is found, the corresponding replacement graph is generated instead. Compilation stages are represented as graph transformers and together with the final output generation stage assembled into a compilation pipeline. This allows using multiple backends together, for example generating C/C++ code with JNI wrappers for the most performance-critical parts and Spark code which calls into it for the rest. We will show how object-oriented programming is supported by staging class constructors and method calls (including \"factory\" methods on companion objects) as part of the IR, thus exposing them to rewrite rules like all other operations. JVM mechanisms allow treating symbols as typed proxies for their corresponding nodes. Now it becomes necessary to eliminate such nodes at some compilation stage to avoid virtual dispatch in the output code (or at least minimize it for object-oriented target languages). In the simple case when the receiver node of a method is a class constructor, we can simply delegate the call to the subject at that stage. The more interesting case when the receiver node is the result of a calculation is handled by isomorphic specialization. This effectively enables virtual dispatch to be carried out at staging time, as described in our previous work. We will demonstrate how we use a Scala compiler plugin to further simplify development by avoiding the explicit use of the Rep type constructor and how our framework can handle effects using free monads. We will finish by discussing future plans for Scalan development.","PeriodicalId":440468,"journal":{"name":"Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114425450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing","authors":"Tiark Rompf, G. Mainland","doi":"10.1145/2808091","DOIUrl":"https://doi.org/10.1145/2808091","url":null,"abstract":"","PeriodicalId":440468,"journal":{"name":"Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122381719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Vollmer, Bo Joel Svensson, Eric Holk, Ryan Newton
Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell). Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces. We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code.
{"title":"Meta-programming and auto-tuning in the search for high performance GPU code","authors":"Michael Vollmer, Bo Joel Svensson, Eric Holk, Ryan Newton","doi":"10.1145/2808091.2808092","DOIUrl":"https://doi.org/10.1145/2808091.2808092","url":null,"abstract":"Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost in ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell). Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces. We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code.","PeriodicalId":440468,"journal":{"name":"Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing","volume":"107 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131346517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present the Asuna system which supports implicitly heterogeneous multi-stage programming based on MetaOCaml, a multi-stage extension of OCaml. Our system allows programmers to write code generators in a high-level language, and generated code can be translated to a program in low-level languages such as C and LLVM. The high-level code generators can make use of all the features of MetaOCaml such as algebraic data types and higher-order functions while the generated code may include low-level CPU instructions such as vector (SIMD) operations. One can write programs in a modular and type-safe programming style and can directly represent low-level optimizations. Asuna is a multi-target system, that means a single code generator can generate code in C and LLVM, without changing the generator. The translation by Asuna preserves typing and all generated code is guaranteed to be well typed and well scoped. In this paper, we explain the practical aspect of Asuna, using examples taken from high-performance computing.
{"title":"Generate and offshore: type-safe and modular code generation for low-level optimization","authors":"Naoki Takashima, Hiroki Sakamoto, Yukiyoshi Kameyama","doi":"10.1145/2808091.2808096","DOIUrl":"https://doi.org/10.1145/2808091.2808096","url":null,"abstract":"We present the Asuna system which supports implicitly heterogeneous multi-stage programming based on MetaOCaml, a multi-stage extension of OCaml. Our system allows programmers to write code generators in a high-level language, and generated code can be translated to a program in low-level languages such as C and LLVM. The high-level code generators can make use of all the features of MetaOCaml such as algebraic data types and higher-order functions while the generated code may include low-level CPU instructions such as vector (SIMD) operations. One can write programs in a modular and type-safe programming style and can directly represent low-level optimizations. Asuna is a multi-target system, that means a single code generator can generate code in C and LLVM, without changing the generator. The translation by Asuna preserves typing and all generated code is guaranteed to be well typed and well scoped. In this paper, we explain the practical aspect of Asuna, using examples taken from high-performance computing.","PeriodicalId":440468,"journal":{"name":"Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114714519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel implementation of topological algorithms is highly desirable, but the challenges, from reconstructing algorithms around independent threads through to runtime load balancing, have proven to be formidable. This problem, made all the more acute by the diversity of hardware platforms, has led to new kinds of implementation platform for computational science, with sophisticated runtime systems managing and coordinating large threadcounts to keep processing elements heavily utilized. While simpler and more portable than direct management of threads, these approaches still entangle program logic with resource management. Similar kinds of highly parallel runtime system have also been developed for functional languages. Here, however, language support for higher-order functions allows a cleaner separation between the algorithm and `skeletons' that express generic patterns of parallel computation. We report results on using this technique to develop a distributed version of the Joint Contour Net, a generalization of the Contour Tree to multifields. We present performance comparisons against a recent Haskell implementation using shared-memory parallelism, and initial work on a skeleton for distributed memory implementation that utilizes an innovative strategy to reduce inter-process communication overheads.
{"title":"Skeletons for distributed topological computation","authors":"D. Duke, Fouzhan Hosseini","doi":"10.1145/2808091.2808095","DOIUrl":"https://doi.org/10.1145/2808091.2808095","url":null,"abstract":"Parallel implementation of topological algorithms is highly desirable, but the challenges, from reconstructing algorithms around independent threads through to runtime load balancing, have proven to be formidable. This problem, made all the more acute by the diversity of hardware platforms, has led to new kinds of implementation platform for computational science, with sophisticated runtime systems managing and coordinating large threadcounts to keep processing elements heavily utilized. While simpler and more portable than direct management of threads, these approaches still entangle program logic with resource management. Similar kinds of highly parallel runtime system have also been developed for functional languages. Here, however, language support for higher-order functions allows a cleaner separation between the algorithm and `skeletons' that express generic patterns of parallel computation. We report results on using this technique to develop a distributed version of the Joint Contour Net, a generalization of the Contour Tree to multifields. We present performance comparisons against a recent Haskell implementation using shared-memory parallelism, and initial work on a skeleton for distributed memory implementation that utilizes an innovative strategy to reduce inter-process communication overheads.","PeriodicalId":440468,"journal":{"name":"Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122452220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bo Joel Svensson, Michael Vollmer, Eric Holk, T. L. McDonell, Ryan Newton
High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel.
{"title":"Converting data-parallelism to task-parallelism by rewrites: purely functional programs across multiple GPUs","authors":"Bo Joel Svensson, Michael Vollmer, Eric Holk, T. L. McDonell, Ryan Newton","doi":"10.1145/2808091.2808093","DOIUrl":"https://doi.org/10.1145/2808091.2808093","url":null,"abstract":"High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance---even programs that begin as a single data-parallel kernel.","PeriodicalId":440468,"journal":{"name":"Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123925835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}