Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing: Latest Articles


Functional array streams

Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing

Pub Date : 2015-08-30 DOI: 10.1145/2808091.2808094

F. Madsen, Robert Clifton-Everest, M. Chakravarty, G. Keller

Regular array languages for high performance computing based on aggregate operations provide a convenient parallel programming model, which enables the generation of efficient code for SIMD architectures, such as GPUs. However, the data sets that can be processed with current implementations are severely constrained by the limited amount of main memory available in these architectures. In this paper, we propose an extension of the embedded array language Accelerate with a notion of sequences, resulting in a two-level hierarchy that allows the programmer to specify a partitioning strategy, which in turn facilitates automatic resource allocation. Depending on the available memory, the runtime system processes the overall data set in streams of chunks appropriate to the hardware parameters. We present the language design for the sequence operations, as well as the compilation and runtime support, and demonstrate with a set of benchmarks the feasibility of this approach.
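
The chunked processing described in this abstract can be illustrated with a minimal Python sketch. It is not Accelerate's API: the names `chunk_size_for`, `stream_chunks`, and `streamed_sum_of_squares` are hypothetical, and the memory model (a fixed byte budget and a fixed size per element) is an assumption made purely for illustration.

```python
from typing import Iterable, Iterator, List


def chunk_size_for(memory_budget_bytes: int, bytes_per_element: int) -> int:
    """Pick how many elements fit in the (assumed) device memory budget."""
    if bytes_per_element <= 0:
        raise ValueError("bytes_per_element must be positive")
    return max(1, memory_budget_bytes // bytes_per_element)


def stream_chunks(data: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive chunks of at most chunk_size elements."""
    chunk: List[float] = []
    for x in data:
        chunk.append(x)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


def streamed_sum_of_squares(data: Iterable[float],
                            memory_budget_bytes: int,
                            bytes_per_element: int = 8) -> float:
    """Reduce a data set chunk by chunk, holding only one chunk at a time."""
    size = chunk_size_for(memory_budget_bytes, bytes_per_element)
    total = 0.0
    for chunk in stream_chunks(data, size):
        # Each chunk stands in for one aggregate (data-parallel) operation.
        total += sum(x * x for x in chunk)
    return total
```

With a 32-byte budget and 8-byte elements, `streamed_sum_of_squares(range(10), 32)` walks the input in chunks of four elements and accumulates the partial results.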


Citations: 7

Scalan: a framework for domain-specific hotspot optimization (invited tutorial)

Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing

Pub Date : 2015-08-30 DOI: 10.1145/2808091.2814203

A. Slesarenko, Alexey Romanov

While high-level abstractions greatly simplify program development, they ultimately need to be eliminated to produce high-performance code. This can be done using generative programming; one particularly usable approach is Lightweight Modular Staging. We present Scalan, a framework which enables compilation of high-level object-oriented-functional code into high-performance low-level code. It extends the basic LMS approach by making rewrite rules and compilation stages first-class and extending the graph IR with object-oriented features. Rewrite rules are represented as graph IR nodes with edges pointing to a pattern graph and a replacement graph; whenever new nodes are constructed, they are compared with the pattern graphs of all active rules and in case a match is found, the corresponding replacement graph is generated instead. Compilation stages are represented as graph transformers and together with the final output generation stage assembled into a compilation pipeline. This allows using multiple backends together, for example generating C/C++ code with JNI wrappers for the most performance-critical parts and Spark code which calls into it for the rest. We will show how object-oriented programming is supported by staging class constructors and method calls (including "factory" methods on companion objects) as part of the IR, thus exposing them to rewrite rules like all other operations. JVM mechanisms allow treating symbols as typed proxies for their corresponding nodes. Now it becomes necessary to eliminate such nodes at some compilation stage to avoid virtual dispatch in the output code (or at least minimize it for object-oriented target languages). In the simple case when the receiver node of a method is a class constructor, we can simply delegate the call to the subject at that stage. The more interesting case when the receiver node is the result of a calculation is handled by isomorphic specialization. This effectively enables virtual dispatch to be carried out at staging time, as described in our previous work. We will demonstrate how we use a Scala compiler plugin to further simplify development by avoiding the explicit use of the Rep type constructor and how our framework can handle effects using free monads. We will finish by discussing future plans for Scalan development.
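
The idea of checking rewrite rules whenever a new IR node is constructed can be sketched in a few lines of Python. This is a toy model, not Scalan's implementation: `Node`, `Builder`, and `mul_by_one` are hypothetical names, and rules here are plain functions rather than graph IR nodes with pattern and replacement graphs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """A toy IR node: an operator name plus argument nodes or constants."""
    op: str
    args: Tuple[object, ...] = ()


# A rule inspects a candidate node and returns a replacement, or None.
Rule = Callable[[Node], Optional[Node]]


class Builder:
    """Constructs nodes, consulting every active rule before keeping a node."""

    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def make(self, op: str, *args: object) -> Node:
        node = Node(op, args)
        for rule in self.rules:
            replacement = rule(node)
            if replacement is not None:
                return replacement  # emit the replacement instead of the node
        return node


def mul_by_one(node: Node) -> Optional[Node]:
    """Rewrite `x * 1` to `x`."""
    if (node.op == "mul" and len(node.args) == 2
            and isinstance(node.args[0], Node)
            and node.args[1] == Node("const", (1,))):
        return node.args[0]
    return None
```

After `b = Builder()` and `b.rules.append(mul_by_one)`, building `b.make("mul", x, b.make("const", 1))` returns `x` itself, while `b.make("mul", x, x)` is kept unchanged because no rule matches.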



Citations: 1

Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing

Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing

Pub Date : 2015-08-30 DOI: 10.1145/2808091

Tiark Rompf, G. Mainland

Citations: 0

Meta-programming and auto-tuning in the search for high performance GPU code

Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing

Pub Date : 2015-08-30 DOI: 10.1145/2808091.2808092

Michael Vollmer, Bo Joel Svensson, Eric Holk, Ryan Newton

Writing high performance GPGPU code is often difficult and time-consuming, potentially requiring laborious manual tuning of low-level details. Despite these challenges, the cost of ignoring GPUs in high performance computing is increasingly large. Auto-tuning is a potential solution to the problem of tedious manual tuning. We present a framework for auto-tuning GPU kernels which are expressed in an embedded DSL, and which expose compile-time parameters for tuning. Our framework allows for kernels to be polymorphic over what search strategy will tune them, and allows search strategies to be implemented in the same meta-language as the kernel-generation code (Haskell). Further, we show how to use functional programming abstractions to enforce regular (hyper-rectangular) search spaces. We also evaluate several common search strategies on a variety of kernels, and demonstrate that the framework can tune both EDSL and ordinary CUDA code.
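
A rough Python sketch of tuning that is polymorphic over the search strategy follows. The paper's framework is written in Haskell and its API is not shown in this abstract, so everything here is an assumption: `exhaustive`, `random_search`, `tune`, and `toy_cost` are hypothetical names, and `toy_cost` stands in for a measured kernel run time. The search space is hyper-rectangular in the sense that each parameter has its own independent range.

```python
import itertools
import random
from typing import Callable, Dict, Iterator, Sequence

Params = Dict[str, int]
Space = Dict[str, Sequence[int]]  # one independent range per parameter
Strategy = Callable[[Space], Iterator[Params]]


def exhaustive(space: Space) -> Iterator[Params]:
    """Visit every point of the hyper-rectangular space."""
    names = list(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def random_search(samples: int, seed: int = 0) -> Strategy:
    """Build a strategy that draws `samples` random points from the space."""
    def strategy(space: Space) -> Iterator[Params]:
        rng = random.Random(seed)
        for _ in range(samples):
            yield {name: rng.choice(list(values)) for name, values in space.items()}
    return strategy


def tune(cost: Callable[[Params], float], space: Space, strategy: Strategy) -> Params:
    """Return the lowest-cost configuration that the strategy visits."""
    return min(strategy(space), key=cost)


def toy_cost(p: Params) -> float:
    """Hypothetical stand-in for timing a kernel built with parameters p."""
    return abs(p["block_size"] - 128) + abs(p["unroll"] - 4)
```

The same `tune` call works with either strategy, for example `tune(toy_cost, space, exhaustive)` or `tune(toy_cost, space, random_search(5))`, which is the point of keeping the kernel cost function independent of how the space is explored.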


Citations: 9

Generate and offshore: type-safe and modular code generation for low-level optimization

Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing

Pub Date : 2015-08-30 DOI: 10.1145/2808091.2808096

Naoki Takashima, Hiroki Sakamoto, Yukiyoshi Kameyama

We present the Asuna system which supports implicitly heterogeneous multi-stage programming based on MetaOCaml, a multi-stage extension of OCaml. Our system allows programmers to write code generators in a high-level language, and generated code can be translated to a program in low-level languages such as C and LLVM. The high-level code generators can make use of all the features of MetaOCaml such as algebraic data types and higher-order functions while the generated code may include low-level CPU instructions such as vector (SIMD) operations. One can write programs in a modular and type-safe programming style and can directly represent low-level optimizations. Asuna is a multi-target system, which means that a single code generator can generate code in both C and LLVM without changing the generator. The translation by Asuna preserves typing and all generated code is guaranteed to be well typed and well scoped. In this paper, we explain the practical aspect of Asuna, using examples taken from high-performance computing.
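
To make the notion of a high-level generator emitting low-level code concrete, here is a deliberately simple Python sketch that produces C source for a specialized power function. It is not Asuna or MetaOCaml: `gen_power_c` is a hypothetical name, and this string-based approach offers none of the typing or scoping guarantees the abstract describes.

```python
def gen_power_c(n: int, name: str = "power") -> str:
    """Generate C source computing x**n, unrolled at generation time."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # The loop over n runs in the generator; the emitted C has no loop.
    body = " * ".join(["x"] * n) if n > 0 else "1.0"
    return f"double {name}{n}(double x) {{ return {body}; }}"
```

Calling `gen_power_c(3)` yields a C function whose body is the fully unrolled product, which is the kind of specialization a staged generator performs before the low-level compiler ever sees the code.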


Citations: 4

Skeletons for distributed topological computation

Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing

Pub Date : 2015-08-30 DOI: 10.1145/2808091.2808095

D. Duke, Fouzhan Hosseini

Parallel implementation of topological algorithms is highly desirable, but the challenges, from reconstructing algorithms around independent threads through to runtime load balancing, have proven to be formidable. This problem, made all the more acute by the diversity of hardware platforms, has led to new kinds of implementation platform for computational science, with sophisticated runtime systems managing and coordinating large thread counts to keep processing elements heavily utilized. While simpler and more portable than direct management of threads, these approaches still entangle program logic with resource management. Similar kinds of highly parallel runtime systems have also been developed for functional languages. Here, however, language support for higher-order functions allows a cleaner separation between the algorithm and 'skeletons' that express generic patterns of parallel computation. We report results on using this technique to develop a distributed version of the Joint Contour Net, a generalization of the Contour Tree to multifields. We present performance comparisons against a recent Haskell implementation using shared-memory parallelism, and initial work on a skeleton for distributed memory implementation that utilizes an innovative strategy to reduce inter-process communication overheads.



Citations: 3

Converting data-parallelism to task-parallelism by rewrites: purely functional programs across multiple GPUs

Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing

Pub Date : 2015-08-30 DOI: 10.1145/2808091.2808093

Bo Joel Svensson, Michael Vollmer, Eric Holk, T. L. McDonell, Ryan Newton

High-level domain-specific languages for array processing on the GPU are increasingly common, but they typically only run on a single GPU. As computational power is distributed across more devices, languages must target multiple devices simultaneously. To this end, we present a compositional translation that fissions data-parallel programs in the Accelerate language, allowing subsequent compiler and runtime stages to map computations onto multiple devices for improved performance, even for programs that begin as a single data-parallel kernel.
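
The fission idea, rewriting one data-parallel operation into several independent pieces, can be sketched in Python for the simplest case of a map. This is an illustration only, not the paper's translation for Accelerate: `split` and `fissioned_map` are hypothetical names, and a thread pool stands in for multiple devices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def split(xs: Sequence[A], parts: int) -> List[Sequence[A]]:
    """Split xs into `parts` contiguous pieces whose sizes differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(len(xs), parts)
    pieces, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        pieces.append(xs[start:end])
        start = end
    return pieces


def fissioned_map(f: Callable[[A], B], xs: Sequence[A], devices: int) -> List[B]:
    """Rewrite `map f xs` as the concatenation of `map f` over each piece,
    running every piece as an independent task."""
    pieces = split(xs, devices)
    with ThreadPoolExecutor(max_workers=devices) as pool:
        # pool.map preserves the order of the pieces.
        results = list(pool.map(lambda piece: [f(x) for x in piece], pieces))
    return [y for part in results for y in part]
```

Because the pieces are independent, `fissioned_map(f, xs, devices)` returns the same list as `[f(x) for x in xs]` for a pure function `f`; the rewrite changes only how the work can be scheduled.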


Citations: 3
