"YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning"
Charles R. Yount, Josh Tobin, Alexander Breuer, A. Duran
高性能计算技术 (High Performance Computing Technology), pp. 30-39, 2016-11-13. DOI: 10.1109/WOLFHPC.2016.8
Abstract: Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications. While the code for many problems can certainly be written in a straightforward manner in a high-level language, this often results in sub-optimal performance on modern computing platforms. On the other hand, adding advanced optimizations such as multi-level loop interchanges and vector folding allows the code to perform better, but at the expense of reduced readability, maintainability, and portability. This paper describes the YASK (Yet Another Stencil Kernel) framework, which simplifies the tasks of defining stencil functions, generating high-performance code targeted especially at Intel® Xeon® and Intel® Xeon Phi™ processors, and running tuning experiments. The features of the framework are described, including domain-specific languages (DSLs), code generators for stencil-equation and loop code, and a genetic-algorithm-based automated tuning tool. Two practical use cases are illustrated with real-world examples: the standalone YASK kernel is used to tune an isotropic 3D finite-difference stencil, and the generated YASK code is integrated into an external earthquake simulator.
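As a concrete illustration of the "straightforward high-level" stencil formulation discussed in the abstract (a sketch only, not YASK's DSL; the coefficient and grid size are hypothetical), a naive 7-point 3-D stencil sweep can be written in a few lines of NumPy:

```python
import numpy as np

def step_7pt(u, c):
    # One sweep of a generic 7-point 3-D stencil: each interior point is
    # updated from its six face neighbours. This is the readable but
    # unoptimized formulation that frameworks like YASK then transform.
    v = u.copy()
    v[1:-1, 1:-1, 1:-1] = (1 - 6 * c) * u[1:-1, 1:-1, 1:-1] + c * (
        u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +
        u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +
        u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:])
    return v

u = np.zeros((8, 8, 8))
u[4, 4, 4] = 1.0
v = step_7pt(u, 0.1)   # a point source spreads to its six face neighbours
```

This version leaves vectorization, loop order, and cache blocking entirely to the compiler; closing that performance gap without sacrificing the readability above is exactly what the paper targets.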
"Optimizing Dirac Wilson Operator and linear solvers for Intel KNL"
Bálint Jóo, Aaron C. Walden, Dhiraj D. Kalamkar, T. Kurth, K. Vaidyanathan
高性能计算技术 (High Performance Computing Technology), 2016-10-01. DOI: 10.2172/1988224
(No abstract available.)
"Devito: Automated Fast Finite Difference Computation"
Navjot Kukreja, M. Louboutin, Felippe Vieira, F. Luporini, Michael Lange, G. Gorman
高性能计算技术 (High Performance Computing Technology), pp. 11-19, 2016-08-30. DOI: 10.1109/WOLFHPC.2016.6
Abstract: Domain-specific languages have successfully been used in a variety of fields to cleanly express scientific problems and to simplify implementation and performance optimization on different computer architectures. Although a large number of stencil languages are available, finite-difference domain-specific languages have proved challenging to design because most practical use cases require additional features that fall outside the finite-difference abstraction. Inspired by the complexity of real-world seismic imaging problems, we introduce Devito, a domain-specific language in which high-level equations are expressed using symbolic expressions from the SymPy package. Complex equations are automatically manipulated, optimized, and translated into highly optimized C code that aims to perform comparably to or better than hand-tuned code. All of this is transparent to users, who see only concise symbolic mathematical expressions.
"FDPS: a novel framework for developing high-performance particle simulation codes for distributed-memory systems"
M. Iwasawa, A. Tanikawa, N. Hosono, Keigo Nitadori, T. Muranushi, J. Makino
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830019
Abstract: We have developed FDPS (Framework for Developing Particle Simulator), which enables researchers and programmers to develop high-performance particle simulation codes easily. The basic idea of FDPS is to separate the code for complex parallelization (domain decomposition, redistribution of particles, and exchange of particle information for interaction calculation between nodes) from the actual interaction calculation and orbital integration. FDPS provides the former, and the users write the latter. Thus, a user can implement, for example, a high-performance N-body code in only 120 lines. In this paper, we present the structure and implementation of FDPS and describe its performance on two sample applications: a gravitational N-body simulation and a Smoothed Particle Hydrodynamics simulation. Both codes show very good parallel efficiency and scalability on the K computer. FDPS lets researchers concentrate on the implementation of physics and mathematical schemes without wasting their time on the development and performance tuning of their codes.
"Enhancing domain specific language implementations through ontology"
C. Liao, Pei-Hung Lin, D. Quinlan, Yue Zhao, Xipeng Shen
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830022
Abstract: Domain-specific languages (DSLs) offer an attractive path to programming large-scale, heterogeneous parallel computers, since application developers can leverage high-level annotations defined by DSLs to express algorithms efficiently without being distracted by low-level hardware details. However, the performance of DSL programs relies heavily on how well a DSL implementation, including compilers and runtime systems, can exploit knowledge across multiple layers of the software/hardware environment for optimization. This knowledge ranges from domain assumptions and high-level DSL semantics to low-level hardware features. Traditionally, such knowledge is either implicitly assumed or represented using ad-hoc approaches, including narrative text, source-level annotations, or customized software and hardware specifications in high-performance computing (HPC). The lack of a formal, uniform, extensible, reusable, and scalable knowledge-management approach is becoming a major obstacle to efficient DSL implementations targeting fast-changing parallel architectures. In this paper, we present a novel DSL implementation paradigm that uses an ontology-based knowledge base to formally and uniformly exploit the knowledge needed for optimization. An ontology is a formal and explicit knowledge representation describing the concepts, properties, and individuals in a domain. During the past decades, a wide range of ontology standards and tools have been developed to help users capture, share, utilize, and reason about domain knowledge. Using modern ontology techniques, we design a knowledge base capturing concepts and properties of a problem domain, DSL programs, and hardware architectures. Compiler interfaces are also defined to allow interaction with the knowledge base to assist program analysis, optimization, and code generation. Our preliminary evaluation using stencil computation shows the feasibility and benefits of our approach.
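The core idea of querying an explicit knowledge base during code generation, rather than hard-coding hardware facts into the compiler, can be sketched in miniature (entirely hypothetical names and facts; the paper's ontology tooling is far richer than a dictionary):

```python
# Toy knowledge base of (entity, property) -> value facts about hardware.
# In the paper these would be ontology individuals and properties that a
# reasoner can also derive; here they are simply asserted.
knowledge = {
    ("cpu_x", "simd_width_doubles"): 8,    # e.g. a 512-bit vector unit
    ("cpu_x", "cacheline_bytes"): 64,
}

def pick_inner_tile(arch):
    # Code-generation decision driven by a knowledge-base query:
    # align the innermost tile to a multiple of the SIMD width so
    # vector lanes stay full.
    return 4 * knowledge[(arch, "simd_width_doubles")]

tile = pick_inner_tile("cpu_x")
```

Swapping in facts for a new architecture changes the generated code without touching the compiler logic, which is the maintainability argument the paper makes.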
"Optimizing the LULESH stencil code using concurrent collections"
Chenyang Liu, Milind Kulkarni
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830024
Abstract: Writing scientific applications for modern multicore machines is a challenging task. A myriad of hardware solutions are available for many different target applications, each with its own advantages and trade-offs. An attractive approach is Concurrent Collections (CnC), which provides a programming model that separates the concerns of the application expert from those of the performance expert. CnC uses a data- and control-flow model paired with philosophies from previous data-flow programming models and tuple-space influences. By following the CnC programming paradigm, the runtime seamlessly exploits available parallelism regardless of the platform; however, there are limitations to its effectiveness depending on the algorithm. In this paper, we explore ways to optimize the performance of the proxy application Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH), written using Concurrent Collections. The LULESH algorithm is expressed as a minimally constrained set of partially ordered operations with explicit dependencies. However, performance is plagued by scheduling overhead and synchronization costs caused by the fine granularity of the computation steps. For LULESH and similar stencil codes, we show that an algorithmic CnC program can be tuned by coalescing CnC elements through step fusion and tiling into a well-tuned and scalable application on multi-core systems. With these optimizations, we achieve up to a 38x speedup over the original implementation, with good scalability on machines with up to 48 processors.
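A toy cost model suggests why step fusion pays off (illustrative numbers only, not measurements from the paper): every scheduled step pays a fixed scheduling/synchronization overhead, so coalescing fine-grained steps amortizes that overhead over more useful work.

```python
def total_cost(n_items, items_per_step, overhead_per_step=1.0, work_per_item=0.1):
    # Cost = per-step scheduling overhead + the actual work, which is
    # unchanged by fusion. All constants here are made up for illustration.
    n_steps = n_items // items_per_step
    return n_steps * overhead_per_step + n_items * work_per_item

fine = total_cost(1000, 1)      # one CnC step per item: overhead dominates
fused = total_cost(1000, 100)   # 100 items fused per step: overhead amortized
```

In this model fusion cuts total cost by an order of magnitude; the paper's tiling plays the analogous role of restoring locality once steps are coarsened.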
"From DSL to HPC component-based runtime: a multi-stencil DSL case study"
Julien Bigot, Hélène Coullon, Christian Pérez
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830020
Abstract: High-performance architectures evolve continuously to become more powerful, but they also usually become more difficult to use efficiently. Because scientists are typically not experts in low-level, high-performance programming, domain-specific languages (DSLs) are a promising solution for writing high-performance code automatically and efficiently. However, while DSLs ease programming for scientists, maintainability and portability issues are transferred from scientists to DSL designers. This paper presents an approach to improving the maintainability and programming productivity of DSLs through the generation of a component-based parallel runtime. To study it, the paper presents a DSL for multi-stencil programs, which is evaluated on a real-world case of the shallow-water equations.
"SDSLc: a multi-target domain-specific compiler for stencil computations"
P. Rawat, Martin Kong, Thomas Henretty, Justin Holewinski, Kevin Stock, L. Pouchet, J. Ramanujam, A. Rountev, P. Sadayappan
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830025
Abstract: Stencil computations are at the core of applications in a number of scientific computing domains. We describe a domain-specific language for regular stencil computations that allows the computations to be specified concisely. We describe a multi-target compiler for this DSL, which generates optimized code for GPUs, FPGAs, and multi-core processors with short-vector SIMD instruction sets, considering both low-order and high-order stencil computations. The hardware differences between these three types of architecture prompt different optimization strategies in the compiler. We evaluate the domain-specific compiler using a number of benchmarks on CPU, GPU, and FPGA platforms.
"Puffin: an embedded domain-specific language for existing unstructured hydrodynamics codes"
Christopher W. Earl
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830021
Abstract: In this paper, we present Puffin, a domain-specific language embedded in C++98 for incremental adoption in existing unstructured hydrodynamics codes. Because HPC systems with heterogeneous architectures (traditional CPUs, GPUs, Xeon Phis, etc.) are becoming increasingly common, developers of existing HPC software projects need performance across multiple architectures. While Puffin is not yet complete and so far supports only CPU execution, our aim is for Puffin to provide performance portability to existing unstructured hydrodynamics simulation projects. Our preliminary results focus on two topics. First, we show the costs of using Puffin: adopting it carries an initial cost of rewriting existing code in Puffin, and using it carries the ongoing costs of increased compilation times (2-3x slower) and runtime overhead (0-11% slower). Second, we show the current benefits of using Puffin and mention potential future benefits. We show how Puffin can be adopted gradually into an existing project by doing so with the existing test application, LULESH 2.0, and we show a reduction in code length from porting code to Puffin.
"Reducing overhead in the Uintah framework to support short-lived tasks on GPU-heterogeneous architectures"
B. Peterson, H. Dasari, A. Humphrey, J. Sutherland, T. Saad, M. Berzins
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830023
Abstract: The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive-mesh-refinement grids using modern supercomputers. Uintah is structured with an application layer and a separate runtime system. The runtime system is based on a distributed directed acyclic graph (DAG) of computational tasks, with a task scheduler that efficiently schedules and executes these tasks on both CPU cores and on-node accelerators. The runtime system identifies task dependencies, creates a task graph prior to each iteration based on these dependencies, prepares data for tasks, automatically generates MPI message tags, and manages data after task computation. Managing tasks for accelerators poses significant challenges beyond their CPU counterparts because of additional memory regions, API call latency, memory-bandwidth concerns, and the added complexity of development. These challenges are greatest for tasks that complete within a few milliseconds, especially those with stencil-based computations involving halo data, little data reuse, and/or many computational variables. Current and emerging heterogeneous architectures necessitate addressing these challenges within Uintah. This work is not designed to improve the performance of existing tasks, but rather to reduce runtime overhead so that developers writing short-lived computational tasks can utilize Uintah in a heterogeneous environment. This work analyzes an initial approach for managing accelerator tasks alongside existing CPU tasks within Uintah. The principal contributions of this work are to identify and address inefficiencies that arise when mapping tasks onto the GPU, to implement new schemes to reduce runtime-system overhead, to introduce new features that allow more tasks to leverage on-node accelerators, and to show overhead-reduction results from these improvements.
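The DAG-of-tasks idea at the core of such a runtime can be sketched with Python's standard-library topological sorter (hypothetical task names, not Uintah's API): a task becomes runnable only after all of its data dependencies have completed.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on. This tiny chain mimics
# one iteration of a halo-dependent stencil pipeline; Uintah builds a far
# larger distributed graph and also schedules tasks onto GPUs.
deps = {
    "halo_exchange": {"init"},
    "stencil":       {"halo_exchange"},
    "reduce":        {"stencil"},
}

# A valid execution order respecting every dependency.
order = list(TopologicalSorter(deps).static_order())
```

A real scheduler would use the sorter's `prepare()`/`get_ready()` interface to launch independent ready tasks concurrently; the fixed per-task bookkeeping around each launch is exactly the overhead the paper works to reduce for millisecond-scale tasks.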