"YASK—Yet Another Stencil Kernel: A Framework for HPC Stencil Code-Generation and Tuning"
Charles R. Yount, Josh Tobin, Alexander Breuer, A. Duran
高性能计算技术 (High Performance Computing Technology), pp. 30-39, 2016-11-13. DOI: 10.1109/WOLFHPC.2016.8
Abstract: Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications. While the code for many problems can certainly be written in a straightforward manner in a high-level language, this often results in sub-optimal performance on modern computing platforms. On the other hand, adding advanced optimizations such as multi-level loop interchanges and vector folding allows the code to perform better, but at the expense of reduced readability, maintainability, and portability. This paper describes the YASK (Yet Another Stencil Kernel) framework, which simplifies the tasks of defining stencil functions, generating high-performance code targeted especially at Intel® Xeon® and Intel® Xeon Phi™ processors, and running tuning experiments. The features of the framework are described, including domain-specific languages (DSLs), code generators for stencil-equation and loop code, and a genetic-algorithm-based automated tuning tool. Two practical use cases are illustrated with real-world examples: the standalone YASK kernel is used to tune an isotropic 3D finite-difference stencil, and the generated YASK code is integrated into an external earthquake simulator.
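As a concrete illustration of the "straightforward high-level" stencil formulation discussed in the abstract (a sketch only, not YASK's DSL; the coefficient and grid size are hypothetical), a naive 7-point 3-D stencil sweep can be written in a few lines of NumPy:

```python
import numpy as np

def step_7pt(u, c):
    # One sweep of a generic 7-point 3-D stencil: each interior point is
    # updated from its six face neighbours. This is the readable but
    # unoptimized formulation that frameworks like YASK then transform.
    v = u.copy()
    v[1:-1, 1:-1, 1:-1] = (1 - 6 * c) * u[1:-1, 1:-1, 1:-1] + c * (
        u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +
        u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +
        u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:])
    return v

u = np.zeros((8, 8, 8))
u[4, 4, 4] = 1.0
v = step_7pt(u, 0.1)   # a point source spreads to its six face neighbours
```

This version leaves vectorization, loop order, and cache blocking entirely to the compiler; closing that performance gap without sacrificing the readability above is exactly what the paper targets.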
"Optimizing Dirac Wilson Operator and linear solvers for Intel KNL"
Bálint Jóo, Aaron C. Walden, Dhiraj D. Kalamkar, T. Kurth, K. Vaidyanathan
高性能计算技术 (High Performance Computing Technology), 2016-10-01. DOI: 10.2172/1988224
(No abstract available.)
"Devito: Automated Fast Finite Difference Computation"
Navjot Kukreja, M. Louboutin, Felippe Vieira, F. Luporini, Michael Lange, G. Gorman
高性能计算技术 (High Performance Computing Technology), pp. 11-19, 2016-08-30. DOI: 10.1109/WOLFHPC.2016.6
Abstract: Domain-specific languages have successfully been used in a variety of fields to cleanly express scientific problems and to simplify implementation and performance optimization on different computer architectures. Although a large number of stencil languages are available, finite-difference domain-specific languages have proved challenging to design because most practical use cases require additional features that fall outside the finite-difference abstraction. Inspired by the complexity of real-world seismic imaging problems, we introduce Devito, a domain-specific language in which high-level equations are expressed using symbolic expressions from the SymPy package. Complex equations are automatically manipulated, optimized, and translated into highly optimized C code that aims to perform comparably to or better than hand-tuned code. All of this is transparent to users, who see only concise symbolic mathematical expressions.
"FDPS: a novel framework for developing high-performance particle simulation codes for distributed-memory systems"
M. Iwasawa, A. Tanikawa, N. Hosono, Keigo Nitadori, T. Muranushi, J. Makino
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830019
Abstract: We have developed FDPS (Framework for Developing Particle Simulator), which enables researchers and programmers to develop high-performance particle simulation codes easily. The basic idea of FDPS is to separate the code for complex parallelization (domain decomposition, redistribution of particles, and exchange of particle information for interaction calculation between nodes) from the actual interaction calculation and orbital integration. FDPS provides the former, and the users write the latter. Thus, a user can implement, for example, a high-performance N-body code in only 120 lines. In this paper, we present the structure and implementation of FDPS and describe its performance on two sample applications: a gravitational N-body simulation and a Smoothed Particle Hydrodynamics simulation. Both codes show very good parallel efficiency and scalability on the K computer. FDPS lets researchers concentrate on the implementation of physics and mathematical schemes without wasting their time on the development and performance tuning of their codes.
"Enhancing domain specific language implementations through ontology"
C. Liao, Pei-Hung Lin, D. Quinlan, Yue Zhao, Xipeng Shen
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830022
Abstract: Domain-specific languages (DSLs) offer an attractive path to programming large-scale, heterogeneous parallel computers, since application developers can leverage high-level annotations defined by DSLs to express algorithms efficiently without being distracted by low-level hardware details. However, the performance of DSL programs relies heavily on how well a DSL implementation, including compilers and runtime systems, can exploit knowledge across multiple layers of the software/hardware environment for optimization. This knowledge ranges from domain assumptions and high-level DSL semantics to low-level hardware features. Traditionally, such knowledge is either implicitly assumed or represented using ad-hoc approaches, including narrative text, source-level annotations, or customized software and hardware specifications in high-performance computing (HPC). The lack of a formal, uniform, extensible, reusable, and scalable knowledge-management approach is becoming a major obstacle to efficient DSL implementations targeting fast-changing parallel architectures. In this paper, we present a novel DSL implementation paradigm that uses an ontology-based knowledge base to formally and uniformly exploit the knowledge needed for optimization. An ontology is a formal and explicit knowledge representation describing the concepts, properties, and individuals in a domain. During the past decades, a wide range of ontology standards and tools have been developed to help users capture, share, utilize, and reason about domain knowledge. Using modern ontology techniques, we design a knowledge base capturing concepts and properties of a problem domain, DSL programs, and hardware architectures. Compiler interfaces are also defined to allow interaction with the knowledge base to assist program analysis, optimization, and code generation. Our preliminary evaluation using stencil computation shows the feasibility and benefits of our approach.
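The core idea of querying an explicit knowledge base during code generation, rather than hard-coding hardware facts into the compiler, can be sketched in miniature (entirely hypothetical names and facts; the paper's ontology tooling is far richer than a dictionary):

```python
# Toy knowledge base of (entity, property) -> value facts about hardware.
# In the paper these would be ontology individuals and properties that a
# reasoner can also derive; here they are simply asserted.
knowledge = {
    ("cpu_x", "simd_width_doubles"): 8,    # e.g. a 512-bit vector unit
    ("cpu_x", "cacheline_bytes"): 64,
}

def pick_inner_tile(arch):
    # Code-generation decision driven by a knowledge-base query:
    # align the innermost tile to a multiple of the SIMD width so
    # vector lanes stay full.
    return 4 * knowledge[(arch, "simd_width_doubles")]

tile = pick_inner_tile("cpu_x")
```

Swapping in facts for a new architecture changes the generated code without touching the compiler logic, which is the maintainability argument the paper makes.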
"Optimizing the LULESH stencil code using concurrent collections"
Chenyang Liu, Milind Kulkarni
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830024
Abstract: Writing scientific applications for modern multicore machines is a challenging task. A myriad of hardware solutions are available for many different target applications, each with its own advantages and trade-offs. An attractive approach is Concurrent Collections (CnC), which provides a programming model that separates the concerns of the application expert from those of the performance expert. CnC uses a data- and control-flow model paired with philosophies from previous data-flow programming models and tuple-space influences. By following the CnC programming paradigm, the runtime seamlessly exploits available parallelism regardless of the platform; however, there are limitations to its effectiveness depending on the algorithm. In this paper, we explore ways to optimize the performance of the proxy application Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH), written using Concurrent Collections. The LULESH algorithm is expressed as a minimally constrained set of partially ordered operations with explicit dependencies. However, performance is plagued by scheduling overhead and synchronization costs caused by the fine granularity of the computation steps. For LULESH and similar stencil codes, we show that an algorithmic CnC program can be tuned by coalescing CnC elements through step fusion and tiling into a well-tuned and scalable application on multi-core systems. With these optimizations, we achieve up to a 38x speedup over the original implementation, with good scalability on machines with up to 48 processors.
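A toy cost model suggests why step fusion pays off (illustrative numbers only, not measurements from the paper): every scheduled step pays a fixed scheduling/synchronization overhead, so coalescing fine-grained steps amortizes that overhead over more useful work.

```python
def total_cost(n_items, items_per_step, overhead_per_step=1.0, work_per_item=0.1):
    # Cost = per-step scheduling overhead + the actual work, which is
    # unchanged by fusion. All constants here are made up for illustration.
    n_steps = n_items // items_per_step
    return n_steps * overhead_per_step + n_items * work_per_item

fine = total_cost(1000, 1)      # one CnC step per item: overhead dominates
fused = total_cost(1000, 100)   # 100 items fused per step: overhead amortized
```

In this model fusion cuts total cost by an order of magnitude; the paper's tiling plays the analogous role of restoring locality once steps are coarsened.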
"From DSL to HPC component-based runtime: a multi-stencil DSL case study"
Julien Bigot, Hélène Coullon, Christian Pérez
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830020
Abstract: High-performance architectures evolve continuously to become more powerful, but they also usually become more difficult to use efficiently. Because scientists are typically not experts in low-level, high-performance programming, domain-specific languages (DSLs) are a promising solution for writing high-performance code automatically and efficiently. However, while DSLs ease programming for scientists, maintainability and portability issues are transferred from scientists to DSL designers. This paper presents an approach to improving the maintainability and programming productivity of DSLs through the generation of a component-based parallel runtime. To study it, the paper presents a DSL for multi-stencil programs, which is evaluated on a real-world case of the shallow-water equations.
"SDSLc: a multi-target domain-specific compiler for stencil computations"
P. Rawat, Martin Kong, Thomas Henretty, Justin Holewinski, Kevin Stock, L. Pouchet, J. Ramanujam, A. Rountev, P. Sadayappan
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830025
Abstract: Stencil computations are at the core of applications in a number of scientific computing domains. We describe a domain-specific language for regular stencil computations that allows the computations to be specified concisely. We describe a multi-target compiler for this DSL, which generates optimized code for GPUs, FPGAs, and multi-core processors with short-vector SIMD instruction sets, considering both low-order and high-order stencil computations. The hardware differences between these three types of architecture prompt different optimization strategies in the compiler. We evaluate the domain-specific compiler using a number of benchmarks on CPU, GPU, and FPGA platforms.
"Puffin: an embedded domain-specific language for existing unstructured hydrodynamics codes"
Christopher W. Earl
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830021
Abstract: In this paper, we present Puffin, a domain-specific language embedded in C++98 for incremental adoption in existing unstructured hydrodynamics codes. Because HPC systems with heterogeneous architectures (traditional CPUs, GPUs, Xeon Phis, etc.) are becoming increasingly common, developers of existing HPC software projects need performance across multiple architectures. While Puffin is not yet complete and so far supports only CPU execution, our aim is for Puffin to provide performance portability to existing unstructured hydrodynamics simulation projects. Our preliminary results focus on two topics. First, we show the costs of using Puffin: adopting it carries an initial cost of rewriting existing code in Puffin, and using it carries the ongoing costs of increased compilation times (2-3x slower) and runtime overhead (0-11% slower). Second, we show the current benefits of using Puffin and mention potential future benefits. We show how Puffin can be adopted gradually into an existing project by doing so with the existing test application, LULESH 2.0, and we show a reduction in code length from porting code to Puffin.
"Reducing overhead in the Uintah framework to support short-lived tasks on GPU-heterogeneous architectures"
B. Peterson, H. Dasari, A. Humphrey, J. Sutherland, T. Saad, M. Berzins
高性能计算技术 (High Performance Computing Technology), 2015-11-15. DOI: 10.1145/2830018.2830023
Abstract: The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive-mesh-refinement grids using modern supercomputers. Uintah is structured with an application layer and a separate runtime system. The runtime system is based on a distributed directed acyclic graph (DAG) of computational tasks, with a task scheduler that efficiently schedules and executes these tasks on both CPU cores and on-node accelerators. The runtime system identifies task dependencies, creates a task graph prior to each iteration based on these dependencies, prepares data for tasks, automatically generates MPI message tags, and manages data after task computation. Managing tasks for accelerators poses significant challenges beyond their CPU counterparts because of additional memory regions, API call latency, memory-bandwidth concerns, and the added complexity of development. These challenges are greatest for tasks that complete within a few milliseconds, especially those with stencil-based computations involving halo data, little data reuse, and/or many computational variables. Current and emerging heterogeneous architectures necessitate addressing these challenges within Uintah. This work is not designed to improve the performance of existing tasks, but rather to reduce runtime overhead so that developers writing short-lived computational tasks can utilize Uintah in a heterogeneous environment. This work analyzes an initial approach for managing accelerator tasks alongside existing CPU tasks within Uintah. The principal contributions of this work are to identify and address inefficiencies that arise when mapping tasks onto the GPU, to implement new schemes to reduce runtime-system overhead, to introduce new features that allow more tasks to leverage on-node accelerators, and to show overhead-reduction results from these improvements.
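The DAG-of-tasks idea at the core of such a runtime can be sketched with Python's standard-library topological sorter (hypothetical task names, not Uintah's API): a task becomes runnable only after all of its data dependencies have completed.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on. This tiny chain mimics
# one iteration of a halo-dependent stencil pipeline; Uintah builds a far
# larger distributed graph and also schedules tasks onto GPUs.
deps = {
    "halo_exchange": {"init"},
    "stencil":       {"halo_exchange"},
    "reduce":        {"stencil"},
}

# A valid execution order respecting every dependency.
order = list(TopologicalSorter(deps).static_order())
```

A real scheduler would use the sorter's `prepare()`/`get_ready()` interface to launch independent ready tasks concurrently; the fixed per-task bookkeeping around each launch is exactly the overhead the paper works to reduce for millisecond-scale tasks.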