{"title":"Proceedings of the 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing","authors":"","doi":"10.1145/2830018","DOIUrl":"https://doi.org/10.1145/2830018","url":null,"abstract":"","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2015-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86179583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PTG: An Abstraction for Unhindered Parallelism","authors":"Anthony Danalis, G. Bosilca, Aurélien Bouteiller, T. Hérault, J. Dongarra","doi":"10.1109/WOLFHPC.2014.8","DOIUrl":"https://doi.org/10.1109/WOLFHPC.2014.8","url":null,"abstract":"Increased parallelism and use of heterogeneous computing resources is now an established trend in High Performance Computing (HPC), a trend that, looking forward to Exascale, seems bound to intensify. Despite the evolution of hardware over the past decade, the programming paradigm of choice was invariably derived from Coarse Grain Parallelism with explicit data movements. We argue that message passing has remained the de facto standard in HPC because, until now, the ever increasing challenges that application developers had to address to create efficient portable applications remained manageable for expert programmers. Data-flow based programming is an alternative approach with significant potential. In this paper, we discuss the Parameterized Task Graph (PTG) abstraction and present the specialized input language that we use to specify PTGs in our data-flow task-based runtime system, PaRSEC. This language and the corresponding execution model are in contrast with the execution model of explicit message passing as well as the model of alternative task based runtime systems. The Parameterized Task Graph language decouples the expression of the parallelism in the algorithm from the control-flow ordering, load balance, and data distribution. Thus, programs are more adaptable and map more efficiently on challenging hardware, as well as maintain portability across diverse architectures. To support these claims, we discuss the different challenges of HPC programming and how PaRSEC can address them, and we demonstrate that in today's large scale supercomputers, PaRSEC can significantly outperform state-of-the-art MPI applications and libraries, a trend that will increase with future architectural evolution.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"4 1","pages":"21-30"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87686733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HSLOT: The HERCULES Scriptable Loop Transformations Engine","authors":"Christos Kartsaklis, Eunjung Park, John Cavazos","doi":"10.1109/WOLFHPC.2014.10","DOIUrl":"https://doi.org/10.1109/WOLFHPC.2014.10","url":null,"abstract":"HSLOT arms users with a rich set of configurable transformation directives, to be used as-they-are or to be specialized and combined into powerful custom transformations. We offer a plethora of loop transformations, which includes both the classic set (unroll, fuse, fission, tile, and so on) as well as unique ones (specialize, swap nest, split, fork, and so on) that are not found in other state-of-the-art systems. We show how HSLOT enables more transformations such as merging two loops that cannot be fused because of data dependencies and how HSLOT can be used in a simple and systematic fashion to improve memory accesses and expose better parallelism. To use our system, users simply annotate loops with the transformation sequence and compile with our Open64-based HSLOT-implementing Fortran compiler, HSLF90, which produces both object files and optionally source. We describe our experimental results using a set of scientific kernels written in Fortran with HSLOT directives on a 32-core AMD system.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"40 1","pages":"31-41"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87400122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Data Flow Language to Develop High Performance Computing DSLs","authors":"Alejandro Fernández, Vicenç Beltran, Sergi Mateo, Tomasz Patejko, E. Ayguadé","doi":"10.1109/WOLFHPC.2014.6","DOIUrl":"https://doi.org/10.1109/WOLFHPC.2014.6","url":null,"abstract":"Developing complex scientific applications on high performance systems requires both domain knowledge and expertise in parallel and distributed programming models. In addition, modern high performance systems are heterogeneous, composed of multicores and accelerators, which, despite being efficient and powerful, are harder to program. Domain-Specific Languages (DSLs) are a promising approach to hide the complexity of HPC systems and boost programmer productivity. However, the huge cost and complexity of implementing efficient and scalable DSLs on HPC systems are hindering their adoption for most domains. Addressing such problems, we present Data Flow Language (DFL), a DSL designed to exploit distributed and heterogeneous HPC systems. DFL abstracts the key concepts of such systems as SMP tasks for multicores, kernels for accelerators and high-level operations for distributed computing. In addition, DFL leverages the hybrid MPI/OmpSs data-flow programming model to efficiently implement the previous concepts. All of these features make DFL suitable as the target language for other DSLs. However, it is also suitable as a fast prototyping language to develop distributed applications on heterogeneous systems.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"15 1","pages":"11-20"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80033525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The OPS Domain Specific Abstraction for Multi-block Structured Grid Computations","authors":"I. Reguly, G. Mudalige, M. Giles, Dan Curran, Simon McIntosh-Smith","doi":"10.1109/WOLFHPC.2014.7","DOIUrl":"https://doi.org/10.1109/WOLFHPC.2014.7","url":null,"abstract":"Code maintainability, performance portability and future proofing are some of the key challenges in this era of rapid change in High Performance Computing. Domain Specific Languages and Active Libraries address these challenges by focusing on a single application domain and providing a high-level programming approach, and then subsequently using domain knowledge to deliver high performance on various hardware. In this paper, we introduce the OPS high-level abstraction and active library aimed at multi-block structured grid computations, and discuss some of its key design points; we demonstrate how OPS can be embedded in C/C++ and the API made to look like a traditional library, and how through a combination of simple text manipulation and back-end logic we can enable execution on a diverse range of hardware using different parallel programming approaches. Relying on the access-execute description of the OPS abstraction, we introduce a number of automated execution techniques that enable distributed memory parallelization, optimization of communication patterns, checkpointing and cache-blocking. Using performance results from CloverLeaf from the Mantevo suite of benchmarks, we demonstrate the utility of OPS.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"37 1","pages":"58-67"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76938675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GridPACK: A Framework for Developing Power Grid Simulations on High Performance Computing Platforms","authors":"B. Palmer, W. Perkins, Yousu Chen, Shuangshuang Jin, D. Callahan, Kevin A. Glass, R. Diao, M. Rice, S. Elbert, M. Vallem, Zhenyu Huang","doi":"10.1109/WOLFHPC.2014.12","DOIUrl":"https://doi.org/10.1109/WOLFHPC.2014.12","url":null,"abstract":"This paper describes the GridPACK™ framework, which is designed to help power grid engineers develop modeling software capable of running on high performance computers. The framework makes extensive use of software templates to provide high level functionality while at the same time allowing developers the freedom to express whatever models and algorithms they are using. GridPACK™ contains modules for setting up distributed power grid networks, assigning buses and branches with arbitrary behaviors to the network, creating distributed matrices and vectors and using parallel linear and non-linear solvers to solve algebraic equations. It also provides mappers to create matrices and vectors based on properties of the network and functionality to support IO and to manage errors. The goal of GridPACK™ is to substantially reduce the complexity of writing software for parallel computers while still providing efficient and scalable software solutions. The use of GridPACK™ is illustrated for a simple powerflow example and performance results for powerflow and dynamic simulation are discussed.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"37 1","pages":"68-77"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77787165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the Construction of a Domain-Aware Toolchain for High-Performance Computing","authors":"P. McCormick, Christine Sweeney, Nicholas D. Moss, Dean Prichard, S. Gutierrez, K. Davis, J. Mohd-Yusof","doi":"10.1109/WOLFHPC.2014.9","DOIUrl":"https://doi.org/10.1109/WOLFHPC.2014.9","url":null,"abstract":"The push towards exascale computing has sparked a new set of explorations for providing new productive programming environments. While many efforts are focusing on the design and development of domain-specific languages (DSLs), few have addressed the need for providing a fully domain-aware toolchain. Without such domain awareness, critical features for achieving acceptance and adoption, such as debugger support, pose a long-term risk to the overall success of the DSL approach. In this paper, we explore the use of language extensions to design and implement the Scout DSL and a supporting toolchain infrastructure. We highlight how language features and the software design methodologies used within the toolchain play a significant role in providing a suitable environment for DSL development.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"1 1","pages":"1-10"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82172546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Target-Specific Refinement of Multigrid Codes","authors":"Richard Membarth, P. Slusallek, M. Köster, Roland Leißa, Sebastian Hack","doi":"10.1109/WOLFHPC.2014.5","DOIUrl":"https://doi.org/10.1109/WOLFHPC.2014.5","url":null,"abstract":"This paper applies partial evaluation to stage a stencil code Domain-Specific Language (DSL) onto a functional and imperative programming language. Platform-specific primitives such as scheduling or vectorization, and algorithmic variants such as boundary handling are factored out into a library that makes up the elements of that DSL. We show how partial evaluation can eliminate all overhead of this separation of concerns and create code that resembles hand-crafted versions for a particular target platform. We evaluate our technique by implementing a DSL for the V-cycle multigrid iteration. Our approach generates code for AMD and NVIDIA GPUs (via SPIR and NVVM) as well as for CPUs using AVX/AVX2 alike from the same high-level DSL program. First results show that we achieve a speedup of up to 3x on the CPU by vectorizing multigrid components and a speedup of up to 2x on the GPU by merging the computation of multigrid components.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"17 1","pages":"52-57"},"PeriodicalIF":0.0,"publicationDate":"2014-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85156737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ExaSlang: A Domain-Specific Language for Highly Scalable Multigrid Solvers","authors":"Christian Schmitt, S. Kuckuk, Frank Hannig, H. Köstler, Jürgen Teich","doi":"10.1109/WOLFHPC.2014.11","DOIUrl":"https://doi.org/10.1109/WOLFHPC.2014.11","url":null,"abstract":"High-Performance Computing (HPC) systems are becoming increasingly parallel and heterogeneous. As a consequence, HPC applications, such as simulation software, need to be especially designed towards these systems to achieve optimal performance. This, in turn, leads to higher complexity, requiring software engineers and scientists to have a deep knowledge of the hardware and its technologies. As a remedy, domain-specific languages (DSLs) are a convenient technology for domain experts to describe settings and problems they want to solve using terms and models familiar to them. This specification is transformed into a target language, i.e., source code in another programming language or a binary executable, by a specialized compiler. We propose ExaSlang, a language for the specification of numerical solvers based on the multigrid method targeting distributed-memory systems. Furthermore, we present the transformation framework that drives the corresponding source-to-source compiler. It emits C++ code utilizing a hybrid OpenMP and MPI parallelization. Moreover, we substantiate our approach with scaling results for our code on up to the complete JUQUEEN cluster, consisting of 28,672 nodes, with a total of 458,752 cores.","PeriodicalId":59014,"journal":{"name":"高性能计算技术","volume":"17 1","pages":"42-51"},"PeriodicalIF":0.0,"publicationDate":"2014-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81711717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}