Top down programming methodology and tools with StarSs - enabling scalable programming paradigms: extended abstract

Rosa M. Badia
{"title":"自顶向下的编程方法和工具与stars -支持可扩展的编程范例:扩展抽象","authors":"Rosa M. Badia","doi":"10.1145/2133173.2133182","DOIUrl":null,"url":null,"abstract":"Current supercomputers are evolving to clusters with a very large number of nodes, and what is more, the nodes are each time becoming more complex composed of several multicore chips and GPUs. With such architectures, the application developers are every time facing a more complex task. On the other hand, most HPC applications are scientific legacy codes written in MPI and designed for at most thousands of processors. Current efforts deal with extending these applications to scale to larger number of cores and to be combined with CUDA or OpenCL to efficienly run on GPUs.\n To evolve a given application to be suitable to run in new heterogeneous supercomputers, application developers can take different alternatives. Optimizations to improve the MPI bottlenecks, for example, by using asynchronous communications, or optimizations on the sequential code to improve its locality, or optimizations at the node level to avoid resource contention, to list a few.\n This paper proposes a methodology to enable current MPI applications to be improved using the MPI/StarSs programming model. StarSs [2] is a task-based programming model that enables to parallelize sequential applications by means of annotating the code with compiler directives. What is more important, it supports their execution in heterogeneous platforms, including clusters of GPUs. Also it nicely hybridizes with MPI [1], and enables the overlap of communication and computation.\n The approach is based on the generation at execution time of a directed acyclic graph (DAG), where the nodes of the graph denote tasks in the application and edges denote data dependences between tasks. Once a partial DAG has been generated, the StarSs runtime is able to schedule the tasks to the different cores or GPUs of the platform.\n Another relevant aspect is that the programming model offers to the application developers a single name space while the actual memory addresses can be distributed (as in a cluster or a node with GPUs). The StarSs runtime maintains a hierarchical directory with information about where to find each block of data and different software caches are maintained in each of the distributed memory spaces. The runtime is responsible for transferring the data between the different memory spaces and for keeping the coherence.\n While the programming model itself comes with a very simple syntax, identifying tasks may sometimes not be as easy as one can predict, especially when trying to taskify MPI applications. With the purpose of simplifying this process, a set of tools has been developed to conform with the framework: Ssgrind, that helps identifying tasks and the directionality of the tasksâǍŹ parameters, Ayudame and Temanejo, to help debugging StarSs applications, and Paraver, Cube and Scalasca, that enable a detailed performance analysis of the applications. 
The extended version of the paper will detail the programming methodology outlined illustrating it with examples.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Top down programming methodology and tools with StarSs - enabling scalable programming paradigms: extended abstract\",\"authors\":\"Rosa M. Badia\",\"doi\":\"10.1145/2133173.2133182\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Current supercomputers are evolving to clusters with a very large number of nodes, and what is more, the nodes are each time becoming more complex composed of several multicore chips and GPUs. With such architectures, the application developers are every time facing a more complex task. On the other hand, most HPC applications are scientific legacy codes written in MPI and designed for at most thousands of processors. Current efforts deal with extending these applications to scale to larger number of cores and to be combined with CUDA or OpenCL to efficienly run on GPUs.\\n To evolve a given application to be suitable to run in new heterogeneous supercomputers, application developers can take different alternatives. Optimizations to improve the MPI bottlenecks, for example, by using asynchronous communications, or optimizations on the sequential code to improve its locality, or optimizations at the node level to avoid resource contention, to list a few.\\n This paper proposes a methodology to enable current MPI applications to be improved using the MPI/StarSs programming model. StarSs [2] is a task-based programming model that enables to parallelize sequential applications by means of annotating the code with compiler directives. What is more important, it supports their execution in heterogeneous platforms, including clusters of GPUs. Also it nicely hybridizes with MPI [1], and enables the overlap of communication and computation.\\n The approach is based on the generation at execution time of a directed acyclic graph (DAG), where the nodes of the graph denote tasks in the application and edges denote data dependences between tasks. Once a partial DAG has been generated, the StarSs runtime is able to schedule the tasks to the different cores or GPUs of the platform.\\n Another relevant aspect is that the programming model offers to the application developers a single name space while the actual memory addresses can be distributed (as in a cluster or a node with GPUs). The StarSs runtime maintains a hierarchical directory with information about where to find each block of data and different software caches are maintained in each of the distributed memory spaces. The runtime is responsible for transferring the data between the different memory spaces and for keeping the coherence.\\n While the programming model itself comes with a very simple syntax, identifying tasks may sometimes not be as easy as one can predict, especially when trying to taskify MPI applications. With the purpose of simplifying this process, a set of tools has been developed to conform with the framework: Ssgrind, that helps identifying tasks and the directionality of the tasksâǍŹ parameters, Ayudame and Temanejo, to help debugging StarSs applications, and Paraver, Cube and Scalasca, that enable a detailed performance analysis of the applications. 
The extended version of the paper will detail the programming methodology outlined illustrating it with examples.\",\"PeriodicalId\":259517,\"journal\":{\"name\":\"ACM SIGPLAN Symposium on Scala\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM SIGPLAN Symposium on Scala\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2133173.2133182\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM SIGPLAN Symposium on Scala","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2133173.2133182","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Current supercomputers are evolving into clusters with very large numbers of nodes and, what is more, the nodes themselves are becoming increasingly complex, composed of several multicore chips and GPUs. With such architectures, application developers face an increasingly difficult task. On the other hand, most HPC applications are scientific legacy codes written in MPI and designed for at most thousands of processors. Current efforts deal with extending these applications to scale to larger numbers of cores and with combining them with CUDA or OpenCL so that they run efficiently on GPUs.

To evolve a given application so that it is suitable to run on the new heterogeneous supercomputers, application developers can take different alternatives: optimizations that address MPI bottlenecks, for example by using asynchronous communications; optimizations of the sequential code to improve its locality; or optimizations at the node level to avoid resource contention, to list a few.
This paper proposes a methodology that enables current MPI applications to be improved using the MPI/StarSs programming model. StarSs [2] is a task-based programming model that makes it possible to parallelize sequential applications by annotating the code with compiler directives. What is more important, it supports their execution on heterogeneous platforms, including clusters of GPUs. It also hybridizes nicely with MPI [1] and enables the overlap of communication and computation.
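As a rough illustration of the annotation style (a minimal sketch in SMPSs-flavoured StarSs syntax; the exact clause spelling varies across the StarSs family, and the kernel, names and block size BS here are hypothetical, not taken from the paper):

```c
#define BS 256  /* hypothetical block size */

/* Declaring a function as a task: the directive tells the runtime
 * which parameters are only read (input) and which are read and
 * written (inout). The body stays plain sequential C. */
#pragma css task input(A, B) inout(C)
void block_mult(float A[BS][BS], float B[BS][BS], float C[BS][BS])
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
```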
The approach is based on the generation, at execution time, of a directed acyclic graph (DAG) in which the nodes denote tasks in the application and the edges denote data dependences between tasks. Once a partial DAG has been generated, the StarSs runtime is able to schedule the tasks onto the different cores or GPUs of the platform.
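Continuing the hypothetical kernel above, it is the unmodified sequential call structure that drives DAG construction at run time:

```c
#define NB 8  /* hypothetical number of blocks per dimension */

/* NB x NB grids of BS x BS blocks (file scope to keep the sketch small). */
static float A[NB][NB][BS][BS], B[NB][NB][BS][BS], C[NB][NB][BS][BS];

void matmul(void)
{
    /* Every call becomes a node in the DAG. Calls touching the same
     * C[i][j] block (inout) are ordered by an edge; calls on disjoint
     * blocks carry no edge and may be scheduled concurrently onto
     * different cores or GPUs. */
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                block_mult(A[i][k], B[k][j], C[i][j]);
}
```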
Another relevant aspect is that the programming model offers application developers a single name space, while the actual memory addresses can be distributed (as in a cluster or in a node with GPUs). The StarSs runtime maintains a hierarchical directory with information about where to find each block of data, and different software caches are maintained in each of the distributed memory spaces. The runtime is responsible for transferring data between the different memory spaces and for keeping them coherent.
While the programming model itself comes with a very simple syntax, identifying tasks may sometimes not be as easy as one might expect, especially when trying to taskify MPI applications. To simplify this process, a set of tools has been developed around the framework: Ssgrind, which helps identify tasks and the directionality of the tasks' parameters; Ayudame and Temanejo, which help debug StarSs applications; and Paraver, Cube and Scalasca, which enable a detailed performance analysis of the applications.
The extended version of the paper will detail the programming methodology outlined here, illustrating it with examples.