Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers
T. Heller, Hartmut Kaiser, Andreas Schäfer, D. Fey
With the general availability of PetaFLOP clusters and the advent of heterogeneous machines equipped with special accelerator cards such as the Xeon Phi [2], computer scientists face the difficult task of improving application scalability beyond what is possible with conventional techniques and programming models today. In addition, the need for highly adaptive runtime algorithms and for applications handling highly inhomogeneous data further impedes our ability to efficiently write code which performs and scales well.

In this paper we present the advantages of using HPX [19, 3, 29], a general-purpose parallel runtime system for applications of any scale, as a backend for LibGeoDecomp [25] to implement a three-dimensional N-body simulation with local interactions. We compare scaling and performance results for this application using the HPX and MPI backends of LibGeoDecomp. LibGeoDecomp is a library for geometric decomposition codes built around a user-supplied simulation model, where the library handles the spatial and temporal loops as well as the data storage.

The presented results were acquired from various homogeneous and heterogeneous runs including up to 1024 nodes (16384 conventional cores) combined with up to 16 Xeon Phi accelerators (3856 hardware threads) on TACC's Stampede supercomputer [1]. In the configuration using the HPX backend, more than 0.35 PFLOPS were achieved, which corresponds to a parallel application efficiency of around 79%. Our measurements demonstrate the advantage of the intrinsically asynchronous, message-driven programming model exposed by HPX, which enables better latency hiding, fine- to medium-grain parallelism, and constraint-based synchronization. HPX's uniform programming model simplifies writing highly parallel code for heterogeneous resources.
{"title":"Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers","authors":"T. Heller, Hartmut Kaiser, Andreas Schäfer, D. Fey","doi":"10.1145/2530268.2530269","DOIUrl":"https://doi.org/10.1145/2530268.2530269","url":null,"abstract":"With the general availability of PetaFLOP clusters and the advent of heterogeneous machines equipped with special accelerator cards such as the Xeon Phi[2], computer scientist face the difficult task of improving application scalability beyond what is possible with conventional techniques and programming models today. In addition, the need for highly adaptive runtime algorithms and for applications handling highly inhomogeneous data further impedes our ability to efficiently write code which performs and scales well.\u0000 In this paper we present the advantages of using HPX[19, 3, 29], a general purpose parallel runtime system for applications of any scale as a backend for LibGeoDecomp[25] for implementing a three-dimensional N-Body simulation with local interactions. We compare scaling and performance results for this application while using the HPX and MPI backends for LibGeoDecomp. LibGeoDecomp is a Library for Geometric Decomposition codes implementing the idea of a user supplied simulation model, where the library handles the spatial and temporal loops, and the data storage.\u0000 The presented results are acquired from various homogeneous and heterogeneous runs including up to 1024 nodes (16384 conventional cores) combined with up to 16 Xeon Phi accelerators (3856 hardware threads) on TACC's Stampede supercomputer[1]. In the configuration using the HPX backend, more than 0.35 PFLOPS have been achieved, which corresponds to a parallel application efficiency of around 79%. Our measurements demonstrate the advantage of using the intrinsically asynchronous and message driven programming model exposed by HPX which enables better latency hiding, fine to medium grain parallelism, and constraint based synchronization. HPX's uniform programming model simplifies writing highly parallel code for heterogeneous resources.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123234225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On scalability behaviour of Monte Carlo sparse approximate inverse for matrix computations
J. Strassburg, V. Alexandrov
This paper presents a Monte Carlo SPAI pre-conditioner. In contrast to standard deterministic SPAI pre-conditioners that use the Frobenius norm, a Monte Carlo alternative is given that relies on Markov chain Monte Carlo (MCMC) methods to compute a rough matrix inverse (MI). Monte Carlo methods enable a quick rough estimate of the non-zero elements of the inverse matrix with a given precision and a certain probability. The advantages of this method are that the same approach applies to sparse and dense matrices, and that the complexity of the Monte Carlo matrix inversion is linear in the size of the matrix. The behaviour of the proposed algorithm is studied, its performance is investigated, and a comparison with the standard deterministic SPAI, as well as the optimized and parallel MSPAI version, is made. Furthermore, Monte Carlo SPAI and MSPAI are used for solving systems of linear algebraic equations (SLAE) with BiCGSTAB, and the results are compared.
{"title":"On scalability behaviour of Monte Carlo sparse approximate inverse for matrix computations","authors":"J. Strassburg, V. Alexandrov","doi":"10.1145/2530268.2530274","DOIUrl":"https://doi.org/10.1145/2530268.2530274","url":null,"abstract":"This paper presents a Monte Carlo SPAI pre-conditioner. In contrast to the standard deterministic SPAI pre-conditioners that use the Frobenius norm, a Monte Carlo alternative that relies on the use of Markov Chain Monte Carlo (MCMC) methods to compute a rough matrix inverse (MI) is given. Monte Carlo methods enable a quick rough estimate of the non-zero elements of the inverse matrix with a given precision and certain probability. The advantage of this method is that the same approach is applied to sparse and dense matrices and that complexity of the Monte Carlo matrix inversion is linear of the size of the matrix. The behaviour of the proposed algorithm is studied, its performance is investigated and a comparison with the standard deterministic SPAI, as well as the optimized and parallel MSPAI version is made. Further Monte Carlo SPAI and MSPAI are used for solving systems of linear algebraic equations (SLAE) using BiCGSTAB and a comparison of the results is made.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124172162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust distributed orthogonalization based on randomized aggregation
W. Gansterer, Gerhard Niederbrucker, H. Straková, Stefan Schulze Grotthoff
The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience to node failures compared to existing aggregation methods. On a hypercube topology, it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation, and it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault-tolerant distributed orthogonalization method (rdmGS), which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms.
{"title":"Robust distributed orthogonalization based on randomized aggregation","authors":"W. Gansterer, Gerhard Niederbrucker, H. Straková, Stefan Schulze Grotthoff","doi":"10.1145/2133173.2133177","DOIUrl":"https://doi.org/10.1145/2133173.2133177","url":null,"abstract":"The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to node failures compared to existing aggregation methods. On a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method (rdmGS), which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"4290 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133388177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On non-blocking collectives in 3D FFTs
R. S. Saksena
With the inclusion of non-blocking global collective operations in the MPI 3.0 draft specification, many fundamental algorithms, such as those for performing three-dimensional (3D) FFTs, will be modified to take advantage of non-blocking collectives. Such modifications will need to be suitable for incorporation in general-purpose FFT libraries routinely used by HPC application users. Here we present such a general-purpose algorithmic strategy for utilizing non-blocking collective communications in the calculation of a single parallel 3D FFT. In this scheme, the global collective communication is partitioned into blocking and non-blocking components such that overlap between communication and computation is obtained in the 3D FFT calculation. We present benchmarks of our scheme for overlapping computation and communication in the calculation of single-variable 3D FFTs on two different architectures: (a) HECToR, a Cray XE6 machine, and (b) a Fujitsu PRIMERGY Intel Westmere cluster with InfiniBand interconnect.
{"title":"On non-blocking collectives in 3D FFTs","authors":"R. S. Saksena","doi":"10.1145/2133173.2133180","DOIUrl":"https://doi.org/10.1145/2133173.2133180","url":null,"abstract":"With the inclusion of non-blocking global collective operations in the MPI 3.0 draft specification many fundamental algorithms such as those for performing 3-dimensional (3D) FFTs will be modified to take advantage of non-blocking collectives. Novel modifications to such fundamental algorithms will need to be suitable for incorporation in general-purpose FFT libraries to be routinely used by HPC application users. Here we present such a general-purpose algorithmic strategy to utilize non-blocking collective communications in the calculation of a single parallel 3D FFT. In this scheme, the global collective communication is partitioned into blocking and non-blocking components such that overlap between communication and computation is obtained in the 3D FFT calculation. We present benchmarks of our scheme for overlapping computation and communication in the calculation of single variable 3D FFTs on two different architectures (a) HECToR, a Cray XE6 machine and (b) a Fujitsu PRIMERGY Intel Westmere cluster with InfiniBand interconnect.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131380313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The low-power architecture approach towards exascale computing
Nikola Rajovic, Nikola Puzovic, L. Vilanova, Carlos Villavieja, Alex Ramírez
Energy efficiency is a first-order concern when deploying any computer system. From battery-operated mobile devices to data centers and supercomputers, energy consumption limits the performance that can be offered.

We are exploring an alternative to current supercomputers that builds on small, energy-efficient mobile processors. We present results from a prototype system based on the ARM Cortex-A9 and make projections about the possibilities for increasing energy efficiency.
{"title":"The low-power architecture approach towards exascale computing","authors":"Nikola Rajovic, Nikola Puzovic, L. Vilanova, Carlos Villavieja, Alex Ramírez","doi":"10.1145/2133173.2133175","DOIUrl":"https://doi.org/10.1145/2133173.2133175","url":null,"abstract":"Energy efficiency is a first-order concern when deploying any computer system. From battery-operated mobile devices, to data centers and supercomputers, energy consumption limits the performance that can be offered.\u0000 We are exploring an alternative to current supercomputers that builds on the small energy-efficient mobile processors. We present results from the prototype system based on ARM Cortex-A9 and make projections about the possibilities to increase energy efficiency.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123793422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soft error resilient QR factorization for hybrid system with GPGPU
Peng Du, P. Luszczek, S. Tomov, J. Dongarra
General-purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern than during the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance, but also increases the possibility of the computations being affected by soft errors. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include (1) a checkpointing and recovery mechanism for the left factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault-tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
{"title":"Soft error resilient QR factorization for hybrid system with GPGPU","authors":"Peng Du, P. Luszczek, S. Tomov, J. Dongarra","doi":"10.1145/2133173.2133179","DOIUrl":"https://doi.org/10.1145/2133173.2133179","url":null,"abstract":"The general purpose graphics processing units (GPGPU) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern compared to the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance but also increases the possibility of the computations being affected by soft errors. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include (1) a checkpointing and recovery mechanism for the left-factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to an upper triangular form for the protection of the right factor R, and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can success- fully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121355153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance analysis of a cardiac simulation code using IPM
P. Strazdins, M. Hegland
This paper details our experiences in performing a detailed performance analysis of a large-scale parallel cardiac simulation with the Chaste software on a Nehalem- and InfiniBand-based cluster. Our methodology achieves good accuracy for relatively modest amounts of cluster time. The use of sections in the Chaste internal profiler, coupled with the IPM tool, enabled detailed insights into the performance and scalability of the application.

For large core counts, our analysis showed that performance was no longer dominated by the linear systems solver. The computationally intensive components scaled well up to 2048 cores, while poorly scaling and highly imbalanced components associated with program output and miscellaneous functions limited scalability.
{"title":"Performance analysis of a cardiac simulation code using IPM","authors":"P. Strazdins, M. Hegland","doi":"10.1145/2133173.2133186","DOIUrl":"https://doi.org/10.1145/2133173.2133186","url":null,"abstract":"This paper details our experiences in performing a detailed performance analysis on a large-scale parallel cardiac simulation by the Chaste software on an Nehalem and Infiniband-based cluster. Our methodology achieves good accuracy for relatively modest amounts of cluster time. The use of sections in the Chaste internal profiler, coupled with the IPM tool, enabled some detailed insights into the performance and scalability of the application.\u0000 For large core counts, our analysis showed that performance was no longer dominated by the linear systems solver. The computationally-intensive components scaled well up to 2048 cores, and poorly scaling and highly imbalanced components associated with program output and miscellaneous functions were limiting scalability.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"PP 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115520935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fault tolerant matrix-matrix multiplication: correcting soft errors on-line
Panruo Wu, Chong Ding, Longxiang Chen, Feng Gao, T. Davies, Christer Karlsson, Zizhong Chen
Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation results can no longer be trusted. A well-known technique to correct soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). While ABFT achieves much better efficiency than triple modular redundancy (TMR), a traditional general technique to correct soft errors, both ABFT and TMR detect errors off-line after the computation is finished. This paper extends the traditional ABFT technique from off-line to on-line so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation during program execution, and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().
{"title":"Fault tolerant matrix-matrix multiplication: correcting soft errors on-line","authors":"Panruo Wu, Chong Ding, Longxiang Chen, Feng Gao, T. Davies, Christer Karlsson, Zizhong Chen","doi":"10.1145/2133173.2133185","DOIUrl":"https://doi.org/10.1145/2133173.2133185","url":null,"abstract":"Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation results can not be trusted any more. A well known technique to correct soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). While ABFT achieves much better efficiency than triple modular redundancy (TMR) - a traditional general technique to correct soft errors, both ABFT and TMR detect errors off-line after the computation is finished. This paper extends the traditional ABFT technique from off-line to on-line so that soft errors in matrix-matrix multiplication can be detect in the middle of the computation during the program execution and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114319460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Layout-aware scientific computing: a case study using MILC
Jun He, J. Kowalkowski, M. Paterno, D. Holmgren, J. Simone, Xian-He Sun
Nowadays, high-performance computers have more cores and nodes than ever before. Computation is spread out among them, leading to more communication. For this reason, communication can easily become the bottleneck of a system and limit its scalability. The layout of an application on a computer is the key factor in preserving communication locality and reducing its cost. In this paper, we propose a simple model to optimize the layout of scientific applications by minimizing inter-node communication cost. The model takes into account the latency and bandwidth of the network and associates them with the dominant layout variables of the application. We take MILC as an example and analyze its communication patterns. According to our experimental results, the model developed for MILC achieved satisfactory accuracy in predicting performance, leading to up to 31% performance improvement.
{"title":"Layout-aware scientific computing: a case study using MILC","authors":"Jun He, J. Kowalkowski, M. Paterno, D. Holmgren, J. Simone, Xian-He Sun","doi":"10.1145/2133173.2133183","DOIUrl":"https://doi.org/10.1145/2133173.2133183","url":null,"abstract":"Nowadays, high performance computers have more cores and nodes than ever before. Computation is spread out among them, leading to more communication. For this reason, communication can easily become the bottleneck of a system and limit its scalability. The layout of an application on a computer is the key factor to preserve communication locality and reduce its cost. In this paper, we propose a simple model to optimize the layout for scientific applications by minimizing inter-node communication cost. The model takes into account the latency and bandwidth of the network and associates them with the dominant layout variables of the application. We take MILC as an example and analyze its communication patterns. According to our experimental results, the model developed for MILC achieved a satisfactory accuracy for predicting the performance, leading to up to 31% performance improvement.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121588533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Top down programming methodology and tools with StarSs - enabling scalable programming paradigms: extended abstract
Rosa M. Badia
Current supercomputers are evolving into clusters with a very large number of nodes and, what is more, the nodes themselves are becoming more complex, composed of several multicore chips and GPUs. With such architectures, application developers face an increasingly complex task. On the other hand, most HPC applications are scientific legacy codes written in MPI and designed for at most thousands of processors. Current efforts deal with extending these applications to scale to larger numbers of cores and with combining them with CUDA or OpenCL to run efficiently on GPUs.

To evolve a given application so that it is suitable for new heterogeneous supercomputers, application developers can take different alternatives: optimizations to relieve MPI bottlenecks, for example by using asynchronous communications; optimizations of the sequential code to improve its locality; or optimizations at the node level to avoid resource contention, to list a few.

This paper proposes a methodology to enable current MPI applications to be improved using the MPI/StarSs programming model. StarSs [2] is a task-based programming model that parallelizes sequential applications by means of annotating the code with compiler directives. More importantly, it supports execution on heterogeneous platforms, including clusters of GPUs. It also hybridizes nicely with MPI [1] and enables the overlap of communication and computation.

The approach is based on the generation at execution time of a directed acyclic graph (DAG), where the nodes of the graph denote tasks in the application and the edges denote data dependences between tasks. Once a partial DAG has been generated, the StarSs runtime is able to schedule the tasks to the different cores or GPUs of the platform.

Another relevant aspect is that the programming model offers application developers a single name space while the actual memory addresses can be distributed (as in a cluster or a node with GPUs). The StarSs runtime maintains a hierarchical directory with information about where to find each block of data, and different software caches are maintained in each of the distributed memory spaces. The runtime is responsible for transferring data between the different memory spaces and for keeping them coherent.

While the programming model itself comes with a very simple syntax, identifying tasks may sometimes not be as easy as one might predict, especially when trying to taskify MPI applications. To simplify this process, a set of tools has been developed around the framework: Ssgrind, which helps identify tasks and the directionality of the tasks' parameters; Ayudame and Temanejo, which help debug StarSs applications; and Paraver, Cube and Scalasca, which enable a detailed performance analysis of the applications.

The extended version of the paper will detail the programming methodology outlined here, illustrating it with examples.
{"title":"Top down programming methodology and tools with StarSs - enabling scalable programming paradigms: extended abstract","authors":"Rosa M. Badia","doi":"10.1145/2133173.2133182","DOIUrl":"https://doi.org/10.1145/2133173.2133182","url":null,"abstract":"Current supercomputers are evolving to clusters with a very large number of nodes, and what is more, the nodes are each time becoming more complex composed of several multicore chips and GPUs. With such architectures, the application developers are every time facing a more complex task. On the other hand, most HPC applications are scientific legacy codes written in MPI and designed for at most thousands of processors. Current efforts deal with extending these applications to scale to larger number of cores and to be combined with CUDA or OpenCL to efficienly run on GPUs.\u0000 To evolve a given application to be suitable to run in new heterogeneous supercomputers, application developers can take different alternatives. Optimizations to improve the MPI bottlenecks, for example, by using asynchronous communications, or optimizations on the sequential code to improve its locality, or optimizations at the node level to avoid resource contention, to list a few.\u0000 This paper proposes a methodology to enable current MPI applications to be improved using the MPI/StarSs programming model. StarSs [2] is a task-based programming model that enables to parallelize sequential applications by means of annotating the code with compiler directives. What is more important, it supports their execution in heterogeneous platforms, including clusters of GPUs. Also it nicely hybridizes with MPI [1], and enables the overlap of communication and computation.\u0000 The approach is based on the generation at execution time of a directed acyclic graph (DAG), where the nodes of the graph denote tasks in the application and edges denote data dependences between tasks. Once a partial DAG has been generated, the StarSs runtime is able to schedule the tasks to the different cores or GPUs of the platform.\u0000 Another relevant aspect is that the programming model offers to the application developers a single name space while the actual memory addresses can be distributed (as in a cluster or a node with GPUs). The StarSs runtime maintains a hierarchical directory with information about where to find each block of data and different software caches are maintained in each of the distributed memory spaces. The runtime is responsible for transferring the data between the different memory spaces and for keeping the coherence.\u0000 While the programming model itself comes with a very simple syntax, identifying tasks may sometimes not be as easy as one can predict, especially when trying to taskify MPI applications. With the purpose of simplifying this process, a set of tools has been developed to conform with the framework: Ssgrind, that helps identifying tasks and the directionality of the tasksâǍŹ parameters, Ayudame and Temanejo, to help debugging StarSs applications, and Paraver, Cube and Scalasca, that enable a detailed performance analysis of the applications. 
The extended version of the paper will detail the programming methodology outlined illustrating it with examples.","PeriodicalId":259517,"journal":{"name":"ACM SIGPLAN Symposium on Scala","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130002104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
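To illustrate the directive-based taskification described above, here is a schematic blocked matrix multiply annotated in the style of the SMPSs flavor of StarSs. The exact pragma syntax varies between StarSs incarnations, and this is a hedged sketch rather than code from the paper; with a plain C++ compiler the pragmas are ignored and the code runs sequentially.

```cpp
// Blocked matrix multiply taskified with StarSs-style annotations: the
// runtime builds the task DAG from the input/inout clauses and serializes
// only tasks whose blocks actually conflict.
#define NB 4   // blocks per dimension
#define BS 64  // block size

typedef double block_t[BS][BS];

#pragma css task input(a, b) inout(c)
void block_dgemm(block_t a, block_t b, block_t c)
{
    for (int i = 0; i < BS; ++i)
        for (int k = 0; k < BS; ++k)
            for (int j = 0; j < BS; ++j)
                c[i][j] += a[i][k] * b[k][j];
}

void matmul(block_t A[NB][NB], block_t B[NB][NB], block_t C[NB][NB])
{
    // Each call spawns a task; tasks writing different C blocks may run
    // concurrently on any core or GPU the runtime chooses.
    for (int i = 0; i < NB; ++i)
        for (int j = 0; j < NB; ++j)
            for (int k = 0; k < NB; ++k)
                block_dgemm(A[i][k], B[k][j], C[i][j]);
}
```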