{"title":"A Unifying Programming Model for Parallel Graph Algorithms","authors":"Jeremiah Willcock, A. Lumsdaine","doi":"10.1109/IPDPSW.2015.79","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.79","url":null,"abstract":"Abstractions and programming models simplify the writing of programs by providing a clear mental framework for reasoning about problem domains and for isolating program expression from irrelevant implementation details. This paper focuses on the domain of graph algorithms, where there are several classes of details that we would like to hide from the programmer, including execution model, granularity of decomposition, and data representation. Most current systems expose some or all of these issues at the same level as their graph abstractions, constraining portability and extensibility while also negatively impacting programmer productivity. To address these challenges, this paper presents a unifying generalized SIMD-like programming model (and corresponding C++ implementation) that can be used to uniformly express graph and other irregular applications on a wide range of types of parallelism, decompositions, and data representations. With respect to these issues, we develop a detailed analysis of our approach and compare it to a number of popular alternatives.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132229176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Constrained Static Energy Reduction Using Way-Sharing Target-Banks","authors":"Shounak Chakraborty, Shirshendu Das, H. Kapoor","doi":"10.1109/IPDPSW.2015.49","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.49","url":null,"abstract":"Most chip-multiprocessors share a common, large last-level cache (LLC). In non-uniform cache access (NUCA) architectures, the LLC is divided into multiple banks that can be accessed independently. It has been observed that a principal share of chip power in a CMP is consumed by the LLC banks, and this consumption has two major components: dynamic and static. Techniques have been proposed to reduce the static power consumption of the LLC by powering off under-utilized banks and forwarding their requests to other active banks (target banks). Once a bank is powered off, all future requests arriving at its controller are forwarded to its target bank. Such a bank shutdown process saves static power but reduces LLC performance, and when multiple banks are shut down, the target banks may become overloaded. Additionally, request forwarding increases on-chip traffic. In this paper, we improve the performance of the target banks by dynamically managing their associativity. The cost of request forwarding is optimized by considering network distance as an additional metric for target selection. These two strategies together reduce performance degradation. Experimental analysis shows a 43% reduction in static energy and a 23% reduction in EDP for a 4MB LLC under a performance constraint of 3%.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133876551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Directive-Based Auto-Tuning for the Finite Difference Method on the Xeon Phi","authors":"T. Katagiri, S. Ohshima, M. Matsumoto","doi":"10.1109/IPDPSW.2015.11","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.11","url":null,"abstract":"In this paper, we present a directive-based auto-tuning (AT) framework, called ppOpen-AT, and demonstrate its effect using simulation code based on the Finite Difference Method (FDM). The framework utilizes well-known loop transformation techniques; however, the codes used are carefully designed to minimize the software stack in order to meet the requirements of a many-core architecture currently in operation. The results of evaluations conducted using ppOpen-AT indicate that maximum speedup factors greater than 550% are obtained when it is applied on eight nodes of the Intel Xeon Phi. Further, with AT for data packing and unpacking, a 49% speedup for the whole application is achieved. Using it with strong scaling on 32 nodes of a Xeon Phi cluster, we also obtain a 24% speedup for the overall execution.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"315 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133346632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Burrows Wheeler Compression Using All-Cores","authors":"A. Deshpande, P J Narayanan","doi":"10.1109/IPDPSW.2015.53","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.53","url":null,"abstract":"In this paper, we present an all-core implementation of the Burrows-Wheeler Compression (BWC) algorithm that exploits all computing resources on a system. Our focus is to provide significant benefit to everyday users on common end-to-end applications by exploiting the parallelism of multiple CPU cores and additional accelerators, viz. a many-core GPU, on their machines. The all-core framework is suitable for problems that process large files or buffers in blocks. We treat a system as a set of compute stations and use a work queue to dynamically divide tasks among them. Each compute station uses an implementation that optimally exploits its architecture. We develop a fast GPU BWC algorithm by extending the state-of-the-art GPU string sort to efficiently perform the BWT step of BWC. Our hybrid BWC with GPU acceleration achieves a 2.9× speedup over the best CPU implementation. Our all-core framework allows concurrent processing of blocks by the GPU and all available CPU cores. We achieve a 3.06× speedup using all CPU cores, and a 4.87× speedup when we additionally use an accelerator, i.e., the GPU. Our approach scales to the number and types of computing resources or accelerators found on a system.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116402240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Job Scheduling in the Cloud Using Slowdown Optimization and Sandpile Cellular Automata Model","authors":"Jakub Gasior, F. Seredyński","doi":"10.1109/IPDPSW.2015.139","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.139","url":null,"abstract":"We present in this paper a general framework for studying effective load balancing and scheduling in highly parallel and distributed environments such as current Cloud computing systems. We propose a novel approach based on the concept of the Sandpile cellular automaton: a decentralized multi-agent system working in a critical state at the edge of chaos. Our goal is to provide fairness between concurrent job submissions by minimizing the slowdown of individual applications and dynamically rescheduling them to the best-suited resources. The algorithm design is validated by a number of numerical experiments showing the effectiveness and scalability of the scheme in the presence of a large number of jobs and resources, and its ability to react to dynamic changes in real time.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116629401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-objective Evolutionary Algorithm for Cloud Platform Reconfiguration","authors":"Francois Legillon, N. Melab, Didier Renard, E. Talbi","doi":"10.1109/IPDPSW.2015.138","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.138","url":null,"abstract":"Offers from public IaaS providers often change: new providers enter the market, and existing ones change their pricing or improve their offerings. Deciding whether and how to improve already-deployed platforms, either by reconfiguration or by migration to another provider, can be seen as an NP-hard optimization problem. In this paper, we define a new realistic model for this migration problem, based on a multi-objective optimization formulation. An evolutionary approach with problem-specific operators is introduced to tackle it. Experiments conducted on multiple realistic data sets show that the evolutionary approach can handle real-size instances in a reasonable amount of time.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116632446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PCO Introduction and Committees","authors":"D. E. Baz, B. Uçar","doi":"10.1109/IPDPSW.2015.169","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.169","url":null,"abstract":"PCO Introduction and Committees","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121907241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Coarray Fortran Support with MVAPICH2-X: Initial Experience and Evaluation","authors":"Jian Lin, Khaled Hamidouche, Xiaoyi Lu, Mingzhe Li, D. Panda","doi":"10.1109/IPDPSW.2015.115","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.115","url":null,"abstract":"Coarray Fortran (CAF) is a parallel programming paradigm that extends Fortran with the partitioned global address space (PGAS) programming model at the language level. Current CAF runtime implementations mainly use MPI or GASNet as the underlying communication component. MVAPICH2-X is a hybrid MPI+PGAS programming library with a Unified Communication Runtime (UCR) design. In this paper, the classic CAF runtime implementation in OpenUH is redesigned and rebuilt on top of MVAPICH2-X. The proposed design not only enables support for the hybrid MPI+CAF programming model, but also provides superior performance on most CAF one-sided operations and on the collective operations newly proposed in the Fortran 2015 specification. A comprehensive evaluation with different benchmarks and applications has been performed. Compared with current GASNet-based solutions, the CAF runtime with MVAPICH2-X improves the bandwidth of put and bidirectional operations by up to 3.5X for inter-node communication, and improves the bandwidth of collective communication operations, represented by broadcast, by up to 3.0X on 64 processes. It also reduces the execution time of the NPB CAF benchmarks by up to 18% on 256 processes.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128792390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AsHES Introduction and Committees","authors":"Yunquan Zhang","doi":"10.1109/IPDPSW.2014.217","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.217","url":null,"abstract":"AsHES Introduction and Committees","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123384653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}