Introducing Tetra: An Educational Parallel Programming System
Ian Finlayson, Jerome Mueller, S. Rajapakse, Daniel Easterling
doi:10.1109/IPDPSW.2015.51
Although we are firmly in the multicore era, parallel programming is not as widespread as it could be, either in the software industry or in education. There have been many calls to incorporate more parallel programming content into undergraduate computer science education. One obstacle is that the languages most commonly used for parallel programming are detailed, low-level languages such as C, C++, and Fortran (with OpenMP or MPI), OpenCL, and CUDA. These languages let programmers write very efficient code, but that matters little to those whose goal is to learn the concepts of parallel computing. This paper introduces Tetra, a parallel programming language that provides parallelism as a first-class language feature, offers garbage collection, and is designed to be as simple as possible. Tetra also includes an integrated development environment specifically geared toward debugging parallel programs and visualizing program execution across multiple threads.
Auto-tuning Non-blocking Collective Communication Operations
Youcef Barigou, V. Venkatesan, E. Gabriel
doi:10.1109/IPDPSW.2015.15
Collective operations are widely used in large-scale scientific applications and are critical to the scalability of these applications at large process counts. It has also been demonstrated that collective operations must be carefully tuned for a given platform and application scenario to maximize their performance. Non-blocking collective operations extend the concept of collective operations by offering the additional benefit of overlapping communication and computation. This paper presents the automatic run-time tuning of non-blocking collective communication operations, which allows the communication library to choose the best-performing implementation of a non-blocking collective operation on a case-by-case basis. The paper demonstrates that libraries using a single algorithm or implementation for a non-blocking collective operation inevitably deliver suboptimal performance in many scenarios, validating the necessity of run-time tuning for these operations. The benefits of the approach are further demonstrated for an application kernel using a multi-dimensional Fast Fourier Transform. The results obtained for the application scenario indicate a performance improvement of up to 40% compared to the current state of the art.
Streamlining Whole Function Vectorization in C Using Higher Order Vector Semantics
Gil Rapaport, A. Zaks, Y. Ben-Asher
doi:10.1109/IPDPSW.2015.37
Taking full advantage of SIMD instructions in C programs still requires tedious and non-portable programming with intrinsics, despite the considerable effort spent developing auto-vectorization capabilities in recent decades. Whole Function Vectorization (WFV) is a recent technique for extending the use of SIMD across entire functions. WFV has so far been used only in data-parallel languages such as OpenCL and ISPC. We propose a vector-oriented programming framework that facilitates WFV directly in C. We show that our framework achieves performance competitive with OpenCL and ISPC while maintaining C's original syntax and semantics. This allows C programmers to gain better performance for their applications by improving SIMD utilization, without stepping out of C.
Performance Portable Applications for Hardware Accelerators: Lessons Learned from SPEC ACCEL
G. Juckeland, Alexander Grund, W. Nagel
doi:10.1109/IPDPSW.2015.26
The popular and diverse hardware accelerator ecosystem makes apples-to-apples comparisons between platforms rather difficult. SPEC ACCEL offers a yardstick for comparing different accelerator hardware and software ecosystems. This paper uses the SPEC benchmark to compare an AMD GPU, an NVIDIA GPU, and an Intel Xeon Phi with respect to performance and energy consumption. It also provides observations on performance portability between the different platforms. Since the SPEC ACCEL OpenACC suite cannot yet be run on a Xeon Phi, that suite was ported to OpenMP 4.0 target directives to enable the comparison. The challenges and solutions involved in porting the 15 applications are described as well.
Bridging the Gap between Performance and Bounds of Cholesky Factorization on Heterogeneous Platforms
E. Agullo, Olivier Beaumont, Lionel Eyraud-Dubois, J. Herrmann, Suraj Kumar, L. Marchal, Samuel Thibault
doi:10.1109/IPDPSW.2015.35
We consider the problem of allocating and scheduling dense linear algebra applications on fully heterogeneous platforms made of CPUs and GPUs. More specifically, we focus on the Cholesky factorization, since it exhibits the main features of such problems. Indeed, the relative performance of CPUs and GPUs depends strongly on the sub-routine: GPUs are, for instance, much more efficient at processing regular kernels such as matrix-matrix multiplication than at more irregular kernels such as matrix factorization. In this context, one solution is to rely on dynamic scheduling and resource allocation mechanisms such as those provided by PaRSEC or StarPU. In this paper we analyze the performance of dynamic schedulers based on both actual executions and simulations, and we investigate how adding static rules, derived from an offline analysis of the problem, to their decision process can improve their performance, up to reaching improved theoretical performance bounds which we introduce.
A Crossbar Interconnection Network in DNA
B. Talawar
doi:10.1109/IPDPSW.2015.103
DNA computers provide exciting challenges and opportunities in the fields of computer architecture, neural networks, autonomous micromechanical devices, and chemical reaction networks. The advent of digital abstractions such as seesaw gates holds many opportunities for computer architects to realize complex digital circuits using DNA strand displacement principles. This paper presents a realization of a single-bit 2×2 crossbar interconnection network built from seesaw gates. The functional correctness of the implemented crossbar was verified using a chemical reaction simulator.
Parallel Asynchronous Modified Newton Methods for Network Flows
D. E. Baz, M. Elkihel
doi:10.1109/IPDPSW.2015.34
We consider single-commodity strictly convex network flow problems. The dual problem is unconstrained, differentiable, and well suited to solution via parallel iterative methods. We propose parallel asynchronous modified Newton algorithms for solving the dual problem and prove their convergence. Parallel asynchronous Newton multisplitting algorithms are also considered, and their convergence is shown as well. A first set of computational results is presented and analyzed.
Cache Support in a High Performance Fault-Tolerant Distributed Storage System for Cloud and Big Data
L. Lundberg, Håkan Grahn, D. Ilie, C. Melander
doi:10.1109/IPDPSW.2015.65
Driven by the trends toward Big Data and Cloud Computing, one would like to provide large storage systems that are accessible to many servers. Shared storage can, however, become a performance bottleneck and a single point of failure. Distributed storage systems provide shared storage to the outside world, but internally they consist of a network of servers and disks, thus avoiding both the performance bottleneck and the single point of failure. We introduce a cache into a distributed storage system. The cache must be fault tolerant so that no data is lost in case of a hardware failure; this requirement excludes the common write-invalidate cache consistency protocols. The cache is implemented and evaluated in two steps. The first step focuses on design decisions that improve performance when only one server uses a given file. In the second step we extend the cache with features addressing the case where more than one server accesses the same file. The cache improves throughput significantly compared to having no cache, and the two-step evaluation approach makes it possible to quantify how different design decisions affect the performance of different use cases.
{"title":"HPBC Introduction and Committees","authors":"E. Aubanel, V. Bhavsar, M. Frumkin","doi":"10.1109/IPDPSW.2015.162","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.162","url":null,"abstract":"HPBC Introduction and Committees","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115957842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards a Combined Grouping and Aggregation Algorithm for Fast Query Processing in Columnar Databases with GPUs
S. Meraji, John Keenleyside, Sunil Kamath, Bob Blainey
doi:10.1109/IPDPSW.2015.21
Column-store in-memory databases have received a lot of attention because of their fast query-processing response times on modern multi-core machines. Among database operations, group-by/aggregate is an important and potentially costly operation, and sort-based and hash-based algorithms are the most common ways of processing group-by/aggregate queries. While sort-based algorithms are used in traditional database management systems (DBMSs), hash-based algorithms enable faster query processing in newer columnar databases. Moreover, graphics processing units (GPUs) can be used as fast, high-bandwidth co-processors to improve the query-processing performance of columnar databases. The focus of this article is a prototype for group-by/aggregate operations that we created to exploit GPUs. We present several hash-based algorithms that improve the performance of group-by/aggregate operations on the GPU; their performance depends on parameters such as the number of groups and the choice of hashing algorithm. We show up to a 7.6x improvement in kernel performance over a multi-core CPU implementation when using a partitioned multi-level hash algorithm that exploits both GPU shared and global memory.