A Performance Optimization Framework for the Simultaneous Heterogeneous Computing Platforms
S. Li. DOI: 10.1145/2916026.2916029

Heterogeneous computing platforms with a multicore host system and many-core accelerator devices took a major step into the mainstream HPC market this year with the announcement of the HP Apollo 6000 System's ProLiant XL250a server featuring Intel® Xeon Phi™ coprocessors. Although many application developers attempt to use such platforms in the same way as GPGPU acceleration platforms, doing so forfeits the processing capability of the multicore host processors and introduces power inefficiency in business operations. In this paper, we propose an application optimization framework that turns sequential legacy applications into highly parallel applications that use the hardware resources both on the host CPU and on the accelerator devices to enable simultaneous heterogeneous computing. As a case study, we show how to apply this framework and adopt a structured methodology to develop option pricing applications that take advantage of a heterogeneous computing environment.
Proceedings of the ACM Workshop on Software Engineering Methods for Parallel and High Performance Applications
Atul Kumar, S. Sarkar, M. Gerndt. DOI: 10.1145/2916026

It is our great pleasure to welcome you to the Workshop on Software Engineering Methods for Parallel and High Performance Applications (SEM4HPC 2016).

The workshop aims to discuss parallel computing beyond traditional scientific computing, including its use in developing enterprise and industrial applications. Compared to the traditional sequential computing paradigm, the software development, analysis, and migration tools for parallel and high performance applications are far less mature, which makes it difficult for the IT industry to shift towards the new computing paradigm. The mission of this workshop is to bring together global industry and academic experts in this area to identify the research challenges that exist in software engineering methods for parallel and high performance application development, maintenance, and migration. The workshop also aims to bring out the current state of the art and practice of software engineering methods through case studies, novel research ideas, and keynote and invited talks.

The call for papers attracted submissions from Germany, India, Spain, and the United States. We received eleven full technical papers, of which five were selected, for an acceptance ratio of 45%.

We also encourage attendees to attend the keynote and invited talk presentations. These valuable and insightful talks can and will guide us to a better understanding of the challenges in this area:

Keynote: Challenges in Transition, Kazuaki Ishizaki (IBM Research -- Tokyo, Japan)
Invited Talk: The READEX Project for Dynamic Energy Efficiency Tuning, Michael Gerndt (Technical University of Munich, Germany)
Invited Talk: Developer Productivity in HPC Application Development: An Overview of Recent Techniques, Santonu Sarkar (BITS Pilani -- Goa Campus, India)
LUT Optimization In Implementation Of Combinational Karatsuba Ofman On Virtex-6 FPGA
D. Kapoor, Rahul Yamasani, S. Saurav, Abhishek Bajpai. DOI: 10.1145/2916026.2916030

This paper discusses different approaches to optimizing the combinational logic used in multipliers for generic ECC (Elliptic Curve Cryptography) implementations in the Galois field GF(2^n). First, a combinational multiplier using Karatsuba-Ofman logic with a 2×2 base multiplier has been studied. Proper utilization of Look-Up Tables (LUTs) at the base level results in effective optimization of the hardware resources. Hence, in order to optimize LUT utilization, designs for combinational logic with a 3×3 base and a 2×3 base have been explored, keeping the LUT structure of the Virtex-6 FPGA in mind. Comparisons have shown that 3×3 base multipliers designed using the Karatsuba-Ofman algorithm outperform 2×2 and 2×3 base multipliers in terms of resource utilization. To further maximize utilization of hardware resources, the exploration has been extended to the Shift-and-Add Algorithm (SAA), and it has been found that SAA remains the better choice for shorter operands. Algorithmic and platform-oriented optimization results in efficient hardware implementations. The final proposed design is a hybrid Karatsuba algorithm, which uses SAA at the lower level and Karatsuba-Ofman logic at the higher level. Here again, the 3×3-bit multiplier with the SAA configuration is better than the other two. This approach is a step closer to efficient hardware implementations of fast algorithms, as the hybrid multiplier is found to use the fewest FPGA resources. All the operations in this paper have been performed on a Virtex-6 ML605 board using Xilinx 12.1 as the design tool.
{"title":"Session details: Afternoon Session 1","authors":"S. Sarkar","doi":"10.1145/3248634","DOIUrl":"https://doi.org/10.1145/3248634","url":null,"abstract":"","PeriodicalId":409042,"journal":{"name":"Proceedings of the ACM Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115487903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The READEX Project for Dynamic Energy Efficiency Tuning
M. Gerndt. DOI: 10.1145/2916026.2916033

High Performance Computing (HPC) systems consume a lot of energy. The overall energy consumption is one of the biggest challenges on the way towards exascale computers. Therefore, energy reduction techniques have to be applied on all levels from the basic chip technology up to the data center infrastructure. The READEX project explores the potential of dynamically switching application and system parameters, such as the clock frequency of the cores, to reduce the overall energy consumption of applications. An analysis is performed during application design time to precompute a tuning model that is then input to the runtime tuning library. This library switches the application and system configuration at runtime to adapt to varying application characteristics.
{"title":"Session details: Afternoon Session 2","authors":"M. Gerndt","doi":"10.1145/3248635","DOIUrl":"https://doi.org/10.1145/3248635","url":null,"abstract":"","PeriodicalId":409042,"journal":{"name":"Proceedings of the ACM Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114281546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive GPU Array Layout Auto-Tuning
Nicolas Weber, M. Goesele. DOI: 10.1145/2916026.2916031

Optimal performance is an important goal in compute-intensive applications. For GPU applications, this requires a lot of experience and knowledge about the algorithms and the underlying hardware, making them an ideal target for auto-tuning approaches. We present an auto-tuner which optimizes array layouts in CUDA applications. Depending on the data and program parameters, kernels can have varying optimal configurations. We thus adjust array layouts adaptively at runtime and achieve or even exceed the performance of hand-optimized code. We automatically detect data characteristics to identify different performance scenarios without user input or additional programming. We perform an empirical analysis of the application in order to construct our decision models. Our adaptive optimization in principle requires profiling data for an extremely high number of scenarios, which cannot be exhaustively evaluated for complex applications. We solve this by extending a previously published method that efficiently profiles single kernel calls and enhancing it to find application-wide optimal solutions. Our method is able to optimize applications in a few minutes, reaching speedups of up to 20% compared to hand-optimized code.
Implementing an Efficient Path Based Equivalence Checker for Parallel Programs
S. Bandyopadhyay, K. Banerjee. DOI: 10.1145/2916026.2916027

User-written programs, when transformed by optimizing and parallelizing compilers, can become incorrect if the compiler is not trusted. Establishing the validity of these transformations is therefore a crucial and challenging task. For program verification, PRES+ (Petri net Representation of Embedded Systems) is now well accepted as a model for capturing the data and control flow of a program. In this paper, an efficient path-based equivalence checking method using a simple PRES+ model (which is easier to generate from a program) for validating several optimizing and parallelizing transformations is proposed. The experimental results demonstrate the efficiency of the method.
{"title":"Session details: Morning Session","authors":"Atul Kumar","doi":"10.1145/3248633","DOIUrl":"https://doi.org/10.1145/3248633","url":null,"abstract":"","PeriodicalId":409042,"journal":{"name":"Proceedings of the ACM Workshop on Software Engineering Methods for Parallel and High Performance Applications","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125591553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Developer Productivity in HPC Application Development: An Overview of Recent Techniques
S. Sarkar. DOI: 10.1145/2916026.2916034

Increasing computing power with evolving hardware architectures has led to a change in programming paradigm from serial to parallel. Unlike its sequential counterpart, application building for High Performance Computing (HPC) is extremely challenging for developers. In order to improve programmer productivity, it is necessary to address challenges such as: i) How to abstract the hardware and low-level complexities to make programming easier? ii) What features should a design assistance tool have to simplify application development? iii) How should programming languages be enhanced for HPC? iv) What sort of prediction techniques can be developed to assist programmers in predicting potential speedup? v) Can refactoring techniques solve the issue of parallelizing existing serial code? In this talk we attempt to present a landscape of the existing approaches to assist the software building process in HPC from a developer's point of view, and highlight some important research questions. We also discuss the state of practice in the industry and some of the application-specific tools developed for HPC.