Performance analysis and optimization of the RAMPAGE metal alloy potential generation software
P. Roth, H. Shan, D. Riegner, Nikolas Antolin, S. Sreepathi, L. Oliker, Samuel Williams, S. Moore, W. Windl
DOI: https://doi.org/10.1145/3141865.3141868

The Rapid Alloy Method for Producing Accurate, General Empirical potential generation toolkit (RAMPAGE) is a program for fitting multicomponent interatomic potential functions for metal alloys. In this paper, we describe a collaborative effort between domain scientists and performance engineers to improve the parallelism, scalability, and maintainability of the code. We modified RAMPAGE to use the Message Passing Interface (MPI) for communication and synchronization, to use more than one MPI process when evaluating candidate potential functions, and to have its MPI processes execute functionality that was previously executed by external non-MPI processes. We ported RAMPAGE to run on the Eos and Titan Cray systems at the United States Department of Energy (DOE) Oak Ridge Leadership Computing Facility (OLCF), and the Cori and Edison systems at the DOE National Energy Research Scientific Computing Center (NERSC). Our modifications resulted in a 7x speedup on 8 Eos nodes and scalability up to 2048 processes on the Cori system with Intel Knights Landing processors. To improve the maintainability of the RAMPAGE source code, we introduced several software engineering best practices into the RAMPAGE developers' workflow.
How to test your concurrent software: an approach for the selection of testing techniques
S. Melo, S. Souza, P. L. D. Souza, Jeffrey C. Carver
DOI: https://doi.org/10.1145/3141865.3142468

High-Performance Computing (HPC) applications consist of concurrent programs with multi-process and/or multithreaded models and varying degrees of parallelism. Although their design patterns, models, and principles are similar to those of sequential programs, their non-deterministic behavior makes testing more complex. To address this complexity, several techniques for concurrent software testing have been developed over the past years. However, the transfer of knowledge between academia and industry remains a challenge, mainly due to the lack of a solid base of evidence to assist the decision-making process. This paper proposes the construction of a body of evidence for the concurrent programming field that supports the selection of an adequate testing technique for a software project. We propose a characterization schema that supports decision making and is based on relevant information from the technical literature regarding available techniques, attributes, and concepts of concurrent programming that affect the testing process. The schema classified 109 studies, which compose the preliminary body of evidence. A survey of specialists was conducted to validate the schema with respect to the adequacy and relevance of the defined attributes. The results indicate that the schema is effective and can support teams testing concurrent applications.
MALT: a Malloc tracker
S. Valat, Andres S. Charif-Rubial, W. Jalby
DOI: https://doi.org/10.1145/3141865.3141867

In the early days of computing, memory management was a major concern, as applications had to fit into a few kilobytes of available memory. Hardware evolution has since made this resource cheap; available memory is now close to a few hundred gigabytes. However, the shift to the multi/many-core era is bringing some of these issues back: memory capacity has not kept pace with the increasing number of cores, making memory per thread a scarce resource again. New issues also arise from the need to manage a larger space with many more allocated objects, which increases the probability of memory leaks and of memory management performance problems. With MALT, we provide a tool to track the memory allocated by an application. We map the extracted metrics onto the source code, just as KCachegrind does with Valgrind for CPU performance. Compared to most available tools, MALT can also be used to track potential performance losses due to bad allocation patterns (too many allocations, small allocations, recycled large allocations, short-lived allocations, ...) thanks to the various metrics it exposes to the user. This paper details the metrics extracted by MALT and how we present them to the user through a web-based graphical interface, a feature missing from most available Linux tools.
Transactional actors: communication in transactions
Janwillem Swalens, Joeri De Koster, W. Meuter
DOI: https://doi.org/10.1145/3141865.3141866

Developers often require different concurrency models to fit the various concurrency needs of the different parts of their applications. Many programming languages, such as Clojure, Scala, and Haskell, cater to this need by incorporating different concurrency models. It has been shown that, in practice, developers often combine these concurrency models. However, they are often combined in an ad hoc way, and the semantics of the combination is not always well defined. The starting hypothesis of this paper is that different concurrency models need to be carefully integrated such that the properties of each individual model are still maintained. This paper proposes one such combination, namely the combination of the actor model and software transactional memory. We show that, while both individual models offer strong safety guarantees, these guarantees are no longer valid when they are combined. The main contribution of this paper is a novel hybrid concurrency model called transactional actors that combines both models while preserving their guarantees. The paper also presents an implementation in Clojure and an experimental evaluation of the performance of the transactional actor model.
Facilitating collaboration in high-performance computing projects with an interaction room
Matthias Book, M. Riedel, Helmut Neukirchen, Markus Goetz
DOI: https://doi.org/10.1145/3141865.3142467

The design, development, and deployment of scientific computing applications can be quite complex, as they require scientific, high-performance computing (HPC), and software engineering expertise. However, HPC applications are often developed by end users who are experts in their scientific domain but need support from a supercomputing centre for the engineering and optimization aspects. Cooperation and communication between experts from these quite different disciplines can be difficult. We therefore propose to employ the Interaction Room, a technique that facilitates interdisciplinary collaboration in complex software projects.
Declaring Lua data types for GPU code generation
Paulo Motta
DOI: https://doi.org/10.1145/3141865.3142466

Considerable effort has gone into enabling interpreted languages to take advantage of the computing capabilities of GPUs. Interpreted languages abstract the hardware and its specificities away from the user application, making development less complicated. However, due to hardware dependencies, the code needs to be compiled before execution. We want to compile a Lua function into a GPU kernel as transparently as possible, giving the user access to the underlying hardware without the complexities of traditional GPU programming. The key challenge in this scenario is inferring the data types of variables while interfering as little as possible with the user's programming paradigm.
The influence of HPCToolkit and Score-P on hardware performance counters
Jan-Patrick Lehr, Christian Iwainsky, C. Bischof
DOI: https://doi.org/10.1145/3141865.3141869

Performance measurement and analysis are common tasks for high-performance computing applications. Both sampling and instrumentation approaches to performance measurement can capture hardware performance counter (HWPC) metrics to assess the software's ability to use the functional units of the processor. Since the measurement software usually executes on the same processor, it necessarily competes with the target application for hardware resources. Consequently, the measurement system perturbs the target application, which often results in runtime overhead. While the runtime overhead of different measurement techniques has been studied previously, the extent to which HWPC values are perturbed by the measurement process has not been thoroughly examined. In this paper, we investigate the influence of two widely used performance measurement systems, HPCToolkit (sampling) and Score-P (instrumentation), on HWPC values. Our experiments on the SPEC CPU 2006 C/C++ benchmarks show that, while Score-P's default instrumentation can massively increase runtime, it does not always heavily perturb the relevant HWPC. HPCToolkit, on the other hand, shows no significant runtime overhead but significantly influences some relevant HWPC. We conclude that sufficient baseline measurements are essential for every performance experiment in order to identify the HWPC that remain valid indicators of performance for a given measurement technique. Performance analysis tools therefore need to offer easily accessible means to automate this baseline and validation functionality.
{"title":"Proceedings of the 4th ACM SIGPLAN International Workshop on Software Engineering for Parallel Systems","authors":"","doi":"10.1145/3141865","DOIUrl":"https://doi.org/10.1145/3141865","url":null,"abstract":"","PeriodicalId":424955,"journal":{"name":"Proceedings of the 4th ACM SIGPLAN International Workshop on Software Engineering for Parallel Systems","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125337651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}