Over the last few years, GPUs have become ubiquitous in HPC installations around the world. Today, they provide the main source of performance in a number of Top500 machines - for example Summit, Sierra, and JUWELS Booster. For the upcoming Exascale era as well, GPUs have been selected as key enablers and will be installed in large numbers. While individual GPU devices already offer plenty of performance (O(10) TFLOP/s FP64), current and next-generation supercomputers employ them in the thousands. Using these machines to the fullest extent means not only utilizing individual devices efficiently, but also exploiting the entire interconnected system of devices. JUWELS Booster is a recently installed Tier-0/1 system at Jülich Supercomputing Centre (JSC), currently the 7th-fastest supercomputer in the world and the fastest in Europe. JUWELS Booster features 936 nodes, each equipped with 4 NVIDIA A100 Tensor Core GPUs and 4 Mellanox HDR200 InfiniBand HCAs. The peak performance of all GPUs together sums up to 73 PFLOP/s, and the system features a DragonFly+ network topology with 800 Gbit/s network injection bandwidth per node. During installation of JUWELS Booster, a selected set of applications was given access to the system as part of the JUWELS Booster Early Access Program. To prepare for their first compute time allocation, scientific users were able to gain first experiences on the machine, and they gave direct feedback to the system operations team during installation and beyond. Close collaboration with the application support staff of JSC gave unique insights into the individual processes of utilizing a brand-new large-scale system for the first time. Likewise, performance profiles of applications could be studied and collaboratively analyzed, employing available tools and methods. Performance limiters of specific applications on the platform were identified and proposals for improvement developed.
This talk will present first experiences with JUWELS Booster and the applications utilizing the system during its first months. Applied methods for onboarding, analysis, and optimization will be shown and assessed. Highlights of the state of the art of performance analysis and modeling for GPUs will be presented with concrete examples from the JUWELS Booster Early Access Program.
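As a quick plausibility check on the quoted aggregate peak, the system's 936 nodes with 4 A100 GPUs each can be multiplied by the A100 FP64 Tensor Core peak of 19.5 TFLOP/s - a figure taken from NVIDIA's published specifications, not from the abstract itself:

```python
# Sanity-check the quoted 73 PFLOP/s aggregate GPU peak of JUWELS Booster.
nodes = 936
gpus_per_node = 4
fp64_tc_peak_tflops = 19.5  # published A100 FP64 Tensor Core peak (assumption)

total_tflops = nodes * gpus_per_node * fp64_tc_peak_tflops
print(f"{total_tflops / 1000:.1f} PFLOP/s")  # -> 73.0 PFLOP/s
```

The arithmetic (936 x 4 x 19.5 TFLOP/s = 73,008 TFLOP/s) matches the 73 PFLOP/s stated in the abstract.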
"JUWELS Booster - Early User Experiences." A. Herten. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-25. DOI: 10.1145/3452412.3462752 (https://doi.org/10.1145/3452412.3462752)
Convolutional neural networks (CNNs) drive successful machine learning applications in a growing number of areas. However, training a CNN may take a massive amount of time and expensive high-end GPU resources. CNN training time may change significantly depending on training parameters and GPU type. Therefore, an accurate estimation of CNN training time can help in selecting the training parameters and GPU type which minimise training time and cost. We focus on one training parameter with a particularly significant effect on training time: the mini-batch size. Predicting CNN training time over a wide range of mini-batch sizes is challenging because a small variation in mini-batch size can change the selection of convolution algorithms and cause abrupt changes in training time, which is also affected by non-GPU operations. This paper shows our approach to predicting CNN training time over a wide range of mini-batch sizes by utilising a proxy application to benchmark convolutional and dense layers and by considering non-GPU time. In contrast to prior works, which build one prediction model for all possible CNN configurations, we build simple models that each make highly accurate predictions for one particular CNN. We evaluate our approach using several CNN samples and GPU types and demonstrate that it can yield highly accurate predictions on unseen mini-batch sizes, with a mean percentage error averaged over all experiments of 1.38% (minimum 0.21%, maximum 5.01%).
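A minimal sketch of the kind of per-CNN model the abstract describes, assuming piecewise-linear GPU time per mini-batch-size regime (regimes separated by hypothetical convolution-algorithm breakpoints) plus a constant non-GPU term; the class, helper names, and all timings below are illustrative, not taken from the paper:

```python
def fit_segment(xs, ys):
    """Least-squares line through (batch_size, time) benchmark points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

class PerCNNTimeModel:
    """Per-iteration time model for ONE specific CNN (hypothetical sketch)."""

    def __init__(self, breakpoints, samples, non_gpu_time):
        # breakpoints: batch sizes where the convolution algorithm may
        # switch, splitting the range into separately fitted linear regimes.
        # samples: [(batch_size, measured_gpu_time), ...] from a proxy app.
        self.breakpoints = sorted(breakpoints)
        self.non_gpu = non_gpu_time  # constant non-GPU overhead per iteration
        self.segments = []
        lo = 0
        for hi in self.breakpoints + [float("inf")]:
            pts = [(b, t) for b, t in samples if lo <= b < hi]
            if len(pts) >= 2:
                xs, ys = zip(*pts)
                self.segments.append((lo, hi, *fit_segment(xs, ys)))
            lo = hi

    def predict(self, batch_size):
        for lo, hi, slope, icpt in self.segments:
            if lo <= batch_size < hi:
                return slope * batch_size + icpt + self.non_gpu
        raise ValueError("batch size outside fitted range")
```

For example, with synthetic benchmark points that follow 0.5·b + 2 below a breakpoint at 64 and 0.3·b + 10 above it, plus 1.0 of non-GPU time, the model recovers both regimes and predicts unseen batch sizes in between.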
"Predicting How CNN Training Time Changes on Various Mini-Batch Sizes by Considering Convolution Algorithms and Non-GPU Time." Peter Bryzgalov, T. Maeda, Yutaro Shigeto. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-25. DOI: 10.1145/3452412.3462750 (https://doi.org/10.1145/3452412.3462750)
Connor Scully-Allison, R. Liem, Ana Luisa Veroneze Solórzano, J. Labarta, G. Juckeland, L. Schnorr, Max Katz, Olga Pearce
In this panel, a team of four experts in performance analysis, parallel computing, and distributed systems discusses the future of performance analysis. A particular emphasis will be placed on how the growth of GPUs and cloud computing is changing the landscape of tools and techniques we currently use. The panel will discuss the limitations of today's technology, the barriers to progress, and the research that may help us overcome these barriers, and will provide insight into what future tools may look like. The panel takes the format of a question-and-answer session led by the moderator, combined with interactive communication with the audience.
"Panel Discussion on the Future of Performance Analysis and Engineering." Connor Scully-Allison, R. Liem, Ana Luisa Veroneze Solórzano, J. Labarta, G. Juckeland, L. Schnorr, Max Katz, Olga Pearce. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-25. DOI: 10.1145/3452412.3464484 (https://doi.org/10.1145/3452412.3464484)
Víctor López, Guillem Ramirez Miranda, M. Garcia-Gasulla
This paper presents the design, implementation, and application of TALP, a lightweight, portable, extensible, and scalable tool for online parallel performance measurement. The efficiency metrics reported by TALP allow HPC users to evaluate the parallel efficiency of their executions, both post-mortem and at runtime. The API that TALP provides allows the running application or resource managers to collect performance metrics at runtime, opening the opportunity to adapt the execution dynamically based on the collected metrics. The set of metrics collected by TALP is well defined, independent of the tool, and consolidated. We extend this collection with two additional metrics that differentiate between load imbalance originating within a node (intranode) and between nodes (internode). We evaluate the potential of TALP with three parallel applications that exhibit various parallel issues, and we carefully analyze the introduced overhead to determine its limitations.
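To make the reported metrics concrete, here is an illustrative computation of POP-style efficiency metrics from per-rank useful-computation times, with one plausible intranode/internode decomposition (node-level averages vs. residual imbalance within nodes, assuming equal ranks per node). This is a sketch of the consolidated metric definitions the abstract references, not TALP's actual implementation or API:

```python
def load_balance(useful_by_node, elapsed):
    """useful_by_node: {node_id: [useful compute seconds per rank on that node]}."""
    all_useful = [t for ranks in useful_by_node.values() for t in ranks]
    n_ranks = len(all_useful)

    # Global POP-style metrics.
    lb = (sum(all_useful) / n_ranks) / max(all_useful)   # load balance
    comm_eff = max(all_useful) / elapsed                 # communication efficiency
    parallel_eff = lb * comm_eff                         # parallel efficiency

    # Internode imbalance: compare per-node average useful times.
    node_avgs = [sum(r) / len(r) for r in useful_by_node.values()]
    lb_internode = (sum(node_avgs) / len(node_avgs)) / max(node_avgs)

    # Intranode imbalance: imbalance remaining after the internode share.
    lb_intranode = lb / lb_internode
    return {"LB": lb, "CommEff": comm_eff, "ParEff": parallel_eff,
            "LB_in": lb_intranode, "LB_out": lb_internode}

# Two nodes, two ranks each: node0 does twice the useful work of node1,
# so the imbalance here is purely internode (LB_in == 1.0).
metrics = load_balance({"node0": [4.0, 4.0], "node1": [2.0, 2.0]}, elapsed=5.0)
print(metrics)
```

In this invented example the parallel efficiency is 0.6: a 75% load balance (all of it internode) times an 80% communication efficiency.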
"TALP: A Lightweight Tool to Unveil Parallel Efficiency of Large-scale Executions." Víctor López, Guillem Ramirez Miranda, M. Garcia-Gasulla. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-21. DOI: 10.1145/3452412.3462753 (https://doi.org/10.1145/3452412.3462753)
Vincent Bode, Fariz Huseynli, Martin Schreiber, C. Trinitis, M. Schulz
The rise of heterogeneity in High-Performance Computing (HPC) architectures has caused a spike in the number of viable hardware solutions for different workloads. To take advantage of the increasing possibilities for tailoring hardware to boost software performance, collaboration between hardware manufacturers, computing centers, and application developers must intensify, with the goal of hardware-software co-design. To support the co-design effort, we need efficient methods to compare the performance of the many potential architectures running user-supplied applications. We present the High-Dimensional Exploration and Optimization Tool (HOT), a tool for visualizing and comparing software performance on hybrid CPU/GPU architectures. HOT is currently based on data acquired from Intel's Offload Advisor (I-OA) to model application performance, allowing us to extract performance predictions for existing or custom accelerator architectures. This eliminates the need to port applications to different (parallel) programming models and also avoids benchmarking the application on the target hardware. However, tools like I-OA let users tweak many hardware parameters, making it tedious to evaluate and compare results. HOT therefore focuses on visualizing these high-dimensional design spaces and assists the user in identifying suitable hardware configurations for given applications. Thus, users can gain rapid insights into how hardware and software influence each other in heterogeneous environments. We show the usage of HOT in several case studies. To determine the accuracy of performance data collected with I-OA, we analyze LULESH on different architectures. Next, we apply HOT to the synthetic benchmarks STREAM and 2MM to demonstrate the tool's visualization under these well-defined and known workloads, validating both the tool and its usage. Finally, we apply HOT to the real-world code Gadget and the proxy application LULESH, allowing us to easily identify their bottlenecks and optimize the choice of compute architecture for them.
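HOT itself is a visualization tool; as a minimal stand-in for the "identifying suitable hardware configurations" step, the following sketch filters a set of invented accelerator configurations down to the Pareto front over predicted runtime and hardware cost. All configuration names and figures are hypothetical, and this is not HOT's algorithm:

```python
def pareto_front(configs):
    """Keep configs not dominated on (runtime, cost) - lower is better for both."""
    front = []
    for name, rt, cost in configs:
        dominated = any(r2 <= rt and c2 <= cost and (r2, c2) != (rt, cost)
                        for _, r2, c2 in configs)
        if not dominated:
            front.append(name)
    return front

# Invented predicted runtimes (s) and relative hardware costs.
configs = [
    ("gpu-small",  12.0, 1.0),   # slow but cheap
    ("gpu-medium",  7.5, 2.5),
    ("gpu-large",   5.0, 6.0),   # fast but expensive
    ("gpu-odd",     9.0, 3.0),   # dominated by gpu-medium on both axes
]
print(pareto_front(configs))  # -> ['gpu-small', 'gpu-medium', 'gpu-large']
```

The dominated configuration drops out, leaving the runtime/cost trade-off curve a user would inspect in a high-dimensional design-space view.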
"On the Exploration and Optimization of High-Dimensional Architectural Design Space." Vincent Bode, Fariz Huseynli, Martin Schreiber, C. Trinitis, M. Schulz. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-21. DOI: 10.1145/3452412.3462754 (https://doi.org/10.1145/3452412.3462754)
Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy. DOI: 10.1145/3452412 (https://doi.org/10.1145/3452412)