Over the last few years, GPUs have become ubiquitous in HPC installations around the world. Today, they provide the main source of performance in a number of Top500 machines - for example Summit, Sierra, and JUWELS Booster. For the upcoming Exascale era as well, GPUs have been selected as key enablers and will be installed in large numbers. While individual GPU devices already offer plenty of performance (O(10) TFLOP/s FP64), current and next-generation supercomputers employ them in the thousands. Using these machines to the fullest extent means not only utilizing individual devices efficiently, but also exploiting the entire interconnected system of devices. JUWELS Booster is a recently installed Tier-0/1 system at Jülich Supercomputing Centre (JSC), currently the 7th-fastest supercomputer in the world and the fastest in Europe. JUWELS Booster features 936 nodes, each equipped with 4 NVIDIA A100 Tensor Core GPUs and 4 Mellanox HDR200 InfiniBand HCAs. The peak performance of all GPUs together sums up to 73 PFLOP/s, and the system features a DragonFly+ network topology with 800 Gbit/s network injection bandwidth per node. During installation of JUWELS Booster, a selected set of applications was given access to the system as part of the JUWELS Booster Early Access Program. To prepare for their first compute time allocation, scientific users were able to gain first experiences on the machine, and they gave direct feedback to the system operations team during installation and beyond. Close collaboration with the application support staff of JSC gave unique insights into the individual processes of utilizing a brand-new large-scale system for the first time. Likewise, performance profiles of applications could be studied and collaboratively analyzed, employing available tools and methods. Performance limiters of specific applications on the platform were identified and proposals for improvement developed.
This talk will present first experiences with JUWELS Booster and the applications utilizing the system during its first months. Applied methods for onboarding, analysis, and optimization will be shown and assessed. Highlights of the state of the art of performance analysis and modeling for GPUs will be presented with concrete examples from the JUWELS Booster Early Access Program.
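As a quick plausibility check on the quoted aggregate peak, the system's 936 nodes with 4 A100 GPUs each can be multiplied by the A100 FP64 Tensor Core peak of 19.5 TFLOP/s - a figure taken from NVIDIA's published specifications, not from the abstract itself:

```python
# Sanity-check the quoted 73 PFLOP/s aggregate GPU peak of JUWELS Booster.
nodes = 936
gpus_per_node = 4
fp64_tc_peak_tflops = 19.5  # published A100 FP64 Tensor Core peak (assumption)

total_tflops = nodes * gpus_per_node * fp64_tc_peak_tflops
print(f"{total_tflops / 1000:.1f} PFLOP/s")  # -> 73.0 PFLOP/s
```

The arithmetic (936 x 4 x 19.5 TFLOP/s = 73,008 TFLOP/s) matches the 73 PFLOP/s stated in the abstract.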
"JUWELS Booster - Early User Experiences." A. Herten. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-25. DOI: 10.1145/3452412.3462752 (https://doi.org/10.1145/3452412.3462752)
Convolutional neural networks (CNNs) drive successful machine learning applications in a growing number of areas. However, training a CNN may take a massive amount of time and expensive high-end GPU resources. CNN training time may change significantly depending on training parameters and GPU type. Therefore, an accurate estimation of CNN training time can help in selecting the training parameters and GPU type which minimise training time and cost. We focus on one training parameter with a particularly significant effect on training time: the mini-batch size. Predicting CNN training time over a wide range of mini-batch sizes is challenging because a small variation in mini-batch size can change the selection of convolution algorithms and cause abrupt changes in training time, which is also affected by non-GPU operations. This paper shows our approach to predicting CNN training time over a wide range of mini-batch sizes by utilising a proxy application to benchmark convolutional and dense layers and by considering non-GPU time. In contrast to prior works, which build one prediction model for all possible CNN configurations, we build simple models that each make highly accurate predictions for one particular CNN. We evaluate our approach using several CNN samples and GPU types and demonstrate that it can yield highly accurate predictions on unseen mini-batch sizes, with a mean percentage error averaged over all experiments of 1.38% (minimum 0.21%, maximum 5.01%).
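A minimal sketch of the kind of per-CNN model the abstract describes, assuming piecewise-linear GPU time per mini-batch-size regime (regimes separated by hypothetical convolution-algorithm breakpoints) plus a constant non-GPU term; the class, helper names, and all timings below are illustrative, not taken from the paper:

```python
def fit_segment(xs, ys):
    """Least-squares line through (batch_size, time) benchmark points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

class PerCNNTimeModel:
    """Per-iteration time model for ONE specific CNN (hypothetical sketch)."""

    def __init__(self, breakpoints, samples, non_gpu_time):
        # breakpoints: batch sizes where the convolution algorithm may
        # switch, splitting the range into separately fitted linear regimes.
        # samples: [(batch_size, measured_gpu_time), ...] from a proxy app.
        self.breakpoints = sorted(breakpoints)
        self.non_gpu = non_gpu_time  # constant non-GPU overhead per iteration
        self.segments = []
        lo = 0
        for hi in self.breakpoints + [float("inf")]:
            pts = [(b, t) for b, t in samples if lo <= b < hi]
            if len(pts) >= 2:
                xs, ys = zip(*pts)
                self.segments.append((lo, hi, *fit_segment(xs, ys)))
            lo = hi

    def predict(self, batch_size):
        for lo, hi, slope, icpt in self.segments:
            if lo <= batch_size < hi:
                return slope * batch_size + icpt + self.non_gpu
        raise ValueError("batch size outside fitted range")
```

For example, with synthetic benchmark points that follow 0.5·b + 2 below a breakpoint at 64 and 0.3·b + 10 above it, plus 1.0 of non-GPU time, the model recovers both regimes and predicts unseen batch sizes in between.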
"Predicting How CNN Training Time Changes on Various Mini-Batch Sizes by Considering Convolution Algorithms and Non-GPU Time." Peter Bryzgalov, T. Maeda, Yutaro Shigeto. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-25. DOI: 10.1145/3452412.3462750 (https://doi.org/10.1145/3452412.3462750)
Connor Scully-Allison, R. Liem, Ana Luisa Veroneze Solórzano, J. Labarta, G. Juckeland, L. Schnorr, Max Katz, Olga Pearce
In this panel, a team of four experts in performance analysis, parallel computing, and distributed systems discusses the future of performance analysis. A particular emphasis will be placed on how the growth of GPUs and cloud computing is changing the landscape of tools and techniques we currently use. The panel will discuss the limitations of today's technology, the barriers to progress, and the research that may help us overcome these barriers, and will provide insight into what future tools may look like. The panel takes the format of a question-and-answer session led by the moderator, combined with interactive communication with the audience.
"Panel Discussion on the Future of Performance Analysis and Engineering." Connor Scully-Allison, R. Liem, Ana Luisa Veroneze Solórzano, J. Labarta, G. Juckeland, L. Schnorr, Max Katz, Olga Pearce. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-25. DOI: 10.1145/3452412.3464484 (https://doi.org/10.1145/3452412.3464484)
Víctor López, Guillem Ramirez Miranda, M. Garcia-Gasulla
This paper presents the design, implementation, and application of TALP, a lightweight, portable, extensible, and scalable tool for online parallel performance measurement. The efficiency metrics reported by TALP allow HPC users to evaluate the parallel efficiency of their executions, both post-mortem and at runtime. The API that TALP provides allows the running application or resource managers to collect performance metrics at runtime, opening the opportunity to adapt the execution dynamically based on the collected metrics. The set of metrics collected by TALP is well defined, independent of the tool, and consolidated. We extend this collection with two additional metrics that differentiate between load imbalance originating within a node (intranode) and between nodes (internode). We evaluate the potential of TALP with three parallel applications that exhibit various parallel issues, and we carefully analyze the introduced overhead to determine its limitations.
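To make the reported metrics concrete, here is an illustrative computation of POP-style efficiency metrics from per-rank useful-computation times, with one plausible intranode/internode decomposition (node-level averages vs. residual imbalance within nodes, assuming equal ranks per node). This is a sketch of the consolidated metric definitions the abstract references, not TALP's actual implementation or API:

```python
def load_balance(useful_by_node, elapsed):
    """useful_by_node: {node_id: [useful compute seconds per rank on that node]}."""
    all_useful = [t for ranks in useful_by_node.values() for t in ranks]
    n_ranks = len(all_useful)

    # Global POP-style metrics.
    lb = (sum(all_useful) / n_ranks) / max(all_useful)   # load balance
    comm_eff = max(all_useful) / elapsed                 # communication efficiency
    parallel_eff = lb * comm_eff                         # parallel efficiency

    # Internode imbalance: compare per-node average useful times.
    node_avgs = [sum(r) / len(r) for r in useful_by_node.values()]
    lb_internode = (sum(node_avgs) / len(node_avgs)) / max(node_avgs)

    # Intranode imbalance: imbalance remaining after the internode share.
    lb_intranode = lb / lb_internode
    return {"LB": lb, "CommEff": comm_eff, "ParEff": parallel_eff,
            "LB_in": lb_intranode, "LB_out": lb_internode}

# Two nodes, two ranks each: node0 does twice the useful work of node1,
# so the imbalance here is purely internode (LB_in == 1.0).
metrics = load_balance({"node0": [4.0, 4.0], "node1": [2.0, 2.0]}, elapsed=5.0)
print(metrics)
```

In this invented example the parallel efficiency is 0.6: a 75% load balance (all of it internode) times an 80% communication efficiency.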
"TALP: A Lightweight Tool to Unveil Parallel Efficiency of Large-scale Executions." Víctor López, Guillem Ramirez Miranda, M. Garcia-Gasulla. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-21. DOI: 10.1145/3452412.3462753 (https://doi.org/10.1145/3452412.3462753)
Vincent Bode, Fariz Huseynli, Martin Schreiber, C. Trinitis, M. Schulz
The rise of heterogeneity in High-Performance Computing (HPC) architectures has caused a spike in the number of viable hardware solutions for different workloads. To take advantage of the increasing possibilities for tailoring hardware to boost software performance, collaboration between hardware manufacturers, computing centers, and application developers must intensify, with the goal of hardware-software co-design. To support the co-design effort, we need efficient methods to compare the performance of the many potential architectures running user-supplied applications. We present the High-Dimensional Exploration and Optimization Tool (HOT), a tool for visualizing and comparing software performance on hybrid CPU/GPU architectures. HOT is currently based on data acquired from Intel's Offload Advisor (I-OA) to model application performance, allowing us to extract performance predictions for existing or custom accelerator architectures. This eliminates the need to port applications to different (parallel) programming models and also avoids benchmarking the application on the target hardware. However, tools like I-OA let users tweak many hardware parameters, making it tedious to evaluate and compare results. HOT therefore focuses on visualizing these high-dimensional design spaces and assists the user in identifying suitable hardware configurations for given applications. Thus, users can gain rapid insights into how hardware and software influence each other in heterogeneous environments. We show the usage of HOT in several case studies. To determine the accuracy of performance data collected with I-OA, we analyze LULESH on different architectures. Next, we apply HOT to the synthetic benchmarks STREAM and 2MM to demonstrate the tool's visualization under these well-defined and known workloads, validating both the tool and its usage. Finally, we apply HOT to the real-world code Gadget and the proxy application LULESH, allowing us to easily identify their bottlenecks and optimize the choice of compute architecture for them.
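HOT itself is a visualization tool; as a minimal stand-in for the "identifying suitable hardware configurations" step, the following sketch filters a set of invented accelerator configurations down to the Pareto front over predicted runtime and hardware cost. All configuration names and figures are hypothetical, and this is not HOT's algorithm:

```python
def pareto_front(configs):
    """Keep configs not dominated on (runtime, cost) - lower is better for both."""
    front = []
    for name, rt, cost in configs:
        dominated = any(r2 <= rt and c2 <= cost and (r2, c2) != (rt, cost)
                        for _, r2, c2 in configs)
        if not dominated:
            front.append(name)
    return front

# Invented predicted runtimes (s) and relative hardware costs.
configs = [
    ("gpu-small",  12.0, 1.0),   # slow but cheap
    ("gpu-medium",  7.5, 2.5),
    ("gpu-large",   5.0, 6.0),   # fast but expensive
    ("gpu-odd",     9.0, 3.0),   # dominated by gpu-medium on both axes
]
print(pareto_front(configs))  # -> ['gpu-small', 'gpu-medium', 'gpu-large']
```

The dominated configuration drops out, leaving the runtime/cost trade-off curve a user would inspect in a high-dimensional design-space view.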
"On the Exploration and Optimization of High-Dimensional Architectural Design Space." Vincent Bode, Fariz Huseynli, Martin Schreiber, C. Trinitis, M. Schulz. In Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy, published 2021-06-21. DOI: 10.1145/3452412.3462754 (https://doi.org/10.1145/3452412.3462754)
Proceedings of the 2021 on Performance EngineeRing, Modelling, Analysis, and VisualizatiOn STrategy. DOI: 10.1145/3452412 (https://doi.org/10.1145/3452412)