Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators

Yehonatan Fridman, G. Tamir, Gal Oren
DOI: 10.48550/arXiv.2304.04276
Published in: ISC Workshops, 2023-04-09
Citations: 1

Abstract

Over the last decade, most of the increase in computing power has come from advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance on various computing tasks, exploiting them requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, has provided offloading capabilities between hosts (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs -- the Intel Ponte Vecchio Max 1100 and the NVIDIA A100 -- were released to the market, with the oneAPI and NVHPC compilers for offloading, respectively. In this work, we present early performance results of OpenMP offloading to these devices, specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware on a representative scientific mini-app (the LULESH benchmark). Our results show that coverage of version 4.5 is nearly complete in both the latest NVHPC and oneAPI tools. However, we observed a lack of support for versions 5.0, 5.1, and 5.2, which is particularly noticeable with NVHPC. From the performance perspective, we found that the PVC1100 and A100 perform comparably on the LULESH benchmark. While the A100 is slightly faster due to its higher memory bandwidth, the PVC1100 scales to the next problem size (400^3) thanks to its larger memory capacity.