Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators

Yehonatan Fridman, G. Tamir, Gal Oren
DOI: 10.48550/arXiv.2304.04276
Published in: ISC Workshops, 2023-04-09
Citations: 1

Abstract

Over the last decade, most of the increase in computing power has come from advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance on various computing tasks, exploiting them requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, has provided offloading capabilities between hosts (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs -- the Intel Ponte Vecchio Max 1100 and the NVIDIA A100 -- were released to the market, with the oneAPI and NVHPC compilers for offloading, respectively. In this work, we present early performance results of OpenMP offloading to these devices, specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware on a representative scientific mini-app (the LULESH benchmark). Our results show that coverage of version 4.5 is nearly complete in both the latest NVHPC and oneAPI tools. However, we observed a lack of support for versions 5.0, 5.1, and 5.2, which is particularly noticeable with NVHPC. From the performance perspective, we found that the PVC1100 and A100 perform comparably on the LULESH benchmark. While the A100 is slightly faster due to its higher memory bandwidth, the PVC1100 scales to the next problem size (400^3) thanks to its larger memory capacity.