Portable and Efficient Dense Linear Algebra in the Beginning of the Exascale Era

2022 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). Pub Date: 2022-11-01. DOI: 10.1109/P3HPC56579.2022.00009

M. Gates, A. YarKhan, D. Sukkari, Kadir Akbudak, S. Cayrols, Daniel Bielich, A. Abdelfattah, Mohammed Al Farhan, J. Dongarra

Abstract

The SLATE project is implementing a distributed dense linear algebra library for highly scalable, distributed-memory, accelerator-based computer systems. The goal is to provide a library that can be easily ported to different hardware (CPUs, GPUs, accelerators) and that will deliver high performance on future machines. Current ports include CPUs, CUDA, ROCm, and oneAPI. We achieve both performance and portability by leveraging several layers and abstractions, including OpenMP tasks to track data dependencies, MPI for distributed communication, and the BLAS++ and LAPACK++ libraries, developed as a portable layer across vendor-optimized CPU and GPU BLAS and LAPACK functionality. We rely on the C++ standard library and templating to reduce code duplication and improve maintainability. The few kernels not present in BLAS are implemented in CUDA, HIP, and OpenMP target offload, and are easily ported to new platforms.
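
To make two of the ideas in the abstract concrete, the following is a minimal sketch, not SLATE's actual code, of a tiled matrix multiply in which OpenMP task `depend` clauses track data dependencies between tiles and a single C++ template covers several precisions. The names `tile_gemm` and `tiled_gemm`, the tile-contiguous storage layout, and the naive inner kernel are assumptions made for illustration; a real library would call a vendor-optimized gemm through a portable layer instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a SLATE or BLAS++ routine): naive C += A * B on
// one nb-by-nb row-major tile. A real library would call a vendor BLAS gemm.
template <typename T>
void tile_gemm(std::size_t nb, const T* A, const T* B, T* C) {
    for (std::size_t i = 0; i < nb; ++i)
        for (std::size_t k = 0; k < nb; ++k)
            for (std::size_t j = 0; j < nb; ++j)
                C[i * nb + j] += A[i * nb + k] * B[k * nb + j];
}

// Tiled C += A * B. Each matrix is an nt-by-nt grid of nb-by-nb tiles, and
// tile (i, j) is stored contiguously at offset (i * nt + j) * nb * nb
// (an assumed layout chosen to keep the sketch short).
template <typename T>
void tiled_gemm(std::size_t nt, std::size_t nb,
                const std::vector<T>& A, const std::vector<T>& B,
                std::vector<T>& C) {
    const std::size_t tile_size = nb * nb;
    assert(A.size() == nt * nt * tile_size);
    assert(B.size() == nt * nt * tile_size);
    assert(C.size() == nt * nt * tile_size);

    const T* a = A.data();
    const T* b = B.data();
    T* c = C.data();

    // One thread creates the tasks; the OpenMP runtime schedules them.
    // Without OpenMP enabled, the pragmas are ignored and the loops run
    // serially with the same result.
    #pragma omp parallel
    #pragma omp single
    for (std::size_t i = 0; i < nt; ++i) {
        for (std::size_t j = 0; j < nt; ++j) {
            for (std::size_t k = 0; k < nt; ++k) {
                const T* a_ik = a + (i * nt + k) * tile_size;
                const T* b_kj = b + (k * nt + j) * tile_size;
                T* c_ij = c + (i * nt + j) * tile_size;
                // The first element of each tile stands in for the whole
                // tile in the dependency list. Updates to the same C tile
                // are serialized; different C tiles may run concurrently.
                #pragma omp task depend(in: a_ik[0], b_kj[0]) depend(inout: c_ij[0])
                tile_gemm(nb, a_ik, b_kj, c_ij);
            }
        }
    }
}
```

Calling `tiled_gemm` with `std::vector<float>` or `std::vector<double>` arguments instantiates the same source for either precision, which is the kind of code-duplication saving the abstract attributes to templating. Compiling with an OpenMP flag such as `-fopenmp` enables the task parallelism.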
