gpu上的可变大小批处理条件数计算

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2018-09-01 DOI:10.1109/CAHPC.2018.8645907

H. Anzt, J. Dongarra, Goran Flegar, Thomas Grützmacher

{"title":"gpu上的可变大小批处理条件数计算","authors":"H. Anzt, J. Dongarra, Goran Flegar, Thomas Grützmacher","doi":"10.1109/CAHPC.2018.8645907","DOIUrl":null,"url":null,"abstract":"We present a kernel that is designed to quickly compute the condition number of a large collection of tiny matrices on a graphics processing unit (GPU). The matrices can differ in size and the process integrates the use of pivoting to ensure a numerically-stable matrix inversion. The performance assessment reveals that, in double precision arithmetic, the new GPU kernel achieves up to 550 GFLOPs (billions of floating-point operations per second) and 800 GFLOPs on NVIDIA's P100 and V100 GPUs, respectively. The results also demonstrate a considerable speed-up with respect to a workflow that computes the condition number via launching a set of four batched kernels. In addition, we present a variable-size batched kernel for the computation of the matrix infinity norm. We show that this memory-bound kernel achieves up to 90% of the sustainable peak bandwidth.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"261 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Variable-Size Batched Condition Number Calculation on GPUs\",\"authors\":\"H. Anzt, J. Dongarra, Goran Flegar, Thomas Grützmacher\",\"doi\":\"10.1109/CAHPC.2018.8645907\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present a kernel that is designed to quickly compute the condition number of a large collection of tiny matrices on a graphics processing unit (GPU). The matrices can differ in size and the process integrates the use of pivoting to ensure a numerically-stable matrix inversion. The performance assessment reveals that, in double precision arithmetic, the new GPU kernel achieves up to 550 GFLOPs (billions of floating-point operations per second) and 800 GFLOPs on NVIDIA's P100 and V100 GPUs, respectively. The results also demonstrate a considerable speed-up with respect to a workflow that computes the condition number via launching a set of four batched kernels. In addition, we present a variable-size batched kernel for the computation of the matrix infinity norm. We show that this memory-bound kernel achieves up to 90% of the sustainable peak bandwidth.\",\"PeriodicalId\":307747,\"journal\":{\"name\":\"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)\",\"volume\":\"261 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CAHPC.2018.8645907\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CAHPC.2018.8645907","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

我们提出了一个内核，旨在快速计算图形处理单元(GPU)上大量微小矩阵的条件数。矩阵的大小可以不同，并且该过程集成了旋转的使用，以确保数值稳定的矩阵反演。性能评估显示，在双精度运算中，新的GPU内核在NVIDIA的P100和V100 GPU上分别实现了高达550 GFLOPs(每秒数十亿次浮点运算)和800 GFLOPs。结果还表明，通过启动一组四个批处理内核来计算条件数的工作流有相当大的加速。此外，我们还提出了一种用于计算矩阵无穷范数的可变大小的批处理核。我们表明，这个内存受限的内核可以达到可持续峰值带宽的90%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Variable-Size Batched Condition Number Calculation on GPUs

We present a kernel that is designed to quickly compute the condition number of a large collection of tiny matrices on a graphics processing unit (GPU). The matrices can differ in size and the process integrates the use of pivoting to ensure a numerically-stable matrix inversion. The performance assessment reveals that, in double precision arithmetic, the new GPU kernel achieves up to 550 GFLOPs (billions of floating-point operations per second) and 800 GFLOPs on NVIDIA's P100 and V100 GPUs, respectively. The results also demonstrate a considerable speed-up with respect to a workflow that computes the condition number via launching a set of four batched kernels. In addition, we present a variable-size batched kernel for the computation of the matrix infinity norm. We show that this memory-bound kernel achieves up to 90% of the sustainable peak bandwidth.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

自引率

0.00%

发文量

期刊最新文献

Assessing Time Predictability Features of ARM Big. LITTLE Multicores Impacts of Three Soft-Fault Models on Hybrid Parallel Asynchronous Iterative Methods Predicting the Performance Impact of Increasing Memory Bandwidth for Scientific Workflows From Java to FPGA: An Experience with the Intel HARP System Copyright