Tasking in Accelerators: Performance Evaluation

2019 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), Pub Date: 2019-12-01, DOI: 10.1109/PDCAT46702.2019.00034

Leonel Toledo, Antonio J. Peña, Sandra Catalán, Pedro Valero-Lara

Citations: 4

Abstract

In this work, we analyze the implications and results of implementing dynamic parallelism, concurrent kernels, and CUDA Graphs to solve task-oriented problems. As a benchmark, we propose three different methods for solving the DGEMM operation on tiled matrices, which might be the most popular benchmark for performance analysis. For the algorithms that we study, we present significant differences in terms of data dependencies, synchronization, and granularity. The main contribution of this work is determining which of these approaches works best for running multiple tasks concurrently on a single GPU, as well as stating the main limitations and benefits of each technique. Using dynamic parallelism and CUDA Streams, we achieve speedups of up to 30%, and with the CUDA Graph API, accelerations of up to 25x, outperforming state-of-the-art results.




