Generating Efficient Tensor Contractions for GPUs

2015 44th International Conference on Parallel Processing. Pub Date: 2015-09-01. DOI: 10.1109/ICPP.2015.106

T. Nelson, Axel Rivera, Prasanna Balaprakash, Mary W. Hall, P. Hovland, E. Jessup, B. Norris

Citations: 37

Abstract

Many scientific and numerical applications, including quantum chemistry modeling and fluid dynamics simulation, require the evaluation of tensor products and tensor contractions. Tensor computations are characterized by arrays with numerous dimensions, inherent parallelism, moderate data reuse, and many degrees of freedom in the order in which the computation is performed. The best-performing implementation depends heavily on the tensor dimensionality and the target architecture. In this paper, we map tensor computations to GPUs, starting with a high-level tensor input language and producing efficient CUDA code as output. Our approach combines tensor-specific mathematical transformations with a GPU decision algorithm, machine learning, and autotuning of a large parameter space. Generated code shows significant performance gains over sequential and OpenMP parallel code, and a comparison with OpenACC shows the importance of autotuning and other optimizations in our framework for achieving efficient results.
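To make the "many degrees of freedom in the order" concrete, the following is a minimal sketch of a single contraction, C[i][j] = sum over k and l of A[i][k][l] * B[k][l][j], evaluated with two different loop orders. It is an illustration only: it is not the paper's input language or its generated CUDA code, and the function name `contract`, the `order` parameter, and the sample tensors are all assumptions made for this example.

```python
from itertools import product


def contract(A, B, order=("i", "j", "k", "l")):
    """Compute C[i][j] = sum_{k,l} A[i][k][l] * B[k][l][j].

    `order` is a hypothetical parameter naming the loop nest from
    outermost to innermost; any permutation of i, j, k, l is valid.
    """
    extents = {
        "i": len(A),
        "k": len(A[0]),
        "l": len(A[0][0]),
        "j": len(B[0][0]),
    }
    C = [[0] * extents["j"] for _ in range(extents["i"])]
    # itertools.product varies the last range fastest, so the first
    # name in `order` is the outermost loop.
    for values in product(*(range(extents[name]) for name in order)):
        idx = dict(zip(order, values))
        i, j, k, l = idx["i"], idx["j"], idx["k"], idx["l"]
        C[i][j] += A[i][k][l] * B[k][l][j]
    return C


# Small integer tensors, chosen so both orders give identical results.
A = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # indexed [i][k][l]
B = [[[1, 0], [0, 1]], [[1, 1], [2, 0]]]  # indexed [k][l][j]

C_ijkl = contract(A, B, ("i", "j", "k", "l"))
C_klji = contract(A, B, ("k", "l", "j", "i"))
```

Both calls produce the same tensor, but they traverse memory differently. With four indices there are already 24 possible loop orders, before any choice about which loops to map to GPU threads and blocks, which gives a sense of why the search space described in the abstract grows large.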
