Exploiting Computation Reuse for Stencil Accelerators.

Proceedings. Design Automation Conference. Pub Date: 2020-07-01. Epub Date: 2020-10-09. DOI: 10.1109/dac18072.2020.9218680

Yuze Chi, Jason Cong


Citations: 7

Abstract

Stencil kernels are an important class of kernels used extensively in many application domains. Over the years, researchers have studied optimizations for parallelization, communication reuse, and computation reuse on various target platforms. However, challenges remain, especially for the computation reuse problem on accelerators, due to the lack of complete design-space exploration and effective design-space pruning. In this paper, we present solutions to these challenges for a wide range of stencil kernels (i.e., stencils with reduction operations), where the computation reuse patterns are extremely flexible due to the commutative and associative properties of these operations. We formally define the complete design space, based on which we present a provably optimal dynamic programming algorithm and a heuristic beam search algorithm that provides near-optimal solutions under an architecture-aware model. Experimental results show that, when synthesizing stencil kernels to FPGAs, our proposed algorithm reduces look-up table (LUT) and digital signal processor (DSP) usage by 58.1% and 54.6% on average, respectively, compared with a state-of-the-art stencil compiler without computation reuse capability. This leads to an average speedup of 2.3× for compute-intensive kernels, outperforming the latest CPU/GPU results.
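To make the notion of computation reuse in a reduction stencil concrete, the following is a minimal sketch, not the paper's dynamic programming or beam search algorithm. It assumes a 3×3 box-sum stencil chosen purely for illustration; the names `AddCounter`, `box3x3_naive`, and `box3x3_reuse` are hypothetical. Because addition is commutative and associative, each window sum can be regrouped into three vertical partial sums, and those partial sums can be shared between neighbouring windows.

```python
class AddCounter:
    """Sums values and counts the two-input additions that would be needed."""

    def __init__(self):
        self.count = 0

    def total(self, values):
        values = list(values)
        self.count += len(values) - 1  # n operands need n - 1 additions
        return sum(values)


def box3x3_naive(img):
    """Reduce all nine inputs of every window independently (no reuse)."""
    h, w = len(img), len(img[0])
    adder = AddCounter()
    out = [
        [adder.total(img[i + di][j + dj] for di in range(3) for dj in range(3))
         for j in range(w - 2)]
        for i in range(h - 2)
    ]
    return out, adder.count


def box3x3_reuse(img):
    """Regroup each window into vertical partial sums shared across windows."""
    h, w = len(img), len(img[0])
    adder = AddCounter()
    # Vertical partial sums: each one is used by up to three adjacent windows.
    col = [
        [adder.total(img[i + di][j] for di in range(3)) for j in range(w)]
        for i in range(h - 2)
    ]
    # Each output combines three already-computed partial sums.
    out = [
        [adder.total(col[i][j + dj] for dj in range(3)) for j in range(w - 2)]
        for i in range(h - 2)
    ]
    return out, adder.count


if __name__ == "__main__":
    image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(6)]
    naive_out, naive_adds = box3x3_naive(image)
    reuse_out, reuse_adds = box3x3_reuse(image)
    print(naive_out == reuse_out, naive_adds, reuse_adds)
```

On a 6×6 input both versions produce the same 4×4 output, with 128 additions for the naive version and 80 for the regrouped one; for large images the per-output cost approaches 4 additions instead of 8. This regrouping is only one point in the design space that the abstract describes. Which grouping is best on an FPGA also depends on the architecture-aware cost model mentioned above, which this sketch does not model.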
