HLS Implementation of a Building Cube Stencil Computation Framework for an FPGA Accelerator

2024 IEEE International Conference on Consumer Electronics (ICCE). Pub Date: 2024-01-06. DOI: 10.1109/ICCE59016.2024.10444277

Daiki Furukawa, Taito Manabe, Yuichiro Shibata, Tomohiro Ueno, Kentaro Sano


Abstract

FPGAs are promising energy-efficient accelerators for compute-intensive applications such as electromagnetic field simulations, which are also important in consumer product design. In particular, stencil computation, a commonly used computing pattern in scientific and engineering simulations, is known to map well onto FPGAs. In practical simulations, data reduction methods such as the building cube method (BCM) are often used to balance computation accuracy and speed. However, such techniques tend to introduce irregular memory access patterns, which makes it difficult for application programmers to implement efficient memory access hardware units in FPGAs. In this paper, we propose a design framework for stencil computation with BCM that lets application programmers focus on algorithm implementation without having to deal with memory access optimization. We implement the framework on an Intel FPGA PAC D5005 platform and evaluate its effectiveness in terms of resource utilization, execution time, and throughput. The area overhead of the proposed BCM framework was confirmed to be small, leaving sufficient resources for user applications. The performance evaluation showed that the measured throughput of the BCM framework was more than 90% lower than that of non-BCM execution because of irregular memory access patterns. However, since BCM significantly reduces the number of cells to be computed, the overall computation was up to 28 times faster, indicating that the reduction in throughput is acceptable.
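The trade-off in the last two sentences can be made concrete with a small arithmetic sketch. It assumes that execution time is simply the number of cells divided by the sustained throughput, so the overall speedup is the cell-reduction factor multiplied by the throughput ratio. The cell counts and throughput values below are hypothetical and are not taken from the paper; they were chosen only so that a 90% throughput loss combined with a 280-fold cell reduction yields a speedup of about 28.

```python
def execution_time(num_cells: int, throughput_cells_per_sec: float) -> float:
    """Time to update every cell once at a given sustained throughput."""
    return num_cells / throughput_cells_per_sec


def speedup(cell_reduction: float, throughput_ratio: float) -> float:
    """Overall speedup of BCM execution over non-BCM execution.

    cell_reduction: non-BCM cell count divided by BCM cell count.
    throughput_ratio: BCM throughput divided by non-BCM throughput.
    """
    return cell_reduction * throughput_ratio


# Hypothetical values for illustration only (not from the paper).
uniform_cells = 280_000_000      # cells on a uniform grid
bcm_cells = 1_000_000            # cells remaining after BCM reduction
uniform_throughput = 1.0e9       # cells per second with regular access
bcm_throughput = 1.0e8           # 90% lower because of irregular access

t_uniform = execution_time(uniform_cells, uniform_throughput)
t_bcm = execution_time(bcm_cells, bcm_throughput)

# Both routes give the same overall speedup, up to floating-point rounding.
measured = t_uniform / t_bcm
predicted = speedup(uniform_cells / bcm_cells, bcm_throughput / uniform_throughput)
print(f"speedup from times: {measured:.1f}, from ratios: {predicted:.1f}")
```

Under this simple model, a throughput loss is acceptable whenever the cell-reduction factor exceeds the inverse of the throughput ratio, which is the sense in which the abstract calls the reduction in throughput acceptable.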


