HLS Implementation of a Building Cube Stencil Computation Framework for an FPGA Accelerator
Daiki Furukawa, Taito Manabe, Yuichiro Shibata, Tomohiro Ueno, Kentaro Sano
2024 IEEE International Conference on Consumer Electronics (ICCE), pp. 1-6, 2024-01-06
DOI: 10.1109/ICCE59016.2024.10444277 (https://doi.org/10.1109/ICCE59016.2024.10444277)
Abstract
FPGAs are promising energy-efficient accelerators for compute-intensive applications such as electromagnetic field simulation, which is also an important task in consumer product design. Stencil computation in particular, a computing pattern common in scientific and engineering simulations, is known to map well to FPGAs. Practical simulations often use data reduction methods, such as the building cube method (BCM), to balance computational accuracy and speed. However, such techniques tend to introduce irregular memory access patterns, which makes it difficult for application programmers to implement efficient memory access hardware on FPGAs. In this paper, we propose a design framework for stencil computation with BCM that lets application programmers focus on implementing the algorithm without having to handle memory access optimization themselves. We implement the framework on an Intel FPGA PAC D5005 platform and evaluate it in terms of resource utilization, execution time, and throughput. The area overhead of the proposed BCM framework is small, leaving sufficient resources for user applications. The measured throughput of the BCM framework was more than 90% lower than that of non-BCM execution because of the irregular memory access patterns. However, because BCM significantly reduces the number of cells to be computed, overall computation was up to 28 times faster, indicating that the throughput reduction is acceptable.
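The trade-off between lower throughput and fewer cells can be illustrated with a small sketch. It assumes that throughput is measured in cells updated per second and that execution time is simply the cell count divided by throughput; both are simplifying assumptions, not the paper's stated model. The function names and all numbers are hypothetical and are not taken from the paper's measurements.

```python
def execution_time(cells: int, throughput: float) -> float:
    """Seconds needed to update `cells` cells at `throughput` cells per second."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return cells / throughput


def net_speedup(cells_full: int, cells_bcm: int,
                thr_full: float, thr_bcm: float) -> float:
    """Ratio of non-BCM execution time to BCM execution time."""
    return execution_time(cells_full, thr_full) / execution_time(cells_bcm, thr_bcm)


# Hypothetical example: BCM keeps 1% of the cells but only 10% of the throughput.
speedup = net_speedup(
    cells_full=1_000_000, cells_bcm=10_000,
    thr_full=1.0e9, thr_bcm=1.0e8,
)
print(f"net speedup: {speedup:.1f}x")  # roughly 10x under these assumed numbers
```

Under this simple model the net speedup is the cell reduction factor multiplied by the throughput ratio, so a large drop in throughput still pays off whenever the cell count shrinks by a larger factor.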