Evaluation of OpenCL Performance-oriented Optimizations for Streaming Kernels on the FPGA: (Abstract Only)

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-15 DOI:10.1145/3174243.3174967

Zheming Jin, H. Finkel


Citations: 1

Abstract

Because FPGAs can run streaming applications efficiently and high-level synthesis (HLS) tools allow people without complex hardware design knowledge to evaluate an application on FPGAs, there is an opportunity and a need to understand what role OpenCL and FPGAs can play in streaming domains. To this end, we evaluate the overhead of the OpenCL infrastructure on the Nallatech 385A FPGA board, which features an Arria 10 GX1150 FPGA. We then explore the implementation space and discuss performance optimization techniques for streaming kernels using the OpenCL-to-FPGA HLS tool. On the target platform, the infrastructure overhead consumes 12% of the FPGA memory and logic resources. The latency of a single work-item kernel execution is 11 µs, and the maximum frequency of a kernel implementation is around 300 MHz. The experimental results for the streaming kernels show that FPGA resources, such as block RAMs and DSPs, can limit kernel performance before the memory bandwidth constraint takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve kernel performance by a factor of 2 to 10, and combining the two techniques achieves the best performance. For compute unit duplication to pay off, the local work size needs to be tuned: the optimal value can increase performance by a factor of 3 to 70 compared to the default value.
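To make the interplay between vectorization, compute unit duplication, and memory bandwidth concrete, the following is a minimal back-of-the-envelope sketch, not the authors' model. It assumes an idealized, fully pipelined kernel that consumes one item per lane per clock cycle. The function names, the 4-byte item size, and the 34 GB/s bandwidth figure are hypothetical placeholders; only the roughly 300 MHz clock and the 11 µs launch latency come from the abstract.

```python
def streaming_throughput_gbps(simd_width, compute_units, bytes_per_item=4,
                              fmax_mhz=300.0, mem_bw_gbps=None):
    """Ideal input rate in GB/s for a fully pipelined streaming kernel.

    Assumes every lane consumes one item per cycle (an idealization).
    If mem_bw_gbps is given, the rate is capped at that bandwidth.
    """
    lanes = simd_width * compute_units          # parallel lanes from both techniques
    demand = lanes * bytes_per_item * fmax_mhz * 1e6 / 1e9
    if mem_bw_gbps is None:
        return demand
    return min(demand, mem_bw_gbps)


def kernel_time_us(n_bytes, throughput_gbps, launch_latency_us=11.0):
    """Estimated run time in microseconds: launch latency plus streaming time."""
    streaming_us = n_bytes / (throughput_gbps * 1e9) * 1e6
    return launch_latency_us + streaming_us


if __name__ == "__main__":
    HYPOTHETICAL_BW = 34.0  # GB/s, placeholder value, not a measured board figure
    for simd, cu in [(1, 1), (8, 1), (8, 2), (16, 4)]:
        rate = streaming_throughput_gbps(simd, cu, mem_bw_gbps=HYPOTHETICAL_BW)
        print(f"simd={simd:2d} cu={cu} -> {rate:5.1f} GB/s")
```

Under these assumptions a scalar kernel demands only about 1.2 GB/s, so many lanes are needed before memory bandwidth becomes the limit. This is consistent with the abstract's observation that block RAMs and DSPs can run out first. The abstract does not say which mechanism was used to request vectorization or duplication; in Intel's OpenCL SDK for FPGAs these are typically expressed with the `num_simd_work_items` and `num_compute_units` kernel attributes.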


