基于FPGA的OpenCL Base64数据解包内核评估与优化

Zheming Jin, Iris Johnson, H. Finkel
{"title":"基于FPGA的OpenCL Base64数据解包内核评估与优化","authors":"Zheming Jin, Iris Johnson, H. Finkel","doi":"10.1109/PDP2018.2018.00046","DOIUrl":null,"url":null,"abstract":"Development of applications using OpenCL targeting FPGAs is an emerging approach on heterogeneous computing systems. This paper uses the data unpacking algorithm in Base64 encoding as a case study to present programming and optimization techniques, and experimental results of the OpenCL-based implementations on an FPGA. We explain the algorithm and evaluate the performance of the kernel implementations with Intel's FPGA OpenCL SDK. The experimental results show kernel vectorization and duplication are two optimization techniques that can improve the kernel performance. The performance of kernel duplication is also closely related to the local work size. Our experiment shows 16-lane vectorization increases the bandwidth by a factor of 2 to 10 for large input data sizes. Moreover, the performance of kernel duplication using 16 compute units is 40% to 1.5% less than that of kernel vectorization depending on the input size. Tuning the local work size can improve the kernel performance by a factor of 3 to 23. For this kernel, using local memory is not an effective technique to improve the kernel performance because input data is not reused. A combination of vectorization and duplication achieves the highest performance of 12.3 GiB/s. Compared to an Intel Xeon E5 CPU and an Nvidia Tesla K80 GPU, the performance of the kernel on the Arria 10 FPGA is 6.7X faster than the CPU and 3X slower than the GPU. The performance per watt on the FPGA is 20.5X higher than the CPU and 1.19X lower than the GPU.","PeriodicalId":333367,"journal":{"name":"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA\",\"authors\":\"Zheming Jin, Iris Johnson, H. Finkel\",\"doi\":\"10.1109/PDP2018.2018.00046\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Development of applications using OpenCL targeting FPGAs is an emerging approach on heterogeneous computing systems. This paper uses the data unpacking algorithm in Base64 encoding as a case study to present programming and optimization techniques, and experimental results of the OpenCL-based implementations on an FPGA. We explain the algorithm and evaluate the performance of the kernel implementations with Intel's FPGA OpenCL SDK. The experimental results show kernel vectorization and duplication are two optimization techniques that can improve the kernel performance. The performance of kernel duplication is also closely related to the local work size. Our experiment shows 16-lane vectorization increases the bandwidth by a factor of 2 to 10 for large input data sizes. Moreover, the performance of kernel duplication using 16 compute units is 40% to 1.5% less than that of kernel vectorization depending on the input size. Tuning the local work size can improve the kernel performance by a factor of 3 to 23. For this kernel, using local memory is not an effective technique to improve the kernel performance because input data is not reused. A combination of vectorization and duplication achieves the highest performance of 12.3 GiB/s. Compared to an Intel Xeon E5 CPU and an Nvidia Tesla K80 GPU, the performance of the kernel on the Arria 10 FPGA is 6.7X faster than the CPU and 3X slower than the GPU. The performance per watt on the FPGA is 20.5X higher than the CPU and 1.19X lower than the GPU.\",\"PeriodicalId\":333367,\"journal\":{\"name\":\"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-03-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PDP2018.2018.00046\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDP2018.2018.00046","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

摘要

利用OpenCL开发针对fpga的应用程序是异构计算系统的一种新兴方法。本文以Base64编码下的数据解包算法为例,介绍了该算法的编程和优化技术,并给出了基于opencl的FPGA实现实验结果。我们解释了该算法,并利用英特尔的FPGA OpenCL SDK评估了内核实现的性能。实验结果表明,核矢量化和复制是提高核性能的两种优化技术。内核复制的性能也与本地工作大小密切相关。我们的实验表明,对于大的输入数据量,16通道矢量化将带宽提高了2到10倍。此外,根据输入大小的不同,使用16个计算单元的内核复制的性能比内核矢量化的性能低40%到1.5%。调优本地工作大小可以将内核性能提高3到23倍。对于这个内核,使用本地内存并不是提高内核性能的有效技术,因为输入数据不会被重用。矢量化和复制的组合实现了12.3 GiB/s的最高性能。与Intel至强E5 CPU和Nvidia Tesla K80 GPU相比,Arria 10 FPGA上的内核性能比CPU快6.7倍,比GPU慢3倍。FPGA的每瓦性能比CPU高20.5倍,比GPU低1.19倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Evaluating and Optimizing OpenCL Base64 Data Unpacking Kernel with FPGA
Development of applications using OpenCL targeting FPGAs is an emerging approach on heterogeneous computing systems. This paper uses the data unpacking algorithm in Base64 encoding as a case study to present programming and optimization techniques, and experimental results of the OpenCL-based implementations on an FPGA. We explain the algorithm and evaluate the performance of the kernel implementations with Intel's FPGA OpenCL SDK. The experimental results show kernel vectorization and duplication are two optimization techniques that can improve the kernel performance. The performance of kernel duplication is also closely related to the local work size. Our experiment shows 16-lane vectorization increases the bandwidth by a factor of 2 to 10 for large input data sizes. Moreover, the performance of kernel duplication using 16 compute units is 40% to 1.5% less than that of kernel vectorization depending on the input size. Tuning the local work size can improve the kernel performance by a factor of 3 to 23. For this kernel, using local memory is not an effective technique to improve the kernel performance because input data is not reused. A combination of vectorization and duplication achieves the highest performance of 12.3 GiB/s. Compared to an Intel Xeon E5 CPU and an Nvidia Tesla K80 GPU, the performance of the kernel on the Arria 10 FPGA is 6.7X faster than the CPU and 3X slower than the GPU. The performance per watt on the FPGA is 20.5X higher than the CPU and 1.19X lower than the GPU.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
TMbarrier: Speculative Barriers Using Hardware Transactional Memory Evaluating the Effect of Multi-Tenancy Patterns in Containerized Cloud-Hosted Content Management System A Generic Learning Multi-agent-System Approach for Spatio-Temporal-, Thermal- and Energy-Aware Scheduling Developing and Using a Geometric Multigrid, Unstructured Grid Mini-Application to Assess Many-Core Architectures Extending PluTo for Multiple Devices by Integrating OpenACC
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1