基于OpenCL的fpga高性能模板计算空间和时间组合块

H. Zohouri, Artur Podobas, S. Matsuoka
{"title":"基于OpenCL的fpga高性能模板计算空间和时间组合块","authors":"H. Zohouri, Artur Podobas, S. Matsuoka","doi":"10.1145/3174243.3174248","DOIUrl":null,"url":null,"abstract":"Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieve this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without having such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arisen from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance, for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"40 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"80","resultStr":"{\"title\":\"Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL\",\"authors\":\"H. Zohouri, Artur Podobas, S. Matsuoka\",\"doi\":\"10.1145/3174243.3174248\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieve this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without having such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arisen from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance, for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.\",\"PeriodicalId\":164936,\"journal\":{\"name\":\"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"volume\":\"40 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"80\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3174243.3174248\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3174243.3174248","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 80

摘要

高级合成工具的最新发展吸引了软件程序员在fpga上加速他们的高性能计算应用。尽管已经证明FPGA可以在模板计算的性能方面与gpu竞争,但大多数先前的工作通过避免空间阻塞和限制相对于FPGA片上存储器的输入尺寸来实现这一点。在这项工作中,我们使用英特尔FPGA SDK为OpenCL创建了一个模板加速器,该加速器在没有此类限制的情况下实现了高性能。我们结合空间和时间阻塞来避免输入大小限制,并采用多个fpga特定的优化来解决由于增加的设计复杂性而产生的问题。加速器参数调整由我们的性能模型指导,我们也使用该模型来预测即将推出的英特尔Stratix 10设备的性能。在Arria 10 GX 1150设备上,我们的加速器可以分别达到760和375 GFLOP/s的计算性能,用于2D和3D模板,可与高度优化的GPU实现的性能相媲美。此外,我们估计即将推出的Stratix 10器件可以分别实现高达3.5 TFLOP/s和1.6 TFLOP/s的2D和3D模板计算性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieve this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without having such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arisen from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance, for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Architecture and Circuit Design of an All-Spintronic FPGA Session details: Session 6: High Level Synthesis 2 A FPGA Friendly Approximate Computing Framework with Hybrid Neural Networks: (Abstract Only) Software/Hardware Co-design for Multichannel Scheduling in IEEE 802.11p MLME: (Abstract Only) Session details: Special Session: Deep Learning
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1