Understanding Performance Differences of FPGAs and GPUs: (Abstract Only)

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2018-02-15 DOI:10.1145/3174243.3174970

J. Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, Shaochong Zhang


Citations: 9

Abstract



The notorious power wall has significantly limited the scaling of general-purpose processors. To address this issue, various accelerators, such as GPUs and FPGAs, have emerged to achieve better performance and energy efficiency. Between these two programmable accelerators, a natural question arises: which applications are better suited to FPGAs, which to GPUs, and why? In this paper, our goal is to better understand the performance differences between FPGAs and GPUs and to provide more insights to the community. We intentionally start with Rodinia, a widely used GPU-friendly benchmark suite, and port 11 of its benchmarks (15 kernels) onto FPGAs using the more portable and programmable high-level synthesis (HLS) C. We provide a simple five-step strategy for FPGA accelerator design that can be easily understood and mastered by software programmers, and present a quantitative performance breakdown of each step. We then propose a set of performance metrics, including normalized operations per cycle (OPC_norm) for each pipeline and effective parallel factor (effective_para_factor), to compare the performance of GPU and FPGA accelerator designs. We find that for 6 of the 15 kernels, today's FPGAs can provide comparable or even better performance while consuming only about 1/10 of the GPUs' power (both on the same technology node). We observe that FPGAs have higher OPC_norm in most kernels, thanks to their customized deep pipelines, but lower effective_para_factor, due to far lower memory bandwidth than GPUs. Future FPGAs should increase their off-chip bandwidth and clock frequency to catch up with GPUs.
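The abstract names OPC_norm and effective_para_factor but does not define them, so the following is only a rough sketch of how such metrics might combine into a throughput estimate. The multiplicative model (frequency × OPC_norm × effective_para_factor), the `Accelerator` class, the helper functions, and every number below are assumptions made for illustration. They are not the authors' formulas or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Toy description of one accelerator design (all values are invented)."""
    name: str
    freq_mhz: float               # clock frequency in MHz
    opc_norm: float               # normalized operations per cycle, per pipeline
    effective_para_factor: float  # number of pipelines kept busy in practice
    power_w: float                # power draw in watts


def throughput_mops(acc: Accelerator) -> float:
    """Assumed model: Mops/s = frequency * ops per cycle * busy pipelines."""
    return acc.freq_mhz * acc.opc_norm * acc.effective_para_factor


def mops_per_watt(acc: Accelerator) -> float:
    """Energy efficiency under the same assumed model."""
    return throughput_mops(acc) / acc.power_w


# Hypothetical designs: the FPGA has a deeper custom pipeline (higher OPC_norm)
# but a lower clock and fewer effectively busy pipelines; it draws 1/10 the power.
fpga = Accelerator("fpga", freq_mhz=200.0, opc_norm=8.0,
                   effective_para_factor=16.0, power_w=25.0)
gpu = Accelerator("gpu", freq_mhz=1000.0, opc_norm=1.0,
                  effective_para_factor=32.0, power_w=250.0)

for acc in (fpga, gpu):
    print(acc.name, throughput_mops(acc), mops_per_watt(acc))
```

With these invented inputs the FPGA reaches comparable throughput to the GPU while being several times more efficient per watt, which is the kind of trade-off the abstract describes. Raising `freq_mhz` or `effective_para_factor` for the FPGA in this toy model corresponds to the abstract's suggestion that future FPGAs need higher clock frequency and more off-chip bandwidth.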


