Design and scalability analysis of bandwidth-compressed stream computing with multiple FPGAs

Antoniette Mondigo, Tomohiro Ueno, Daichi Tanaka, K. Sano, S. Yamamoto
Published in: 2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC)
Publication date: 2017-07-01
DOI: 10.1109/ReCoSoC.2017.8016148 (https://doi.org/10.1109/ReCoSoC.2017.8016148)
Citations: 10

Abstract

Stream computing on Field Programmable Gate Arrays (FPGAs) is seen as a promising way to deliver the performance and energy efficiency required by compute-intensive applications such as numerical simulations. The inherent structure and customizability of FPGAs make them well suited to highly scalable computing designs. This paper presents a scalable custom computing approach that exploits temporal parallelism by increasing the depth of a computing pipeline across a 1D ring of cascaded FPGAs connected by high-speed, low-latency communication links. Spatial parallelism is also explored by replicating the computing core inside the FPGAs to further increase throughput. Because inter-FPGA communication bandwidth is limited, a hardware-based lossless bandwidth compression scheme is used to alleviate this bottleneck and transfer more data streams. A performance model is presented for scalability analysis and performance estimation. For evaluation and verification, an actual numerical simulation was implemented on an Intel Arria 10 FPGA with spatially parallel computing cores. Initial results show that the measured performance is close to the values predicted by the performance model. It is also demonstrated that the 1D ring topology of multiple FPGAs with bandwidth-compressed links can scale performance when a sufficiently large data set is computed, even with a deeper pipeline and insufficient inter-FPGA bandwidth.
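
The scaling argument in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the performance model from the paper, whose equations are not reproduced here; it is a simplified illustration under stated assumptions. It assumes that the sustained stream rate is the smaller of the compute rate and the compressed link rate, that each pipeline stage advances every cell by one time step, and that pipeline fill latency grows linearly with total pipeline depth. All names (`RingConfig`, `sustained_cells_per_s`, `cell_updates_per_s`) and all parameter values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RingConfig:
    """Hypothetical parameters for a 1D ring of cascaded FPGAs (all assumed)."""
    fpgas: int                  # number of FPGAs cascaded in the ring
    stages_per_fpga: int        # pipeline stages (time steps) per FPGA
    lanes: int                  # spatially replicated cores = cells per cycle
    clock_hz: float             # pipeline clock frequency
    bytes_per_cell: int         # stream payload per cell
    link_bytes_per_s: float     # raw inter-FPGA link bandwidth
    compression_ratio: float    # original size / compressed size, >= 1.0
    fill_cycles_per_stage: int  # assumed latency of one stage, in cycles


def sustained_cells_per_s(cfg: RingConfig) -> float:
    """Stream rate limited by either the cores or the compressed link."""
    compute_rate = cfg.lanes * cfg.clock_hz
    # Lossless compression lets the same link carry more cells per second.
    link_rate = cfg.link_bytes_per_s * cfg.compression_ratio / cfg.bytes_per_cell
    return min(compute_rate, link_rate)


def cell_updates_per_s(cfg: RingConfig, cells: int) -> float:
    """Cell updates per second for one pass of `cells` cells through the ring."""
    depth = cfg.fpgas * cfg.stages_per_fpga  # time steps computed per pass
    stream_time = cells / sustained_cells_per_s(cfg)
    fill_time = depth * cfg.fill_cycles_per_stage / cfg.clock_hz
    return cells * depth / (stream_time + fill_time)


if __name__ == "__main__":
    one = RingConfig(fpgas=1, stages_per_fpga=8, lanes=4, clock_hz=200e6,
                     bytes_per_cell=4, link_bytes_per_s=2e9,
                     compression_ratio=2.0, fill_cycles_per_stage=1000)
    four = replace(one, fpgas=4)
    for cells in (1_000, 1_000_000, 1_000_000_000):
        speedup = cell_updates_per_s(four, cells) / cell_updates_per_s(one, cells)
        print(f"{cells:>13,d} cells: speedup with 4 FPGAs = {speedup:.2f}x")
```

In this toy model the speedup from adding FPGAs approaches the number of FPGAs only when the stream is long enough to amortize the pipeline fill time, which mirrors the abstract's condition of "a sufficiently large data set". Raising `compression_ratio` helps only while the link, not the cores, is the limiting rate.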