马拉松:FPGA覆盖noc上的静态调度无冲突路由

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) Pub Date : 2016-05-01 DOI:10.1109/FCCM.2016.47

Nachiket Kapre

{"title":"马拉松:FPGA覆盖noc上的静态调度无冲突路由","authors":"Nachiket Kapre","doi":"10.1109/FCCM.2016.47","DOIUrl":null,"url":null,"abstract":"We can improve the performance of deflection-routed FPGA overlay networks-on-chip (NoCs) like Hoplite by as much as 10× (random traffic) at the expense of modest extra storage cost when combining static scheduling with packet switching in an efficient, hybrid manner. Deflection routed bufferless NoCs such as Hoplite, allow extremely lightweight packet switched routers on FPGAs, but suffer from high packet latencies due to deflections under congestion. When the communication workload is known in advance, time-multiplexed routing can offer a faster alternative by eliminating deflections but require expensive storage of routing decisions in context buffers in LUT RAMs. In this paper, we propose a hybrid Marathon NoC that combines the low packet latencies of deflection-free time-multiplexed routing with the low implementation cost of context-free packet-switched Hoplite NoC. The Marathon NoC requires a deterministic routing function to be implemented in the switch along with time-stamped packet injection in the PEs to ensure deflection-free routing in the network. The network also needs a one-time offline static scheduling stage that determines the appropriate time to inject a packet to guarantee conflict-free deflection-free route on the shared network. For random traffic patterns, Marathon outperforms Hoplite by as much as 10× and time multiplexing by as much as 1.2× when considering total communication time at identical area costs. For other synthetic patterns, Marathon outperforms Hoplite in all cases except local pattern and is within 2 - 5× of best time multiplexing performance at large system sizes. For communication workloads extracted from real-world sparse matrix-vector multiplication kernels, Marathon outperforms both Hoplite and Time Multiplexing by 1.3 - 2.8×.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Marathon: Statically-Scheduled Conflict-Free Routing on FPGA Overlay NoCs\",\"authors\":\"Nachiket Kapre\",\"doi\":\"10.1109/FCCM.2016.47\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We can improve the performance of deflection-routed FPGA overlay networks-on-chip (NoCs) like Hoplite by as much as 10× (random traffic) at the expense of modest extra storage cost when combining static scheduling with packet switching in an efficient, hybrid manner. Deflection routed bufferless NoCs such as Hoplite, allow extremely lightweight packet switched routers on FPGAs, but suffer from high packet latencies due to deflections under congestion. When the communication workload is known in advance, time-multiplexed routing can offer a faster alternative by eliminating deflections but require expensive storage of routing decisions in context buffers in LUT RAMs. In this paper, we propose a hybrid Marathon NoC that combines the low packet latencies of deflection-free time-multiplexed routing with the low implementation cost of context-free packet-switched Hoplite NoC. The Marathon NoC requires a deterministic routing function to be implemented in the switch along with time-stamped packet injection in the PEs to ensure deflection-free routing in the network. The network also needs a one-time offline static scheduling stage that determines the appropriate time to inject a packet to guarantee conflict-free deflection-free route on the shared network. For random traffic patterns, Marathon outperforms Hoplite by as much as 10× and time multiplexing by as much as 1.2× when considering total communication time at identical area costs. For other synthetic patterns, Marathon outperforms Hoplite in all cases except local pattern and is within 2 - 5× of best time multiplexing performance at large system sizes. For communication workloads extracted from real-world sparse matrix-vector multiplication kernels, Marathon outperforms both Hoplite and Time Multiplexing by 1.3 - 2.8×.\",\"PeriodicalId\":113498,\"journal\":{\"name\":\"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)\",\"volume\":\"26 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FCCM.2016.47\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FCCM.2016.47","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

当以高效、混合的方式将静态调度与分组交换相结合时，我们可以以适度的额外存储成本为代价，将像Hoplite这样的偏折路由FPGA覆盖片上网络(noc)的性能提高多达10倍(随机流量)。偏转路由无缓冲noc，如Hoplite，允许在fpga上极其轻量级的分组交换路由器，但由于拥塞下的偏转而遭受高分组延迟。当通信工作负载事先已知时，时间复用路由可以通过消除偏移提供更快的替代方案，但需要在LUT ram的上下文缓冲区中存储昂贵的路由决策。在本文中，我们提出了一种混合马拉松NoC，它结合了无偏转时间复用路由的低数据包延迟和无上下文分组交换Hoplite NoC的低实现成本。马拉松NoC要求在交换机中实现确定性路由功能，并在pe中实现带时间戳的数据包注入，以确保网络中的路由无偏转。网络还需要一个一次性的离线静态调度阶段，该阶段决定注入数据包的适当时间，以保证共享网络上的无冲突无偏转路由。对于随机交通模式，当考虑相同区域成本下的总通信时间时，马拉松比Hoplite多出10倍，时间复用多出1.2倍。对于其他合成模式，Marathon在除局部模式外的所有情况下都优于Hoplite，并且在大型系统尺寸下的最佳时间复用性能在2 - 5倍之内。对于从现实世界的稀疏矩阵向量乘法内核中提取的通信工作负载，Marathon的性能比Hoplite和Time Multiplexing都高出1.3 - 2.8倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Marathon: Statically-Scheduled Conflict-Free Routing on FPGA Overlay NoCs

We can improve the performance of deflection-routed FPGA overlay networks-on-chip (NoCs) like Hoplite by as much as 10× (random traffic) at the expense of modest extra storage cost when combining static scheduling with packet switching in an efficient, hybrid manner. Deflection routed bufferless NoCs such as Hoplite, allow extremely lightweight packet switched routers on FPGAs, but suffer from high packet latencies due to deflections under congestion. When the communication workload is known in advance, time-multiplexed routing can offer a faster alternative by eliminating deflections but require expensive storage of routing decisions in context buffers in LUT RAMs. In this paper, we propose a hybrid Marathon NoC that combines the low packet latencies of deflection-free time-multiplexed routing with the low implementation cost of context-free packet-switched Hoplite NoC. The Marathon NoC requires a deterministic routing function to be implemented in the switch along with time-stamped packet injection in the PEs to ensure deflection-free routing in the network. The network also needs a one-time offline static scheduling stage that determines the appropriate time to inject a packet to guarantee conflict-free deflection-free route on the shared network. For random traffic patterns, Marathon outperforms Hoplite by as much as 10× and time multiplexing by as much as 1.2× when considering total communication time at identical area costs. For other synthetic patterns, Marathon outperforms Hoplite in all cases except local pattern and is within 2 - 5× of best time multiplexing performance at large system sizes. For communication workloads extracted from real-world sparse matrix-vector multiplication kernels, Marathon outperforms both Hoplite and Time Multiplexing by 1.3 - 2.8×.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

自引率

0.00%

发文量

期刊最新文献

Spatial Predicates Evaluation in the Geohash Domain Using Reconfigurable Hardware Two-Hit Filter Synthesis for Genomic Database Search Initiation Interval Aware Resource Sharing for FPGA DSP Blocks Finding Space-Time Stream Permutations for Minimum Memory and Latency Runtime Parameterizable Regular Expression Operators for Databases