Yu-Ting Chen, J. Cong, Zhenman Fang, Jie Lei, Peng Wei
FPGA-enabled datacenters have shown great potential for providing performance and energy efficiency improvement, and captured a great amount of attention from both academia and industry. In this paper we aim to answer one key question: how can we efficiently integrate FPGAs into state-of-the-art big-data computing frameworks? Although very important, this problem has not been well studied, especially for the integration of fine-grained FPGA accelerators that have short execution time but will be invoked many times. To provide a generalized methodology and insight for efficient integration, we conduct an in-depth analysis of challenges and corresponding solutions of integration at single-thread, single-node multi-thread, and multi-node levels. With a step-by-step case study for the next-generation DNA sequencing application, we demonstrate how a straightforward integration with 1000x slowdown can be tuned into an efficient integration with 2.6x overall system speedup and 2.4x energy efficiency improvement.
When Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration. Yu-Ting Chen, J. Cong, Zhenman Fang, Jie Lei, Peng Wei. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.18, https://doi.org/10.1109/FCCM.2016.18
Computing the forces derived from long-range electrostatics is a critical application and also a central part of Molecular Dynamics. Part of that computation, the transformation of a charge grid to a potential grid via a 3D FFT, has received some attention recently and has been found to work extremely well on FPGAs. Here we report on the rest of the computation, which consists of two mappings: charges onto a grid and a potential grid onto the particles. These mappings are interesting in their own right as they are far more compute intensive than the FFTs; each is typically done using tricubic interpolation. We believe that these mappings have been studied only once previously for FPGAs and then found to be exorbitantly expensive; i.e., only bicubic would fit on the chip. In the current work we find that, when using the Altera Arria 10, not only do both mappings fit, but so does an appropriately sized 3D FFT. This enables the building of a balanced accelerator for the entire long-range electrostatics computation on a single FPGA. This design scales directly to FPGA clusters. Other contributions include a new mapping scheme based on table lookup and a measure of the utility of the floating point support of the Arria 10.
FPGA-Accelerated Particle-Grid Mapping. A. Sanaullah, Arash Khoshparvar, M. Herbordt. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.53, https://doi.org/10.1109/FCCM.2016.53
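The charge-to-grid mapping described in the abstract above can be illustrated in software. The following is a minimal sketch, assuming cubic B-spline weights as the tricubic basis, a periodic cubic grid with unit spacing, and the hypothetical names `cubic_bspline_weights` and `spread_charges`; the abstract specifies none of these, and this is not the paper's FPGA design or its table-lookup scheme.

```python
import math


def cubic_bspline_weights(u):
    """Weights of the 4 nearest grid points for a fractional offset u in [0, 1).

    Assumption: cubic B-spline basis. The four weights always sum to 1.
    """
    return (
        (1 - u) ** 3 / 6,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6,
        u ** 3 / 6,
    )


def spread_charges(particles, n):
    """Spread point charges onto an n x n x n periodic grid.

    particles: iterable of (x, y, z, q) with coordinates in grid units.
    Each particle touches 4 x 4 x 4 = 64 grid points (tensor product of
    the 1D weights), which is why this step is compute intensive.
    """
    grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for x, y, z, q in particles:
        bases, weights = [], []
        for c in (x, y, z):
            cell = math.floor(c)
            bases.append(cell - 1)  # leftmost of the 4 grid points on this axis
            weights.append(cubic_bspline_weights(c - cell))
        for i, wx in enumerate(weights[0]):
            for j, wy in enumerate(weights[1]):
                for k, wz in enumerate(weights[2]):
                    gi = (bases[0] + i) % n  # periodic wrap-around
                    gj = (bases[1] + j) % n
                    gk = (bases[2] + k) % n
                    grid[gi][gj][gk] += q * wx * wy * wz
    return grid
```

Because the weights on each axis sum to 1, the total charge on the grid equals the total particle charge, which gives a simple sanity check for any implementation of this step.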
K. Krommydas, A. Helal, Anshuman Verma, Wu-chun Feng
For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA developers use a hardware description language (HDL) to implement the application data path and to design hardware modules for computation pipelines, memory management, synchronization, and communication. This process requires extensive low-level knowledge of the target FPGA architecture and consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. However, this significantly improved programmability can come at the expense of performance; that is, a performance-programmability gap still remains. To improve the performance of OpenCL kernels on FPGAs, and thus bridge the performance-programmability gap, we apply and evaluate the effect of various optimization techniques on GEM, an N-body method from the OpenDwarfs benchmark suite.
Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs. K. Krommydas, A. Helal, Anshuman Verma, Wu-chun Feng. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.56, https://doi.org/10.1109/FCCM.2016.56
Recent work has shown how multi-ported RAMs can be built out of dual-ported RAMs. Such techniques combine two structures: a set of "data banks" to hold the data, and a method for selecting the bank containing the last-written data, often called a live-value table (LVT). Most previous work has focused on the design of the LVT to reduce area and improve performance. In this paper, we instead reduce area by optimizing the design of the "data banks" portion. The optimization is embedded into a memory compiler that solves a set cover problem. When the set cover problem is solved optimally, the data banks use minimum area. Our technique applies to multi-ported RAMs that have a structural pattern we describe as "switched ports". Switched ports are a generalization of true ports, where a certain number of write ports can be dynamically switched into a possibly different number of read ports using one common read/write control signal. Furthermore, a given application may have multiple sets, each set with a different read/write control. While previous work generates multi-port RAM solutions that contain only true ports, or only simple ports, we contend that using only these two models is too limiting and prevents optimizations from being applied. Experimental results on 10 random instances of multi-port RAMs show 17% BRAM reduction on average compared to the best of other approaches. The compiler and a fully parameterized Verilog implementation are released as an open source library. The library has been extensively tested using Altera's EDA tools.
A Multi-ported Memory Compiler Utilizing True Dual-Port BRAMs. Ameer Abdelhadi, G. Lemieux. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.45, https://doi.org/10.1109/FCCM.2016.45
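The generic "data banks plus live-value table" structure mentioned at the start of the abstract above can be modelled behaviorally in a few lines. This is only a sketch of that background idea, assuming one data bank per write port and the hypothetical class name `LVTMemory`; it does not model the paper's switched ports, its set-cover optimization, or its Verilog library.

```python
class LVTMemory:
    """Behavioral model of a multi-write-port RAM built from per-port banks."""

    def __init__(self, depth, num_write_ports):
        # One data bank per write port. In hardware each bank would be a
        # dual-ported block RAM (and would be replicated per read port).
        self.banks = [[0] * depth for _ in range(num_write_ports)]
        # Live-value table: for each address, the index of the bank that
        # holds the most recently written value.
        self.lvt = [0] * depth

    def write(self, port, addr, value):
        # Each write port only ever writes its own bank, so no bank needs
        # more than one write port.
        self.banks[port][addr] = value
        self.lvt[addr] = port

    def read(self, addr):
        # The LVT selects which bank holds the live value for this address.
        return self.banks[self.lvt[addr]][addr]


mem = LVTMemory(depth=8, num_write_ports=2)
mem.write(0, 3, 11)  # port 0 writes address 3
mem.write(1, 3, 22)  # port 1 overwrites the same address later
print(mem.read(3))   # the LVT points at bank 1, so this reads 22
```

The model makes the area trade-off visible: storage grows with the number of banks, which is the portion of the design that the abstract says the compiler tries to minimize.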
D. Richmond, Jeremy Blackstone, Matthew Hogains, Kevin Thai, R. Kastner
Tools for C/C++-based hardware development have grown in popularity in recent years. However, the impact of these tools has been limited by their lack of support for integration with vendor IP, external memories, and communication peripherals. In this paper we introduce Tinker, an open-source Board Support Package generator for Altera's OpenCL Compiler. Board Support Packages define memory, communication, and IP ports for easy integration with high-level synthesis cores. Tinker abstracts the low-level details of hardware development when creating board support packages and greatly increases the flexibility of OpenCL development. Tinker currently generates custom memory architectures from user specifications. We use our tool to generate a variety of architectures and apply them to two application kernels.
Tinker: Generating Custom Memory Architectures for Altera's OpenCL Compiler. D. Richmond, Jeremy Blackstone, Matthew Hogains, Kevin Thai, R. Kastner. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.13, https://doi.org/10.1109/FCCM.2016.13
Integrated micro-fluidic (MF) cooling is a promising technique to solve the thermal problems in 3D FPGAs [1] (as shown in Figure 1). However, this cooling method has some non-ideal properties, such as non-uniform heat removal capacity along the flow direction. Existing 3D FPGA placement and routing (P&R) tools are unaware of micro-fluidic cooling, which leads to large on-chip temperature variation that is harmful to the reliability of 3D FPGAs. In this paper we demonstrate that we can incorporate micro-fluidic cooling considerations into existing 3D FPGA P&R tools simply with a cooling-aware Engineering Change Order (ECO) based placement framework. Taking the placement result of an existing P&R tool, the framework modifies the node positions to improve the on-chip temperature uniformity, accounting for fluidic cooling structures. Hence we do not need to invest in a standalone fluidic-cooling-aware 3D FPGA CAD framework.
ECO Based Placement and Routing Framework for 3D FPGAs with Micro-fluidic Cooling. Zhiyuan Yang, Caleb Serafy, Ankur Srivastava. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.57, https://doi.org/10.1109/FCCM.2016.57
Tianqi Wang, Xi Jin, Bo Peng, Chuanjun Wang, Linlin Zheng
We propose a heterogeneous multi-FPGA accelerating solution, called RP-ring (Reconfigurable Processor ring), for direct-summation N-body simulation. In this solution, we use existing FPGA boards rather than designing new specialized boards, in order to reduce cost. It can be expanded conveniently with any available FPGA board and requires only low communication bandwidth between FPGA boards. The communication protocol is simple and can be implemented with limited hardware/software resources. In order to prevent the slowest board from dragging the overall performance down, we build a mathematical model to decompose the workload among FPGAs. The model divides the workload based on the logic resources, memory access bandwidth and communication bandwidth of each FPGA chip. We apply the solution to astrodynamics simulation and achieve two orders of magnitude speedup compared with CPU implementations.
RP-Ring: A Heterogeneous Multi-FPGA Accelerating Solution for N-Body Simulations. Tianqi Wang, Xi Jin, Bo Peng, Chuanjun Wang, Linlin Zheng. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.20, https://doi.org/10.1109/FCCM.2016.20
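The idea of dividing work so that the slowest board does not hold back the ring can be sketched as a proportional split. The abstract above does not give the paper's mathematical model, so the following assumes a simple bottleneck model in which each board's sustainable rate is the minimum of its compute, memory and link rates; the function name `split_workload` and the rate tuples are hypothetical.

```python
def split_workload(total_items, boards):
    """Split total_items across boards in proportion to their bottleneck rate.

    boards: list of (compute_rate, memory_rate, link_rate) tuples, each in
    items per second. Assumption: a board runs at the slowest of the three.
    Returns a list of integer item counts that sums to total_items.
    """
    rates = [min(board) for board in boards]
    total_rate = sum(rates)
    # Ideal (fractional) share: every board would finish at the same time.
    shares = [total_items * rate / total_rate for rate in rates]
    counts = [int(share) for share in shares]
    # Hand the leftover items to the boards with the largest remainders.
    leftover = total_items - sum(counts)
    by_remainder = sorted(
        range(len(boards)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Three boards; the third is limited by its memory rate (100 items/s).
print(split_workload(1000, [(300, 400, 500), (100, 200, 150), (900, 100, 800)]))
# prints [600, 200, 200]
```

With this split every board needs the same time (here 2 seconds) to finish its share, which is the balancing goal the abstract describes.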
Soft processors have a role to play in easing the difficulty of designing applications into FPGAs for two reasons: first, they can be deployed only when needed, unlike permanent on-die hard processors. Second, for the portions of an application that can function sufficiently fast on a soft processor, it is far easier to write and debug single-threaded software code than to create hardware. The breadth of this second role increases when the performance of the soft processor increases, yet there has been little progress in the performance of soft processors since their commercial inception -- in particular, the sophisticated out-of-order superscalar approaches that arrived in the mid 1990s are not employed, despite the fact that their area cost is now easily tolerable. In this paper we take an important step towards out-of-order execution in soft processors by exploring instruction scheduling in an FPGA substrate. This differs from the hard-processor design problem because the logic substrate is restricted to LUTs, whereas hard processor scheduling circuits employ CAM and wired-OR structures to great benefit. We discuss both circuit and microarchitectural trade-offs, and compare three circuit structures for the scheduler, including a new structure called a fused-logic matrix scheduler. With this circuit, large schedulers up to 40 entries can be built with the same cycle time as the commercial Nios II/f soft processor (240 MHz). This careful design has the potential to significantly increase both the IPC and raw compute performance of a soft processor, compared to current commercial soft processors.
High Performance Instruction Scheduling Circuits for Out-of-Order Soft Processors. Henry Wong, Vaughn Betz, Jonathan Rose. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.11, https://doi.org/10.1109/FCCM.2016.11
Eddie Hung, James J. Davis, Joshua M. Levine, Edward A. Stott, P. Cheung, G. Constantinides
In a modern FPGA system-on-chip design, it is often insufficient to simply assess the total power consumption of the entire circuit by design-time estimation or runtime power rail measurement. Instead, to make better runtime decisions, it is desirable to understand the power consumed by each individual module in the system. In this work, we combine board-level power measurements with register-level activity counting to build an online model that produces a breakdown of power consumption within the design. Online model refinement avoids the need for a time-consuming characterisation stage and also allows the model to track long-term changes to operating conditions. Our flow is named KAPow, a (loose) acronym for 'K'ounting Activity for Power estimation, which we show to be accurate, with per-module power estimates as close as ±5 mW to true measurements, and to have low overheads. We also demonstrate an application example in which a per-module power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by over 8%.
KAPow: A System Identification Approach to Online Per-Module Power Estimation in FPGA Designs. Eddie Hung, James J. Davis, Joshua M. Levine, Edward A. Stott, P. Cheung, G. Constantinides. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.25, https://doi.org/10.1109/FCCM.2016.25
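Building an online model from total power measurements and per-module activity counts can be illustrated with recursive least squares, a common system identification estimator. The abstract above does not say which estimator the flow uses, so this is a sketch under that assumption: total power is modelled as a weighted sum of activity counts plus a constant term, and the class name `RecursiveLeastSquares` and the numbers in the usage example are hypothetical.

```python
class RecursiveLeastSquares:
    """Online fit of y ~ sum(w[i] * x[i]), refined one sample at a time."""

    def __init__(self, n, delta=1e4, forgetting=1.0):
        self.w = [0.0] * n  # current weight estimates
        # Inverse-covariance estimate; a large delta means a weak prior.
        self.P = [[delta if i == j else 0.0 for j in range(n)] for i in range(n)]
        self.lam = forgetting  # values below 1.0 track slowly changing conditions

    def update(self, x, y):
        """Refine the weights with one (activity vector, measured power) sample."""
        n = len(x)
        Px = [sum(self.P[i][j] * x[j] for j in range(n)) for i in range(n)]
        denom = self.lam + sum(x[i] * Px[i] for i in range(n))
        gain = [v / denom for v in Px]
        err = y - sum(self.w[i] * x[i] for i in range(n))  # prediction error
        self.w = [self.w[i] + gain[i] * err for i in range(n)]
        # P stays symmetric, so x^T P equals Px transposed.
        self.P = [
            [(self.P[i][j] - gain[i] * Px[j]) / self.lam for j in range(n)]
            for i in range(n)
        ]
        return err


# Hypothetical example: two modules plus a constant (static power) term.
true_w = [2.0, 0.5, 1.0]
rls = RecursiveLeastSquares(3)
for t in range(50):
    x = [float((t * 7) % 11), float((t * 3) % 5), 1.0]  # activity counts, bias
    y = sum(wi * xi for wi, xi in zip(true_w, x))       # measured total power
    rls.update(x, y)
per_module = [rls.w[i] * x[i] for i in range(2)]  # breakdown for the last sample
```

Once the weights have converged, multiplying each weight by its module's activity count gives the per-module share of the measured total, and continuing to call `update` keeps the model current without a separate characterisation stage.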
We can improve the performance of deflection-routed FPGA overlay networks-on-chip (NoCs) like Hoplite by as much as 10× (random traffic) at the expense of modest extra storage cost when combining static scheduling with packet switching in an efficient, hybrid manner. Deflection-routed bufferless NoCs such as Hoplite allow extremely lightweight packet-switched routers on FPGAs, but suffer from high packet latencies due to deflections under congestion. When the communication workload is known in advance, time-multiplexed routing can offer a faster alternative by eliminating deflections, but requires expensive storage of routing decisions in context buffers in LUT RAMs. In this paper, we propose a hybrid Marathon NoC that combines the low packet latencies of deflection-free time-multiplexed routing with the low implementation cost of the context-free packet-switched Hoplite NoC. The Marathon NoC requires a deterministic routing function to be implemented in the switch along with time-stamped packet injection in the PEs to ensure deflection-free routing in the network. The network also needs a one-time offline static scheduling stage that determines the appropriate time to inject a packet to guarantee a conflict-free, deflection-free route on the shared network. For random traffic patterns, Marathon outperforms Hoplite by as much as 10× and time multiplexing by as much as 1.2× when considering total communication time at identical area costs. For other synthetic patterns, Marathon outperforms Hoplite in all cases except the local pattern and is within 2-5× of the best time-multiplexing performance at large system sizes. For communication workloads extracted from real-world sparse matrix-vector multiplication kernels, Marathon outperforms both Hoplite and time multiplexing by 1.3-2.8×.
Marathon: Statically-Scheduled Conflict-Free Routing on FPGA Overlay NoCs. Nachiket Kapre. 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016-05-01. DOI: 10.1109/FCCM.2016.47, https://doi.org/10.1109/FCCM.2016.47