Exploiting partially defective LUTs: Why you don't need perfect fabrication
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718323
A. DeHon, Nikil Mehta
Shrinking integrated circuit feature sizes lead to increased variation and higher defect rates. Prior work has shown how to tolerate the failure of entire LUTs and how to tolerate failures and high variation in interconnect. We show how to use LUTs even when they are partially defective - a form of fine-grained defect tolerance. We characterize the defect tolerance of a range of mapping strategies for defective LUTs, including LUT swapping within a cluster, input permutation, input polarity selection, defect-aware packing, and defect-aware placement. By tolerating partially defective LUTs, we show that, even without allocating dedicated spare LUTs, it is possible to achieve near-perfect yield with cluster-local remapping when roughly 1% of the LUT multiplexers fail to switch. With full, defect-aware placement, this rises to 10-25% at the cost of just a few extra rows and columns. In contrast, substituting perfect LUTs from dedicated spares tolerates failure rates of only 0.01-0.05%.
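The remapping strategies named above can be illustrated in software. Below is a minimal, deliberately simplified sketch (our illustration, not the paper's defect model): a defect is treated as a truth-table configuration bit stuck at a fixed value, and we search input permutations and polarity inversions for an assignment that agrees with every stuck bit. All function names and the defect model are assumptions made for exposition.

```python
from itertools import permutations, product

K = 4  # LUT inputs

def physical_tt(tt, sigma, pol):
    """Truth table programmed into the physical LUT when logical input
    sigma[p] (optionally inverted per pol[p]) drives physical pin p."""
    T = [0] * (1 << K)
    for a in range(1 << K):                   # a = logical input assignment
        x = [(a >> i) & 1 for i in range(K)]
        addr = 0
        for p in range(K):
            addr |= (x[sigma[p]] ^ pol[p]) << p
        T[addr] = tt[a]
    return T

def remap(tt, stuck):
    """Find a (permutation, polarity) pair satisfying every stuck config bit.
    stuck: dict mapping truth-table address -> stuck-at value."""
    for sigma in permutations(range(K)):
        for pol in product((0, 1), repeat=K):
            T = physical_tt(tt, sigma, pol)
            if all(T[j] == v for j, v in stuck.items()):
                return sigma, pol
    return None  # LUT cannot host this function; fall back to packing/placement

# 4-input AND; config bit at address 0 stuck at 1: inverting all inputs works.
and4 = [1 if a == 0b1111 else 0 for a in range(1 << K)]
print(remap(and4, {0: 1}))
```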
{"title":"Exploiting partially defective LUTs: Why you don't need perfect fabrication","authors":"A. DeHon, Nikil Mehta","doi":"10.1109/FPT.2013.6718323","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718323","url":null,"abstract":"Shrinking integrated circuit feature sizes lead to increased variation and higher defect rates. Prior work has shown how to tolerate the failure of entire LUTs and how to tolerate failures and high variation in interconnect. We show how to use LUTs even when they are partially defective - a form of fine-grained defect tolerance. We characterize the defect tolerance of a range of mapping strategies for defective LUTs, including LUT swapping in a cluster, input permutation, input polarity selection, defect-aware packing, and defect-aware placement. By tolerating partially defective LUTs, we show that, even without allocating dedicated spare LUTs, it is possible to achieve near perfect yield with cluster local remapping when roughly 1% of the LUT multiplexers fail to switch. With full, defect-aware placement, this can increase to 10-25% with just a few extra rows and columns. In contrast, substitution of perfect LUTs to dedicated spares only tolerates failure rates of 0.01-0.05%.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128084928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Optimizing time and space multiplexed computation in a dynamically reconfigurable processor
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718338
T. Toi, Noritsugu Nakamura, T. Fujii, Toshiro Kitaoka, K. Togawa, K. Furuta, T. Awashima
One of the characteristics of our coarse-grained dynamically reconfigurable processor is that it uses the same operational resources for both control-intensive and data-intensive code segments. We maximize throughput by applying high-level synthesis knowledge under timing constraints. Because the optimal clock speeds for the two kinds of code segment differ, dynamic frequency control is introduced to shorten the total execution time. A state transition controller (STC) that handles the control step can change the clock speed every cycle. For control-intensive code segments, the STC delay is shortened by a rollback mechanism, which looks ahead to the next control step and rolls back if a different control step is actually selected. For data-intensive code segments, the delay is shortened by fully synchronized synthesis. Experimental results show throughput improvements of 18% to 56% when these optimizations are combined. A chip was fabricated in our 40-nm low-power process technology.
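To make the rollback idea concrete, here is a back-of-envelope software model (our construction, not NEC's implementation): the controller speculatively issues the predicted next control step each cycle and pays a penalty cycle whenever the actual branch target differs, versus a baseline that always waits for the branch to resolve. The accuracy and penalty parameters are illustrative assumptions.

```python
import random

def speculative_cycles(steps, accuracy, rollback_penalty=1, seed=0):
    """Cycles with look-ahead: one cycle per step, plus a rollback on mispredict."""
    rng = random.Random(seed)
    return sum(1 + (rollback_penalty if rng.random() > accuracy else 0)
               for _ in range(steps))

def baseline_cycles(steps, resolve_wait=1):
    """Cycles when every control step waits for its branch to resolve."""
    return steps * (1 + resolve_wait)

steps = 10_000
for acc in (0.7, 0.9, 0.99):
    spec = speculative_cycles(steps, acc)
    print(f"accuracy={acc:.2f}: {baseline_cycles(steps) / spec:.2f}x speedup")
```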
{"title":"Optimizing time and space multiplexed computation in a dynamically reconfigurable processor","authors":"T. Toi, Noritsugu Nakamura, T. Fujii, Toshiro Kitaoka, K. Togawa, K. Furuta, T. Awashima","doi":"10.1109/FPT.2013.6718338","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718338","url":null,"abstract":"One of the characteristics of our coarse-grained dynamically reconfigurable processor is that it uses the same operational resource for both control-intensive and dataintensive code segments. We maximize throughput from the knowledge of high-level synthesis under timing constraints. Because the optimal clock speeds for both code segments are different, a dynamic frequency control is introduced to shorten the total execution time. A state transition controller (STC) that handles the control step can change the clock speed for every cycle. For control-intensive code segments, the STC delay is shortened by a rollback mechanism, which looks ahead to the next control step and rolls back if a different control step is actually selected. For the data-intensive code segments, the delay is shortened by fully synchronized synthesis. Experimental results show that throughputs have increased from 18% to 56% with the combination of these optimizations. A chip was fabricated with our 40-nm low-power process technology.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133114474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

EasyPR — An easy usable open-source PR system
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718402
Dirk Koch, Christian Beckhoff, Alexander Wold, J. Tørresen
In this paper, we present an open-source partial reconfiguration (PR) system designed for portability and usability, serving as a reference for engineers and students interested in using the advanced reconfiguration capabilities available in Xilinx FPGAs. This includes design aspects such as floorplanning and interfacing PR modules, as well as fast reconfiguration and online management. The system features relocatable modules which can themselves contain reconfigurable modules, hence implementing hierarchical PR.
{"title":"EasyPR — An easy usable open-source PR system","authors":"Dirk Koch, Christian Beckhoff, Alexander Wold, J. Tørresen","doi":"10.1109/FPT.2013.6718402","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718402","url":null,"abstract":"In this paper, we present an open source partial reconfiguration (PR) system which is designed for portability and usability serving as a reference for engineers and students interested in using the advanced reconfiguration capabilities available in Xilinx FPGAs. This includes design aspects such as floorplanning and interfacing PR modules as well as fast reconfiguration and online management. The system features relocatable modules which can even contain reconfigurable modules themselves, hence, implementing hierarchical PR.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115088962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Real-time and low power embedded ℓ1-optimization solver design
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718348
Z. Ang, Akash Kumar
Basis pursuit denoising (BPDN) is an optimization method used in cutting-edge computer vision and compressive sensing research. Although hosting a BPDN solver on an embedded platform is desirable because analysis can be performed in real time, existing solvers are generally unsuitable for embedded implementation due to either poor run-time performance or high memory usage. To address these issues, this paper proposes an embedded-friendly solver which demonstrates superior run-time performance, high recovery accuracy, and competitive memory usage compared to existing solvers. For a problem with 5000 variables and 500 constraints, the solver occupies a small memory footprint of 29 kB and takes 0.14 seconds to complete on the Xilinx Zynq Z-7020 system-on-chip. The same problem takes 0.19 seconds on the Intel Core i7-2620M, which runs at 4 times the clock frequency and 114 times the power budget of the Z-7020. Without sacrificing run-time performance, the solver has been highly optimized for power-constrained embedded applications. To our knowledge, this is the first embedded solver capable of handling large-scale problems with several thousand variables.
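The abstract does not name the algorithm used, so as a hedged illustration of what a BPDN-style solver computes, here is a standard ISTA (iterative shrinkage-thresholding) sketch for the lasso form of the problem, at the same 5000-variable/500-constraint scale; the step size, iteration count and sparsity parameter are illustrative, not the paper's.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, iters=300):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by iterative shrinkage."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - A.T @ (A @ x - b) / L, lam / L)
    return x

# Toy recovery at the abstract's scale: 500 measurements, 5000 unknowns.
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 5000)) / np.sqrt(500)
x_true = np.zeros(5000)
x_true[rng.choice(5000, 20, replace=False)] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(500)
x_hat = ista(A, b, lam=0.05)
print("support recovered:", np.flatnonzero(np.abs(x_hat) > 0.1))
```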
{"title":"Real-time and low power embedded ℓ1-optimization solver design","authors":"Z. Ang, Akash Kumar","doi":"10.1109/FPT.2013.6718348","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718348","url":null,"abstract":"Basis pursuit denoising (BPDN) is an optimization method used in cutting edge computer vision and compressive sensing research. Although hosting a BPDN solver on an embedded platform is desirable because analysis can be performed in real-time, existing solvers are generally unsuitable for embedded implementation due to either poor run-time performance or high memory usage. To address the aforementioned issues, this paper proposes an embedded-friendly solver which demonstrates superior run-time performance, high recovery accuracy and competitive memory usage compared to existing solvers. For a problem with 5000 variables and 500 constraints, the solver occupies a small memory footprint of 29 kB and takes 0.14 seconds to complete on the Xilinx Zynq Z-7020 system-on-chip. The same problem takes 0.19 seconds on the Intel Core i7-2620M, which runs at 4 times the clock frequency and 114 times the power budget of the Z-7020. Without sacrificing runtime performance, the solver has been highly optimized for power constrained embedded applications. By far this is the first embedded solver capable of handling large scale problems with several thousand variables.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114518766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Artificial intelligence of Blokus Duo on FPGA using Cyber Work Bench
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718427
N. Sugimoto, Takaaki Miyajima, Takuya Kuhara, Y. Katuta, Takushi Mitsuichi, H. Amano
This paper presents the design of an FPGA-based Blokus Duo solver. It searches the game tree using the minimax algorithm with alpha-beta pruning and move ordering. In addition, an HLS tool called CyberWorkBench (CWB) is used to implement the hardware. By making use of CWB functions, a fully pipelined parallel design is generated. The implemented solver runs at 100 MHz on a Xilinx Spartan-6 XC6SLX45 FPGA on the Digilent Atlys board. It can search the game tree three moves ahead in most cases.
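For readers unfamiliar with the search technique, here is a minimal software rendering of minimax with alpha-beta pruning and (crude) move ordering; the toy take-1-2-or-3-stones game stands in for Blokus Duo, whose rules and hardware evaluation function are not reproduced here.

```python
import math

# Toy game standing in for Blokus Duo: players alternately take 1, 2 or 3
# stones from a pile; whoever takes the last stone wins.
def legal_moves(stones):
    return [m for m in (1, 2, 3) if m <= stones]

def alphabeta(stones, depth, alpha, beta, maximizing):
    if stones == 0:
        return -1 if maximizing else 1    # side to move has already lost
    if depth == 0:
        return 0                          # heuristic placeholder at the horizon
    moves = sorted(legal_moves(stones), reverse=True)   # crude move ordering
    value = -math.inf if maximizing else math.inf
    for m in moves:
        score = alphabeta(stones - m, depth - 1, alpha, beta, not maximizing)
        if maximizing:
            value = max(value, score)
            alpha = max(alpha, value)
        else:
            value = min(value, score)
            beta = min(beta, value)
        if alpha >= beta:
            break        # prune: the opponent will never allow this line
    return value

print(alphabeta(10, 10, -math.inf, math.inf, True))   # 1: first player wins
```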
{"title":"Artificial intelligence of Blokus Duo on FPGA using Cyber Work Bench","authors":"N. Sugimoto, Takaaki Miyajima, Takuya Kuhara, Y. Katuta, Takushi Mitsuichi, H. Amano","doi":"10.1109/FPT.2013.6718427","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718427","url":null,"abstract":"This paper presents a design of an FPGA-based Blokus Duo solver. It searches a game tree by using the miniMax algorithm with alpha-beta pruning and move ordering. In addition, HLS tool called CyberWorkBench (CWB) is used to implement hardware. By making the use of functions in CWB, parallel fully pipelined design is generated. The implemented solver works at 100MHz with Xilinx Spartan-6 XC6SLX45 FPGA on the Digilent Atlys board. It can search states after three moves in most cases.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130169400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Improving clock-rate of hard-macro designs
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718361
C. Lavin, B. Nelson, B. Hutchings
HMFlow reuses precompiled circuit modules (hard macros) and other techniques to rapidly compile large designs in a few seconds, many times faster than standard Xilinx flows. However, the clock rates of designs rapidly compiled by HMFlow are often significantly lower than those compiled by the Xilinx flow. To improve clock rates, the HMFlow algorithms were modified as follows: (1) the router was modified to take advantage of longer routing wires in the FPGA devices, (2) the original greedy placer was replaced with an annealing-based placer, and (3) certain registers were removed from the hard macros and moved into the fabric to reduce critical-path delays. Benchmark circuits compiled with these modifications achieve clock rates that are, on average, about 75% of those achieved by Xilinx. Fast run-times are also preserved: the improved algorithms increase HMFlow run-times by only about 50% across the benchmark suite, so HMFlow remains more than 30× faster than the standard Xilinx flow for the benchmarks tested in this paper.
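Modification (2) swaps a greedy placer for simulated annealing. As a hedged, generic illustration of that core loop (single cells and two-pin nets rather than HMFlow's hard macros), consider the sketch below; the cost model (half-perimeter wirelength), cooling schedule and swap move are standard textbook choices, not HMFlow's.

```python
import math, random

def hpwl(place, nets):
    """Total half-perimeter wirelength; place maps cell -> (x, y)."""
    total = 0
    for net in nets:
        xs = [place[c][0] for c in net]
        ys = [place[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(cells, nets, width, height, T=10.0, cooling=0.95, seed=0):
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(slots)
    place = dict(zip(cells, slots))        # random initial placement
    cost = hpwl(place, nets)
    while T > 0.01:
        for _ in range(100):
            a, b = rng.sample(cells, 2)
            place[a], place[b] = place[b], place[a]       # propose: swap two cells
            new = hpwl(place, nets)
            if new <= cost or rng.random() < math.exp((cost - new) / T):
                cost = new                                 # accept (possibly uphill)
            else:
                place[a], place[b] = place[b], place[a]    # reject: undo the swap
        T *= cooling                                       # cool the temperature
    return place, cost

cells = list("abcdef")
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "a")]
print(anneal(cells, nets, 3, 3)[1])
```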
{"title":"Improving clock-rate of hard-macro designs","authors":"C. Lavin, B. Nelson, B. Hutchings","doi":"10.1109/FPT.2013.6718361","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718361","url":null,"abstract":"HMFlow reuses precompiled circuit modules (hard macros) and other techniques to rapidly compile large designs in a few seconds - many times faster than standard Xilinx flows. However, the clock rates of designs rapidly compiled by HMFlow are often significantly lower than those compiled by the Xilinx flow. To improve clock rates, HMFlow algorithms were modified as follows: (1) the router was modified to take advantage of longer routing wires in the FPGA devices, (2) the original greedy placer was replaced with an annealing-based placer, and (3) certain registers were removed from the hard-macro and moved into the fabric to reduce critical-path delays. Benchmark circuits compiled with these modifications can achieve clock rates that are about 75% as fast as those achieved by Xilinx, on average. Fast run-times are also preserved; the improved algorithms only increase HMFlow run-times by about 50% across the benchmark suite so that HMFlow remains more than 30× faster than the standard Xilinx flow for the benchmarks tested in this paper.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125618275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Dynamic Stencil: Effective exploitation of run-time resources in reconfigurable clusters
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718356
Xinyu Niu, J. Coutinho, Yu Wang, W. Luk
Computing nodes in reconfigurable clusters are occupied and released by applications during their execution. At compile time, application developers therefore do not know how many resources will be available at run time. Dynamic Stencil is an approach that optimises stencil applications by constructing scalable designs which can adapt to the run-time resources available in a reconfigurable cluster. The approach has three stages - compile-time optimisation, run-time initialisation, and run-time scaling - and can be used to develop effective servers for stencil computation. Reverse-Time Migration, a high-performance stencil application, is developed with the proposed approach. Experimental results show that Dynamic Stencil designs achieve high throughput and high resource utilisation, and can dynamically scale onto nodes that become available during execution. When statically optimised and initialised, the Dynamic Stencil design is 1.8 to 88 times faster and 1.7 to 92 times more power efficient than reference CPU, GPU, MaxGenFD, Blue Gene/P, Blue Gene/Q and Cray XK6 designs; when dynamically scaled, resource utilisation reaches 91%, which is 1.8 to 2.3 times higher than the static counterparts.
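The run-time scaling stage implies re-partitioning work as nodes come and go. The following sketch (our illustration, with an even row split and no halo or communication modelling) shows the kind of bookkeeping involved when a fourth node frees up mid-run:

```python
def partition(n_rows, n_nodes):
    """Split stencil rows as evenly as possible across available nodes,
    returning per-node (start, end) row ranges."""
    base, extra = divmod(n_rows, n_nodes)
    bounds, start = [], 0
    for i in range(n_nodes):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

rows = 4096
print(partition(rows, 3))  # initial allocation on 3 nodes
print(partition(rows, 4))  # a 4th node becomes available: rebalance the domain
```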
{"title":"Dynamic Stencil: Effective exploitation of run-time resources in reconfigurable clusters","authors":"Xinyu Niu, J. Coutinho, Yu Wang, W. Luk","doi":"10.1109/FPT.2013.6718356","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718356","url":null,"abstract":"Computing nodes in reconfigurable clusters are occupied and released by applications during their execution. At compile time, application developers are not aware of the amount of resources available at run time. Dynamic Stencil is an approach that optimises stencil applications by constructing scalable designs which can adapt to available run-time resources in a reconfigurable cluster. This approach has three stages: compile-time optimisation, run-time initialisation, and run-time scaling, and can be used in developing effective servers for stencil computation. Reverse-Time Migration, a high-performance stencil application, is developed with the proposed approach. Experimental results show that high throughput and significant resource utilisation can be achieved with Dynamic Stencil designs, which can dynamically scale into nodes becoming available during their execution. When statically optimised and initialised, the Dynamic Stencil design is 1.8 to 88 times faster and 1.7 to 92 times more power efficient than reference CPU, GPU, MaxGenFD, Blue Gene/P, Blue Gene/Q and Cray XK6 designs; when dynamically scaled, resource utilisation of the design reaches 91%, which is 1.8 to 2.3 times higher than their static counterparts.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117020141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Virtual-to-Physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718331
R. Ammendola, A. Biagioni, O. Frezza, F. L. Cicero, A. Lonardo, P. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P. Vicini
We developed a custom FPGA-based Network Interface Controller named APEnet+, aimed at GPU-accelerated clusters for High Performance Computing. The card exploits the peer-to-peer capabilities (GPUDirect RDMA) of the latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading network task execution from the host CPU. In this work we focus on the implementation of a virtual-to-physical address translation mechanism using the FPGA's embedded soft processor. Address management is the most demanding task on the NIC receiving side - we estimated up to 70% of the μC load - making it the main culprit for the data bottleneck. To improve the performance of this task, and hence data transfer over the network, we added a specialized hardware logic block acting as a Translation Lookaside Buffer. This block uses a custom Content-Addressable Memory implementation designed for scalability and speed. We present detailed measurements demonstrating the benefits of this custom logic: a substantial reduction in address translation latency (from a measured 1.9 μs to 124 ns) and a performance enhancement of both host-bound and GPU-bound data transfers (up to ~60% bandwidth increase) in given message size ranges.
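As a software analogue of the mechanism described (hedged: in APEnet+ the lookup is a CAM in FPGA logic and the miss path runs on the soft processor; the entry count and page size below are assumptions), a small fully-associative TLB with LRU replacement can be modelled like this:

```python
from collections import OrderedDict

class TLB:
    """Toy fully-associative TLB with LRU eviction; an OrderedDict stands in
    for the CAM that performs the associative match in hardware."""
    def __init__(self, entries=64, page_size=4096):
        self.entries, self.page = entries, page_size
        self.map = OrderedDict()           # virtual page number -> physical page number
        self.hits = self.misses = 0

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, self.page)
        if vpn in self.map:
            self.hits += 1
            self.map.move_to_end(vpn)      # refresh LRU position
        else:
            self.misses += 1               # slow path: soft core walks the table
            self.map[vpn] = page_table[vpn]
            if len(self.map) > self.entries:
                self.map.popitem(last=False)   # evict least-recently-used entry
        return self.map[vpn] * self.page + offset

tlb = TLB(entries=2)
page_table = {0: 7, 1: 3, 2: 9}
for va in (0x10, 0x1010, 0x10, 0x2010, 0x10):
    tlb.translate(va, page_table)
print(tlb.hits, tlb.misses)   # locality turns most lookups into fast CAM hits
```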
{"title":"Virtual-to-Physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities","authors":"R. Ammendola, A. Biagioni, O. Frezza, F. L. Cicero, A. Lonardo, P. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P. Vicini","doi":"10.1109/FPT.2013.6718331","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718331","url":null,"abstract":"We developed a custom FPGA-based Network Interface Controller named APEnet+ aimed at GPU accelerated clusters for High Performance Computing. The card exploits peer-to-peer capabilities (GPU-Direct RDMA) for latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading the host CPU from network tasks execution. In this work we focus on the implementation of a Virtual to Physical address translation mechanism, using the FPGA embedded soft-processor. Address management is the most demanding task - we estimated up to 70% of the μC load - for the NIC receiving side, resulting being the main culprit for data bottleneck. To improve the performance of this task and hence improve data transfer over the network, we added a specialized hardware logic block acting as a Translation Lookaside Buffer. This block makes use of a peculiar Content Address Memory implementation designed for scalability and speed. We present detailed measurements to demonstrate the benefits coming from the introduction of such custom logic: a substantial address translation latency reduction (from a measured value of 1.9 μs to 124 ns) and a performance enhancement of both host-bound and GPU-bound data transfers (up to ~ 60% of bandwidth increase) in given message size ranges.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132552134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

System-level FPGA device driver with high-level synthesis support
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718342
Kizheppatt Vipin, Shanker Shreejith, D. Gunasekera, Suhaib A. Fahmy, Nachiket Kapre
We can exploit the standardization of communication abstractions provided by modern high-level synthesis tools like Vivado HLS, Bluespec and SCORE to provide stable system interfaces between the host and PCIe-based FPGA accelerator platforms. At a high level, our FPGA driver attempts to provide CUDA-like driver behavior, and more, to FPGA programmers. On the FPGA fabric, we develop an AXI-compliant, lightweight interface switch coupled to multiple physical interfaces (PCIe, Ethernet, DRAM) to provide programmable, portable routing between the host and user logic on the FPGA. On the host, we adapt the RIFFA 1.0 driver to provide enhanced communication APIs along with bitstream configuration capability, allowing low-latency, high-throughput communication and safe, reliable programming of user logic on the FPGA. Our driver consumes only 21% of BRAMs and 14% of logic on a Xilinx ML605 platform, or 9% of BRAMs and 8% of logic on a Xilinx V707 board. We are able to sustain DMA transfer throughput (to DRAM) of 1.47 GB/s (74% of peak) of the PCIe (x4 Gen2) bandwidth, 120.2 MB/s (96%) of the Ethernet (1G) bandwidth and 5.93 GB/s (92.5%) of the DRAM bandwidth.
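To convey what "CUDA-like driver behavior" might look like from the host side, here is a purely hypothetical facade: none of these names are RIFFA's or the paper's API, and an in-memory buffer stands in for the board's DRAM.

```python
class MockFpga:
    """Hypothetical CUDA-style host facade (names are illustrative only);
    a bytearray stands in for the accelerator board's DRAM."""
    def __init__(self, dram_bytes=1 << 20):
        self.dram = bytearray(dram_bytes)
        self.bitstream = None

    def configure(self, bitstream_path):     # cf. cuModuleLoad in CUDA
        self.bitstream = bitstream_path      # a real driver reprograms the fabric

    def memcpy_h2d(self, offset, data):      # a real driver: DMA write over PCIe
        self.dram[offset:offset + len(data)] = data

    def memcpy_d2h(self, offset, nbytes):    # a real driver: DMA read over PCIe
        return bytes(self.dram[offset:offset + nbytes])

fpga = MockFpga()
fpga.configure("user_logic.bit")
fpga.memcpy_h2d(0, b"hello accelerator")
print(fpga.memcpy_d2h(0, 17))
```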
{"title":"System-level FPGA device driver with high-level synthesis support","authors":"Kizheppatt Vipin, Shanker Shreejith, D. Gunasekera, Suhaib A. Fahmy, Nachiket Kapre","doi":"10.1109/FPT.2013.6718342","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718342","url":null,"abstract":"We can exploit the standardization of communication abstractions provided by modern high-level synthesis tools like Vivado HLS, Bluespec and SCORE to provide stable system interfaces between the host and PCIe-based FPGA accelerator platforms. At a high level, our FPGA driver attempts to provide CUDA-like driver behavior, and more, to FPGA programmers. On the FPGA fabric, we develop an AXI-compliant, lightweight interface switch coupled to multiple physical interfaces (PCIe, Ethernet, DRAM) to provide programmable, portable routing capability between the host and user logic on the FPGA. On the host, we adapt the RIFFA 1.0 driver to provide enhanced communication APIs along with bitstream configuration capability allowing low-latency, high-throughput communication and safe, reliable programming of user logic on the FPGA. Our driver only consumes 21% BRAMs and 14% logic overhead on a Xilinx ML605 platform or 9% BRAMs and 8% logic overhead on a Xilinx V707 board. We are able to sustain DMA transfer throughput (to DRAM) of 1.47GB/s (74% peak) of the PCIe (x4 Gen2) bandwidth, 120.2MB/s (96%) of the Ethernet (1G) bandwidth and 5.93GB/s (92.5%) of DRAM bandwidth.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131676707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Semantics-directed machine architecture in ReWire
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718410
A. Procter, W. Harrison, I. Graves, M. Becchi, G. Allwein
The functional programming community has developed a number of powerful abstractions for dealing with diverse programming models in a modular way. Beginning with a core of pure, side-effect-free computation, modular monadic semantics (MMS) allows designers to construct domain-specific languages by adding layers of semantic features, such as mutable state and I/O, in an à la carte fashion. In the realm of interpreter and compiler construction, the benefits of this approach are manifold and well explored. This paper advocates bringing the tools of MMS to bear on hardware design and verification. In particular, we discuss a prototype compiler called ReWire which translates high-level MMS hardware specifications into working circuits on FPGAs. This enables designers to tackle the complexity of hardware design in a modular way, without compromising efficiency.
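ReWire itself is Haskell-based; purely as a language-neutral illustration of the monadic style the abstract describes (and not of ReWire's actual surface syntax), here is a minimal state monad and a two-bit counter written against it:

```python
class State:
    """Minimal state monad: a computation is a function state -> (value, state)."""
    def __init__(self, run):
        self.run = run

    def bind(self, f):
        def stepped(s):
            v, s2 = self.run(s)    # run this computation, thread the state
            return f(v).run(s2)    # feed the value to the continuation
        return State(stepped)

    @staticmethod
    def unit(v):
        return State(lambda s: (v, s))

get = State(lambda s: (s, s))              # read the current state
def put(s_new):
    return State(lambda _s: (None, s_new)) # overwrite the state

# A two-bit counter as a monadic state machine: each tick increments mod 4.
tick = get.bind(lambda c: put((c + 1) % 4))
three_ticks = tick.bind(lambda _: tick).bind(lambda _: tick)
print(three_ticks.run(0))   # (None, 3)
```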
{"title":"Semantics-directed machine architecture in ReWire","authors":"A. Procter, W. Harrison, I. Graves, M. Becchi, G. Allwein","doi":"10.1109/FPT.2013.6718410","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718410","url":null,"abstract":"The functional programming community has developed a number of powerful abstractions for dealing with diverse programming models in a modular way. Beginning with a core of pure, side effect free computation, modular monadic semantics (MMS) allows designers to construct domain-specific languages by adding layers of semantic features, such as mutable state and I/O, in an a' la carte fashion. In the realm of interpreter and compiler construction, the benefits of this approach are manifold and well explored. This paper advocates bringing the tools of MMS to bear on hardware design and verification. In particular, we shall discuss a prototype compiler called ReWire which translates high-level MMS hardware specifications into working circuits on FPGAs. This enables designers to tackle the complexity of hardware design in a modular way, without compromising efficiency.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134165365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}