
Latest publications: 2013 International Conference on Field-Programmable Technology (FPT)

Exploiting partially defective LUTs: Why you don't need perfect fabrication
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718323
A. DeHon, Nikil Mehta
Shrinking integrated circuit feature sizes lead to increased variation and higher defect rates. Prior work has shown how to tolerate the failure of entire LUTs and how to tolerate failures and high variation in interconnect. We show how to use LUTs even when they are partially defective - a form of fine-grained defect tolerance. We characterize the defect tolerance of a range of mapping strategies for defective LUTs, including LUT swapping in a cluster, input permutation, input polarity selection, defect-aware packing, and defect-aware placement. By tolerating partially defective LUTs, we show that, even without allocating dedicated spare LUTs, it is possible to achieve near perfect yield with cluster local remapping when roughly 1% of the LUT multiplexers fail to switch. With full, defect-aware placement, this can increase to 10-25% with just a few extra rows and columns. In contrast, substitution of perfect LUTs to dedicated spares only tolerates failure rates of 0.01-0.05%.
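As a rough illustration of the input-permutation and input-polarity strategies described in the abstract, the Python sketch below models a small k-input LUT whose configuration bits may be stuck at known values, and searches for a permutation/polarity assignment under which the function still maps correctly onto the defective table. The stuck-bit defect model and all function names are hypothetical simplifications, not the paper's implementation.

```python
from itertools import permutations, product

def remap_table(tt, perm, pol, k):
    """Truth table as stored after permuting LUT inputs and flipping input
    polarities, so that stored[addr] still computes the original function."""
    out = [0] * (1 << k)
    for addr in range(1 << k):
        bits = [(addr >> j) & 1 for j in range(k)]
        # recover the logical input values this physical address encodes
        logical = [bits[perm[j]] ^ pol[j] for j in range(k)]
        out[addr] = tt[sum(b << j for j, b in enumerate(logical))]
    return out

def find_defect_free_mapping(tt, stuck, k=3):
    """stuck maps a config-bit position to its stuck value; return a
    (perm, pol) pair whose remapped table agrees with every stuck bit,
    or None if no input permutation/polarity choice hides the defects."""
    for perm in permutations(range(k)):
        for pol in product((0, 1), repeat=k):
            cand = remap_table(tt, perm, pol, k)
            if all(cand[p] == v for p, v in stuck.items()):
                return perm, pol
    return None
```

For a 3-input LUT this searches only 3! x 2^3 = 48 candidates, which hints at why such fine-grained remapping is cheap enough to apply per LUT.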
Citations: 14
Optimizing time and space multiplexed computation in a dynamically reconfigurable processor
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718338
T. Toi, Noritsugu Nakamura, T. Fujii, Toshiro Kitaoka, K. Togawa, K. Furuta, T. Awashima
One of the characteristics of our coarse-grained dynamically reconfigurable processor is that it uses the same operational resources for both control-intensive and data-intensive code segments. We maximize throughput by applying high-level synthesis knowledge under timing constraints. Because the optimal clock speeds for the two kinds of code segments differ, dynamic frequency control is introduced to shorten the total execution time. A state transition controller (STC) that handles the control step can change the clock speed every cycle. For control-intensive code segments, the STC delay is shortened by a rollback mechanism, which looks ahead to the next control step and rolls back if a different control step is actually selected. For data-intensive code segments, the delay is shortened by fully synchronized synthesis. Experimental results show that these optimizations combined increase throughput by 18% to 56%. A chip was fabricated with our 40-nm low-power process technology.
Citations: 17
EasyPR — An easy usable open-source PR system
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718402
Dirk Koch, Christian Beckhoff, Alexander Wold, J. Tørresen
In this paper, we present an open-source partial reconfiguration (PR) system designed for portability and usability, serving as a reference for engineers and students interested in using the advanced reconfiguration capabilities available in Xilinx FPGAs. This covers design aspects such as floorplanning and interfacing PR modules, as well as fast reconfiguration and online management. The system features relocatable modules which can themselves contain reconfigurable modules, hence implementing hierarchical PR.
Citations: 2
Real-time and low power embedded ℓ1-optimization solver design
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718348
Z. Ang, Akash Kumar
Basis pursuit denoising (BPDN) is an optimization method used in cutting edge computer vision and compressive sensing research. Although hosting a BPDN solver on an embedded platform is desirable because analysis can be performed in real-time, existing solvers are generally unsuitable for embedded implementation due to either poor run-time performance or high memory usage. To address the aforementioned issues, this paper proposes an embedded-friendly solver which demonstrates superior run-time performance, high recovery accuracy and competitive memory usage compared to existing solvers. For a problem with 5000 variables and 500 constraints, the solver occupies a small memory footprint of 29 kB and takes 0.14 seconds to complete on the Xilinx Zynq Z-7020 system-on-chip. The same problem takes 0.19 seconds on the Intel Core i7-2620M, which runs at 4 times the clock frequency and 114 times the power budget of the Z-7020. Without sacrificing runtime performance, the solver has been highly optimized for power constrained embedded applications. By far this is the first embedded solver capable of handling large scale problems with several thousand variables.
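The BPDN objective itself is standard. As a minimal reference point (a generic solver sketch, not the paper's embedded design), the classic iterative shrinkage-thresholding algorithm (ISTA) minimises the same objective, 0.5·||Ax − b||² + λ·||x||₁, by alternating a gradient step with soft-thresholding:

```python
import numpy as np

def ista(A, b, lam, steps=300):
    """Iterative shrinkage-thresholding for the BPDN objective
    0.5*||Ax - b||^2 + lam*||x||_1 (generic sketch, not the paper's solver)."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        z = x - A.T @ (A @ x - b) / L      # gradient step on the smooth term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return x
```

Memory-wise, such a first-order method needs only a few vectors of the problem size beyond A itself, which is one reason this family of solvers suits embedded targets.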
Citations: 3
Artificial intelligence of Blokus Duo on FPGA using Cyber Work Bench
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718427
N. Sugimoto, Takaaki Miyajima, Takuya Kuhara, Y. Katuta, Takushi Mitsuichi, H. Amano
This paper presents the design of an FPGA-based Blokus Duo solver. It searches the game tree using the minimax algorithm with alpha-beta pruning and move ordering. In addition, an HLS tool called CyberWorkBench (CWB) is used to implement the hardware. By making use of functions in CWB, a parallel, fully pipelined design is generated. The implemented solver runs at 100 MHz on a Xilinx Spartan-6 XC6SLX45 FPGA on the Digilent Atlys board. In most cases it can search states three moves ahead.
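The search strategy named in the abstract — minimax with alpha-beta pruning and move ordering — is a textbook algorithm; a compact Python sketch over a toy game tree (integer leaves, nested lists for internal nodes) shows the idea, independent of the hardware design:

```python
def shallow_score(node):
    """Cheap one-ply heuristic used only to order moves (toy choice here:
    a leaf scores itself, an unexpanded subtree scores 0)."""
    return node if isinstance(node, int) else 0

def alphabeta(node, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Minimax with alpha-beta pruning over a nested-list game tree."""
    if isinstance(node, int):          # leaf: static evaluation
        return node
    # Move ordering: visit the most promising children first so the
    # alpha-beta window tightens early and more siblings get pruned.
    children = sorted(node, key=shallow_score, reverse=maximizing)
    best = float("-inf") if maximizing else float("inf")
    for child in children:
        val = alphabeta(child, alpha, beta, not maximizing)
        if maximizing:
            best = max(best, val)
            alpha = max(alpha, best)
        else:
            best = min(best, val)
            beta = min(beta, best)
        if alpha >= beta:              # remaining siblings cannot change the result
            break
    return best
```

With good ordering, alpha-beta examines roughly the square root of the nodes plain minimax would, which is what makes a fixed-depth hardware search at 100 MHz practical.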
Citations: 1
Improving clock-rate of hard-macro designs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718361
C. Lavin, B. Nelson, B. Hutchings
HMFlow reuses precompiled circuit modules (hard macros) and other techniques to rapidly compile large designs in a few seconds - many times faster than standard Xilinx flows. However, the clock rates of designs rapidly compiled by HMFlow are often significantly lower than those compiled by the Xilinx flow. To improve clock rates, HMFlow algorithms were modified as follows: (1) the router was modified to take advantage of longer routing wires in the FPGA devices, (2) the original greedy placer was replaced with an annealing-based placer, and (3) certain registers were removed from the hard-macro and moved into the fabric to reduce critical-path delays. Benchmark circuits compiled with these modifications can achieve clock rates that are about 75% as fast as those achieved by Xilinx, on average. Fast run-times are also preserved; the improved algorithms only increase HMFlow run-times by about 50% across the benchmark suite so that HMFlow remains more than 30× faster than the standard Xilinx flow for the benchmarks tested in this paper.
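Modification (2) swaps a greedy placer for an annealing-based one. A generic simulated-annealing placer minimising total half-perimeter wirelength (HPWL) looks roughly like the sketch below; the grid model, cost function, and cooling schedule are illustrative assumptions, not HMFlow's actual placer:

```python
import math
import random

def anneal_place(nets, n_cells, grid_w, iters=20000, t0=5.0, cool=0.9995, seed=1):
    """Toy annealing placer: each cell occupies one slot on a grid_w-wide
    grid; random pairwise swaps are accepted by the Metropolis criterion.
    Returns (initial HPWL, best HPWL found)."""
    rng = random.Random(seed)
    slot = list(range(n_cells))                    # slot index of each cell
    xy = lambda s: (s % grid_w, s // grid_w)

    def hpwl():
        total = 0
        for net in nets:
            xs, ys = zip(*(xy(slot[c]) for c in net))
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total

    cost = start = hpwl()
    best, t = cost, t0
    for _ in range(iters):
        a, b = rng.randrange(n_cells), rng.randrange(n_cells)
        slot[a], slot[b] = slot[b], slot[a]        # propose a swap
        new = hpwl()
        if new <= cost or rng.random() < math.exp((cost - new) / t):
            cost = new                             # accept (possibly uphill)
            best = min(best, cost)
        else:
            slot[a], slot[b] = slot[b], slot[a]    # reject: undo the swap
        t *= cool
    return start, best
```

Accepting occasional uphill swaps is what lets annealing escape the local minima a greedy placer gets stuck in, at the cost of longer run-times — matching the paper's reported ~50% run-time increase for better clock rates.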
Citations: 7
Dynamic Stencil: Effective exploitation of run-time resources in reconfigurable clusters
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718356
Xinyu Niu, J. Coutinho, Yu Wang, W. Luk
Computing nodes in reconfigurable clusters are occupied and released by applications during their execution. At compile time, application developers are not aware of the amount of resources available at run time. Dynamic Stencil is an approach that optimises stencil applications by constructing scalable designs which can adapt to available run-time resources in a reconfigurable cluster. This approach has three stages: compile-time optimisation, run-time initialisation, and run-time scaling, and can be used in developing effective servers for stencil computation. Reverse-Time Migration, a high-performance stencil application, is developed with the proposed approach. Experimental results show that high throughput and significant resource utilisation can be achieved with Dynamic Stencil designs, which can dynamically scale into nodes becoming available during their execution. When statically optimised and initialised, the Dynamic Stencil design is 1.8 to 88 times faster and 1.7 to 92 times more power efficient than reference CPU, GPU, MaxGenFD, Blue Gene/P, Blue Gene/Q and Cray XK6 designs; when dynamically scaled, resource utilisation of the design reaches 91%, which is 1.8 to 2.3 times higher than their static counterparts.
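The run-time scaling stage amounts to redistributing the stencil domain over whichever nodes happen to be free when the design starts or scales. A hypothetical sketch of such a partitioning step (names and the row-block decomposition are illustrative assumptions, not the paper's scheme):

```python
def partition_rows(n_rows, nodes):
    """Split a stencil domain into contiguous half-open row ranges, one per
    available node, spreading the remainder over the first nodes."""
    base, extra = divmod(n_rows, len(nodes))
    blocks, start = {}, 0
    for i, node in enumerate(nodes):
        end = start + base + (1 if i < extra else 0)
        blocks[node] = (start, end)    # rows [start, end) assigned to node
        start = end
    return blocks
```

Because the split is computed from the node list at run time, the same compiled design can occupy one node or many without recompilation, which is the essence of the run-time scaling stage.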
Citations: 7
Virtual-to-Physical address translation for an FPGA-based interconnect with host and GPU remote DMA capabilities
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718331
R. Ammendola, A. Biagioni, O. Frezza, F. L. Cicero, A. Lonardo, P. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P. Vicini
We developed a custom FPGA-based Network Interface Controller named APEnet+ aimed at GPU-accelerated clusters for High Performance Computing. The card exploits peer-to-peer capabilities (GPU-Direct RDMA) for the latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading the host CPU from network task execution. In this work we focus on the implementation of a Virtual-to-Physical address translation mechanism, using the FPGA embedded soft-processor. Address management is the most demanding task for the NIC receiving side - we estimated up to 70% of the μC load - making it the main culprit for the data bottleneck. To improve the performance of this task, and hence improve data transfer over the network, we added a specialized hardware logic block acting as a Translation Lookaside Buffer. This block makes use of a Content Addressable Memory implementation designed for scalability and speed. We present detailed measurements to demonstrate the benefits coming from the introduction of such custom logic: a substantial address translation latency reduction (from a measured value of 1.9 μs to 124 ns) and a performance enhancement of both host-bound and GPU-bound data transfers (up to ~ 60% of bandwidth increase) in given message size ranges.
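The TLB added in this work is a hardware block, but a toy software model conveys its behaviour: an associative (CAM-like) match on the virtual page number serves the fast path, with a slow page-table walk only on a miss. The class name, FIFO replacement policy, and 4 KiB page size below are illustrative assumptions:

```python
class SoftTLB:
    """Toy model of a TLB: fixed-capacity associative store keyed by
    virtual page number, FIFO replacement, page-table walk on a miss."""

    def __init__(self, capacity=64, page_bits=12):
        self.capacity, self.page_bits = capacity, page_bits
        self.entries = {}      # vpn -> ppn (stands in for the hardware CAM)
        self.fifo = []         # insertion order, for replacement
        self.hits = self.misses = 0

    def translate(self, vaddr, page_table):
        vpn = vaddr >> self.page_bits
        off = vaddr & ((1 << self.page_bits) - 1)
        if vpn in self.entries:                 # CAM hit: fast path
            self.hits += 1
        else:                                   # miss: walk the page table
            self.misses += 1
            if len(self.fifo) >= self.capacity:
                self.entries.pop(self.fifo.pop(0))
            self.entries[vpn] = page_table[vpn]
            self.fifo.append(vpn)
        return (self.entries[vpn] << self.page_bits) | off
```

The hardware analogue of the hit path is a single-cycle parallel compare across all CAM entries, which is where the reported latency drop from 1.9 μs (soft-processor lookup) to 124 ns comes from.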
Citations: 22
System-level FPGA device driver with high-level synthesis support
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718342
Kizheppatt Vipin, Shanker Shreejith, D. Gunasekera, Suhaib A. Fahmy, Nachiket Kapre
We can exploit the standardization of communication abstractions provided by modern high-level synthesis tools like Vivado HLS, Bluespec and SCORE to provide stable system interfaces between the host and PCIe-based FPGA accelerator platforms. At a high level, our FPGA driver attempts to provide CUDA-like driver behavior, and more, to FPGA programmers. On the FPGA fabric, we develop an AXI-compliant, lightweight interface switch coupled to multiple physical interfaces (PCIe, Ethernet, DRAM) to provide programmable, portable routing capability between the host and user logic on the FPGA. On the host, we adapt the RIFFA 1.0 driver to provide enhanced communication APIs along with bitstream configuration capability allowing low-latency, high-throughput communication and safe, reliable programming of user logic on the FPGA. Our driver only consumes 21% BRAMs and 14% logic overhead on a Xilinx ML605 platform or 9% BRAMs and 8% logic overhead on a Xilinx V707 board. We are able to sustain DMA transfer throughput (to DRAM) of 1.47GB/s (74% peak) of the PCIe (x4 Gen2) bandwidth, 120.2MB/s (96%) of the Ethernet (1G) bandwidth and 5.93GB/s (92.5%) of DRAM bandwidth.
Citations: 33
Semantics-directed machine architecture in ReWire
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718410
A. Procter, W. Harrison, I. Graves, M. Becchi, G. Allwein
The functional programming community has developed a number of powerful abstractions for dealing with diverse programming models in a modular way. Beginning with a core of pure, side effect free computation, modular monadic semantics (MMS) allows designers to construct domain-specific languages by adding layers of semantic features, such as mutable state and I/O, in an à la carte fashion. In the realm of interpreter and compiler construction, the benefits of this approach are manifold and well explored. This paper advocates bringing the tools of MMS to bear on hardware design and verification. In particular, we shall discuss a prototype compiler called ReWire which translates high-level MMS hardware specifications into working circuits on FPGAs. This enables designers to tackle the complexity of hardware design in a modular way, without compromising efficiency.
{"title":"Semantics-directed machine architecture in ReWire","authors":"A. Procter, W. Harrison, I. Graves, M. Becchi, G. Allwein","doi":"10.1109/FPT.2013.6718410","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718410","url":null,"abstract":"The functional programming community has developed a number of powerful abstractions for dealing with diverse programming models in a modular way. Beginning with a core of pure, side effect free computation, modular monadic semantics (MMS) allows designers to construct domain-specific languages by adding layers of semantic features, such as mutable state and I/O, in an a' la carte fashion. In the realm of interpreter and compiler construction, the benefits of this approach are manifold and well explored. This paper advocates bringing the tools of MMS to bear on hardware design and verification. In particular, we shall discuss a prototype compiler called ReWire which translates high-level MMS hardware specifications into working circuits on FPGAs. This enables designers to tackle the complexity of hardware design in a modular way, without compromising efficiency.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134165365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 4
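The abstract describes building language semantics "à la carte": a pure core, with features such as mutable state added as separate layers. A minimal Python sketch of that layering idea follows; the expression encoding and function names here are illustrative only and are not ReWire's actual API (ReWire itself works with monadic Haskell).

```python
# Illustrative sketch of "a la carte" semantic layering:
# a pure evaluation core, plus a state layer that threads a store
# through and delegates every pure form back to the core.

def pure_eval(expr, env):
    """Core layer: pure, side-effect-free expression evaluation."""
    op, *args = expr
    if op == "lit":
        return args[0]
    if op == "add":
        return pure_eval(args[0], env) + pure_eval(args[1], env)
    if op == "var":
        return env[args[0]]
    raise ValueError(f"unknown form: {op}")

def with_state(expr, env, store):
    """State layer: adds get/put over a store, reusing pure_eval."""
    op, *args = expr
    if op == "get":
        return store[args[0]], store
    if op == "put":
        val, store = with_state(args[1], env, store)
        return val, {**store, args[0]: val}
    return pure_eval(expr, env), store

val, store = with_state(("put", "r0", ("add", ("lit", 2), ("lit", 3))), {}, {})
print(val, store)  # 5 {'r0': 5}
```

The point of the layering is that `pure_eval` never mentions the store: each added feature (here, state) is a wrapper that handles only its own forms, mirroring how monad transformers stack semantic features in MMS.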