2013 International Conference on Field-Programmable Technology (FPT): Latest Articles

Design and optimization of heterogeneous tree-based FPGA using 3D technology

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-09 DOI: 10.1109/FPT.2013.6718380

V. Pangracious, Z. Marrakchi, H. Mehrez

CMOS technology scaling has greatly improved the overall performance and density of Field Programmable Gate Arrays (FPGAs). However, on performance metrics such as speed, area and power consumption, the gap between FPGAs and application-specific integrated circuits (ASICs) is generally very wide, mainly due to the programmable interconnect overhead. We propose a 3-dimensional (3D) design methodology using horizontal design partitioning to vertically stack heterogeneous FPGA designs based on a Tree-based multilevel FPGA architecture. We describe the 3D design and optimization methodology to improve speed, interconnect area and power consumption using Tezzaron's 3D stacking technology.

Citations: 0

Correction to “Graph Minor Approach for Application Mapping on CGRAs”

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718431

Liang Chen, T. Mitra

Following the publication of the article “Graph Minor Approach for Application Mapping on CGRAs” [1] in the proceedings of the International Conference on Field Programmable Technology (ICFPT) 2012, we received correspondence [2] pointing to some inaccuracies in the article. With this correction, we would like to clarify some points that could otherwise be misconstrued.

Citations: 1

Application-specific customisation of market data feed arbitration

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718377

S. Denholm, Hiroaki Inoue, Takashi Takenaka, W. Luk

Messages are transmitted from financial exchanges to update their members about changes in the market. As UDP packets are used for message transmission, members subscribe to two identical message feeds from the exchange to lower the risk of message loss or delay. As financial trades can be time sensitive, low latency arbitration between these market data feeds is of particular importance. Members must either provide generic arbitration for all of their financial applications, increasing latency, or arbitrate within each application which wastes resources and scales poorly. We present a reconfigurable accelerated approach for market feed arbitration operating at the network level. Multiple arbitrators can operate within a single FPGA to output customised feeds to downstream financial applications. Application-specific customisations are supported by each core, allowing different market feed messaging protocols, windowing operations and message buffering parameters. We model multiple-core arbitration and explore the scalability and performance improvements within and between cores. We demonstrate our design within a Xilinx Virtex-6 FPGA using the NASDAQ TotalView-ITCH 4.1 messaging standard. Our implementation operates at 16Gbps throughput, and with resource sharing, supports 12 independent cores, 33% more than simple core replication. A 56ns (7 clock cycles) windowing latency is achieved, 2.6 times lower than a hardware-accelerated CPU approach.
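The arbitration idea in this abstract can be illustrated in software. The following is a minimal sketch, assuming each message carries a sequence number and that the two redundant feeds are merged into one arrival stream; the `arbitrate` function and the `(sequence, payload)` message shape are hypothetical and are not taken from the paper. It omits the windowing timeout the abstract mentions, so a message lost on both feeds would stall the output.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical message shape: (sequence number, payload).
Message = Tuple[int, str]


def arbitrate(arrivals: Iterable[Message]) -> Iterator[Message]:
    """Forward the first copy of each sequence number, in sequence order.

    `arrivals` is the interleaved stream of messages from both feeds.
    Later copies of an already-seen sequence number are dropped.
    """
    next_seq: Optional[int] = None   # next sequence number to emit
    pending: Dict[int, str] = {}     # early arrivals waiting for a gap to fill

    for seq, payload in arrivals:
        if next_seq is None:
            next_seq = seq           # assume the first message starts the stream
        if seq < next_seq or seq in pending:
            continue                 # duplicate from the other feed
        pending[seq] = payload
        # Emit every message that is now contiguous with the output.
        while next_seq in pending:
            yield next_seq, pending.pop(next_seq)
            next_seq += 1


# Feed A lost message 2; feed B delivers it slightly later.
merged = [(1, "a"), (1, "a"), (3, "c"), (2, "b"), (2, "b"), (3, "c"), (4, "d")]
print(list(arbitrate(merged)))
```

Buffering out-of-order messages until the gap is filled keeps the output ordered; a real arbiter would also bound the wait, which is the role of the windowing operation described in the abstract.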

Citations: 7

A high-throughput FPGA architecture for parallel connected components analysis based on label reuse

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718372

M. Klaiber, D. Bailey, Silvia Ahmed, Y. Baroud, S. Simon

A memory efficient architecture for single-pass connected components analysis suited for high throughput embedded image processing systems is proposed which achieves a high throughput by partitioning the image into several vertical slices processed in parallel. The low latency of the architecture allows reuse of labels associated with the image objects. This reduces the amount of memory by a factor of more than 5 compared to previous work. This is significant, since memory is a critical resource in embedded image processing on FPGAs.
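For readers unfamiliar with single-pass connected components analysis, the sketch below shows the general idea in software: the image is scanned once, row by row, and a per-component feature (here, pixel area) is accumulated while provisional labels are merged. It assumes 4-connectivity and a binary image given as equal-length rows; the function name `component_areas` is hypothetical. It does not model the paper's label reuse or parallel vertical slices.

```python
from typing import List, Sequence


def component_areas(image: Sequence[Sequence[int]]) -> List[int]:
    """Return the sorted pixel areas of 4-connected foreground components."""
    parent: List[int] = []   # union-find parent for each provisional label
    area: List[int] = []     # pixel count stored at each root label

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a: int, b: int) -> int:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
            area[ra] += area[rb]            # merge accumulated features
            area[rb] = 0
        return ra

    prev: List[int] = []                    # labels of the previous row
    for row in image:
        cur = [-1] * len(row)
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            left = cur[x - 1] if x > 0 else -1
            up = prev[x] if x < len(prev) else -1
            if left < 0 and up < 0:
                parent.append(len(parent))  # new provisional label
                area.append(0)
                label = len(parent) - 1
            elif left >= 0 and up >= 0:
                label = union(left, up)     # two runs meet: merge them
            else:
                label = find(left if left >= 0 else up)
            area[label] += 1
            cur[x] = label
        prev = cur                          # only one row of labels is kept

    return sorted(a for i, a in enumerate(area) if parent[i] == i)


print(component_areas([[1, 0, 1],
                       [1, 1, 1]]))
```

Only the previous row of labels is retained, which is what makes the approach single-pass; the label table itself still grows with the number of provisional labels, and that is the memory the label-reuse idea in the abstract targets.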

Citations: 22

Exploiting stochastic delay variability on FPGAs with adaptive partial rerouting

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718362

Zhenyu Guan, Justin S. J. Wong, S. Chaudhuri, G. Constantinides, P. Cheung

Aggressive transistor scaling will soon lead us to the physical upper-bound of process technology, where stochastic process variability dominates the timing performance of FPGA components. In this paper, a variation-aware partial-rerouting method is proposed to mitigate and take advantage of the effect of delay variability due to process variation. The variation in logic delay across each FPGA (variation map) is measured on commercial FPGAs and is used to assess the effectiveness and potential gain of the proposed method on current FPGA architectures. Our partial-rerouting method achieved 5.25% improvement in critical path delay under a delay variability of σ/μ = 0.3, and is considerably less time consuming than using variation-aware full chipwise routing, which gave a slightly better timing gain of 6.41% but requires 8x more execution time when optimising for 100 target FPGAs with unique variation maps.

Citations: 5

Quantum FPGA architecture design

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718386

Jialin Chen, Lingli Wang, Bin Wang

A Quantum FPGA (QFPGA) architecture is presented for programmable quantum computing, which is a hybrid architecture combining the advantages of the measurement-based quantum computation and the qubus system. QFPGA consists of Quantum Logic Blocks (QLBs) and Quantum Routing Channels (QRCs). The QLB is used to realize a small quantum logic while the QRC is to combine them properly for larger logic realization. There are two types of buses in QFPGA, the local bus in the QLB and the global bus in the QRC, which are to generate the cluster states and general multiqubit rotations around the z axis respectively. However for some applications such as Grover's algorithm and n-qubit quantum Fourier transform, one QLB can be configured for four-qubit phase shift module and four-qubit quantum Fourier transform respectively.

Citations: 2

Implementation of a highly scalable blokus duo solver on FPGA

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718423

Chester Liu

This paper presents a highly scalable hardware solver for Blokus Duo. Based on the flat Monte Carlo method, the proposed solver consists of self-contained agents whose number is configurable and limited only by FPGA capacity, which makes the solver highly scalable. Data structures and tile representations are tailored to support efficient memory usage and operations. Implementation results show that an agent can operate at up to 150MHz while requiring fewer than 3000 LUTs on the Altera Cyclone II EP2C70F896C6 FPGA device. Simulation results show that the proposed solver can always win against level 1 Pentobi.
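The flat Monte Carlo method named in this abstract can be sketched generically: every legal move is scored by a fixed number of random playouts, and the move with the best win rate is chosen, with no search tree. The sketch below is an illustration only, using a toy single-pile Nim game (take 1 to 3 stones, taking the last stone wins) as a stand-in for Blokus Duo; all function names are hypothetical and none come from the paper.

```python
import random
from typing import Callable, List, Optional


def flat_monte_carlo(state: int,
                     legal_moves: Callable[[int], List[int]],
                     apply_move: Callable[[int, int], int],
                     playout: Callable[[int, random.Random], int],
                     n_playouts: int,
                     rng: random.Random) -> Optional[int]:
    """Pick the move whose random playouts win most often."""
    best_move, best_score = None, -1.0
    for move in legal_moves(state):
        nxt = apply_move(state, move)
        wins = sum(playout(nxt, rng) for _ in range(n_playouts))
        score = wins / n_playouts
        if score > best_score:
            best_move, best_score = move, score
    return best_move


# Toy game used only for illustration: single-pile Nim.
def nim_moves(stones: int) -> List[int]:
    return [take for take in (1, 2, 3) if take <= stones]


def nim_apply(stones: int, take: int) -> int:
    return stones - take


def nim_playout(stones: int, rng: random.Random) -> int:
    """Play random moves to the end; return 1 if the player who just moved wins.

    The opponent moves first, because `stones` is the state after our move.
    """
    our_turn = False
    while stones > 0:
        stones -= rng.choice(nim_moves(stones))
        our_turn = not our_turn
    # The player who is NOT to move next took the last stone.
    return 0 if our_turn else 1


move = flat_monte_carlo(3, nim_moves, nim_apply, nim_playout, 50, random.Random(0))
print(move)
```

Because each move's playouts are independent, the per-move loops can run in parallel, which fits the abstract's description of a configurable number of self-contained agents.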

Citations: 4

StML: Bridging the gap between FPGA design and HDL circuit description

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718366

Dustin Peterson, O. Bringmann, Thomas Schweizer, W. Rosenstiel

FPGA circuit implementation is a unidirectional and time-consuming process. Existing approaches like the incremental synthesis try to shorten it, but still need to execute the whole flow for a changed circuit partition. Other approaches circumvent process stages by providing bidirectional mappings between their results. In this paper we propose an approach to provide a bidirectional link between an FPGA design and its HDL code. This link enables the circumvention of the most time-consuming stages (synthesis, mapping, placing, routing) of the FPGA circuit implementation. We implemented our approach in a Java-based EDA tool library, called Static Mapping Library (StML). We demonstrate its applicability by means of hardware debugging and an RTL-based injection of permanent faults, built on top of the StML. Experimental results illustrate that a mapping coverage between 98.5%-100.0% can be obtained, which substantiates the feasibility of this approach. Further experiments illustrate a controllable tradeoff between area overhead, circuit granularity and mapping granularity. With the finest mapping granularity, the area overhead has been between 1.8% and 60.2% for RTL-based circuits. The speedup of the proposed fault injection method has been estimated to be up to 6x for the tested circuits.

Citations: 7

ZCluster: A Zynq-based Hadoop cluster

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718411

Zhongduo Lin, P. Chow

ARM-based servers are garnering increasing interest in big data processing for their low power consumption. However, they are ill-suited for compute-intensive tasks due to their poor processing capability compared to the CPUs used in a traditional server. This paper describes our early efforts to integrate the processing power of the FPGA with the ARM processor inside the Xilinx Zynq SoC. An eight-slave Zynq-based Hadoop cluster is built and a customized hardware accelerator for a standard FIR filter is implemented to demonstrate the effectiveness of hardware acceleration. The Xillybus is used for communication between the ARM processor and the FPGA fabric, achieving a bandwidth of 103MB/s. The Hadoop cluster is proved to be linearly scalable with different input sizes and numbers of slaves. Overall, the cluster achieves a 3.3-fold speedup compared to a native pure software implementation on a single ARM processor and about a 20% improvement compared to an ARM-based cluster without hardware accelerators.
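The workload accelerated in this abstract is a standard FIR filter. As a reference point, a direct-form FIR filter computes each output as y[n] = sum over k of h[k] * x[n-k]. The sketch below is a plain software version of that formula, assuming samples before the start of the input are zero; the function name `fir_filter` is hypothetical, and the code is unrelated to the paper's hardware accelerator or its Hadoop integration.

```python
from typing import List, Sequence


def fir_filter(samples: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Direct-form FIR filter: y[n] = sum_k taps[k] * samples[n - k]."""
    out: List[float] = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:               # samples before index 0 are treated as zero
                acc += h * samples[n - k]
        out.append(acc)
    return out


# Two-tap moving average.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```

Each output depends only on a short window of inputs, so the inner multiply-accumulate loop is the part that a hardware implementation can unroll into parallel multipliers.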

Citations: 48

A connection-based router for FPGAs

2013 International Conference on Field-Programmable Technology (FPT)

Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718378

Elias Vansteenkiste, Karel Bruneel, D. Stroobandt

The FPGA's interconnection network not only occupies a larger portion of the total silicon area than the logic available on the FPGA, it also accounts for the majority of the delay and power consumption. Therefore it is essential that routing algorithms are as efficient as possible. In this work the connection router is introduced. It is capable of partially ripping up and rerouting the routing trees of nets. To achieve this, the main congestion loop rips up and reroutes connections instead of nets, which allows the connection router to converge much faster to a solution. The connection router is compared with the VPR directed search router on the basis of VTR benchmarks on a modern commercial FPGA architecture. It is able to find routing solutions 4.4% faster for a relaxed routing problem and 84.3% faster for hard instances of the routing problem. Given the same amount of time as the VPR directed search, the connection router is able to find routing solutions with 5.8% fewer tracks per channel.

Citations: 3
