Latest Publications from the 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

High Level Synthesis Based E-Nose System for Gas Applications
Amine Ait Si Ali, A. Amira, F. Bensaali, M. Benammar, Muhammad Hassan, A. Bermak
This paper proposes a hardware/software co-design approach using the Zynq platform for the implementation of an electronic nose (EN) system, based on principal component analysis (PCA) as a dimensionality reduction technique and a decision tree (DT) as the classification algorithm, using a 4x4 in-house fabricated sensor. The system was successfully trained and simulated in the MATLAB environment prior to the implementation on the Zynq platform. High level synthesis was carried out on the proposed designs using different optimization directives, including loop unrolling, array partitioning and pipelining.
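A minimal software sketch of the two-stage flow described above, assuming a 16-element (4x4) sensor frame, a PCA basis learned offline, and a toy two-level decision tree; all weights and thresholds here are illustrative placeholders, not the trained values from the paper:

```cpp
#include <array>
#include <cstddef>
#include <iostream>

// Project a 16-element sensor frame onto 2 principal components,
// then classify with a tiny hand-rolled decision tree.
constexpr std::size_t kSensors = 16;  // 4x4 sensor array
constexpr std::size_t kComponents = 2;

// Placeholder PCA basis (rows = principal components); a real system
// would load the matrix learned offline in MATLAB.
using Basis = std::array<std::array<double, kSensors>, kComponents>;

std::array<double, kComponents> project(const Basis& w,
                                        const std::array<double, kSensors>& x) {
  std::array<double, kComponents> y{};
  for (std::size_t c = 0; c < kComponents; ++c)  // fully unrollable loop
    for (std::size_t s = 0; s < kSensors; ++s)   // maps to a MAC pipeline
      y[c] += w[c][s] * x[s];
  return y;
}

// Illustrative decision tree: thresholds are placeholders, not the
// values trained in the paper.
int classify(const std::array<double, kComponents>& y) {
  if (y[0] < 0.5) return y[1] < 0.2 ? 0 : 1;  // e.g. gas classes 0/1
  return y[1] < 0.8 ? 2 : 3;                  // e.g. gas classes 2/3
}

int main() {
  Basis w{};               // zero basis: stands in for trained weights
  std::array<double, kSensors> frame{};
  frame.fill(1.0);         // dummy sensor reading
  std::cout << "class " << classify(project(w, frame)) << '\n';
}
```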
Citations: 1
High-Throughput and Energy-Efficient Graph Processing on FPGA
Shijie Zhou, C. Chelmis, V. Prasanna
In this paper, we propose a novel design for large-scale graph processing on FPGA. Our design uses large external memory for storing massive graph data and FPGA for acceleration, and leverages edge-centric computing principles. We propose a data layout which optimizes the external memory performance and leads to an efficient memory activation schedule to reduce on-chip memory power consumption. Further, we develop a parallel architecture on FPGA which can saturate the external memory bandwidth and concurrently process multiple input data to increase throughput. We use our design to accelerate several classic graph algorithms, including single-source shortest path, weakly connected component, and minimum spanning tree. Experimental results show that for all the considered graph algorithms, our design achieves high throughput of over 600 million traversed edges per second (MTEPS) and high energy-efficiency of over 30 MTEPS/W. Compared with a baseline design, our optimizations result in over 3.6× throughput and 5.8× energy-efficiency improvements, respectively. Our design achieves 32% throughput improvement when compared with state-of-the-art FPGA designs, and up to 7.8× speedup when compared with state-of-the-art multi-core implementation.
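The edge-centric principle the design leverages can be sketched as a purely sequential pass over a flat edge list, as in this simplified Bellman-Ford-style single-source shortest path in software (the paper's pipelined FPGA datapath and memory activation schedule are not modelled here):

```cpp
#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

// One edge of the graph, stored as a flat array so that the access
// pattern over external memory is purely sequential (edge-centric).
struct Edge { uint32_t src, dst, weight; };

// Relaxation passes: stream all edges repeatedly until no distance
// changes. Each pass reads edges sequentially, which is the property
// the FPGA design exploits to saturate memory bandwidth.
std::vector<uint32_t> sssp(uint32_t n, const std::vector<Edge>& edges,
                           uint32_t source) {
  const uint32_t INF = std::numeric_limits<uint32_t>::max();
  std::vector<uint32_t> dist(n, INF);
  dist[source] = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    for (const Edge& e : edges) {               // sequential edge stream
      if (dist[e.src] != INF && dist[e.src] + e.weight < dist[e.dst]) {
        dist[e.dst] = dist[e.src] + e.weight;   // "scatter" an update
        changed = true;
      }
    }
  }
  return dist;
}

int main() {
  std::vector<Edge> g{{0, 1, 4}, {0, 2, 1}, {2, 1, 2}};
  for (uint32_t d : sssp(3, g, 0)) std::cout << d << ' ';
  std::cout << '\n';
}
```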
Citations: 75
A Content Adapted FPGA Memory Architecture with Pattern Recognition Capability for L1 Track Triggering in the LHC Environment
T. Harbaum, M. Seboui, M. Balzer, J. Becker, M. Weber
Modern high-energy physics experiments such as the Compact Muon Solenoid (CMS) experiment at CERN produce an extraordinary amount of data every 25 ns. To handle a data rate of more than 50 Tbit/s, a multi-level trigger system is required to reduce the data rate. Due to the increased luminosity after the Phase-II upgrade of the LHC, the CMS tracking system has to be redesigned. The current trigger system is unable to handle the resulting amount of data after this upgrade. Because of the latency budget of a few microseconds, the Level 1 Track Trigger has to be implemented in hardware. State-of-the-art pattern recognition filters the incoming data by template matching on ASICs with a content addressable memory architecture. An implementation on an FPGA, replacing the content addressable memory of the ASIC, has not been possible so far. This paper presents a new approach to a content addressable memory architecture which allows an FPGA-based implementation. Combining filtering and track finding in one FPGA design opens many possibilities for adjusting the two algorithms to each other, and the FPGA architecture enables more flexibility than the ASIC. The presented design minimizes the stored data by logic to optimally utilize the available resources of an FPGA. Furthermore, the developed design meets the strong timing constraints and possesses the required properties of the content addressable memory.
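The CAM-style template matching at the heart of such triggers can be modelled in a few lines: every stored pattern is compared against the query word (in parallel in hardware, with one comparator per entry), and all matching addresses are reported. The widths and patterns below are illustrative only:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Software model of a CAM lookup: all entries are compared against the
// query "simultaneously" (here a loop; in hardware, one comparator per
// entry) and every matching address is returned.
std::vector<std::size_t> cam_match(const std::vector<uint64_t>& patterns,
                                   uint64_t query, uint64_t care_mask) {
  std::vector<std::size_t> hits;
  for (std::size_t addr = 0; addr < patterns.size(); ++addr)
    if (((patterns[addr] ^ query) & care_mask) == 0)  // masked compare
      hits.push_back(addr);
  return hits;
}

int main() {
  std::vector<uint64_t> bank{0xAB, 0xCD, 0xAB};  // placeholder templates
  for (std::size_t a : cam_match(bank, 0xAB, 0xFF))
    std::cout << "hit at " << a << '\n';
}
```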
Citations: 16
P4-to-VHDL: Automatic Generation of 100 Gbps Packet Parsers
Pavel Benácek, V. Pus, H. Kubátová
Software Defined Networking and OpenFlow offer an elegant way to decouple the network control plane from the data plane. This decoupling has led to great innovation in the control plane, yet data plane changes come at a much slower pace, mainly due to the hard-wired implementation of network switches. The P4 language aims to overcome this obstacle by providing a description of customized packet processing functionality for configurable switches. That enables a new generation of possibly heterogeneous networking hardware that can be tailored at runtime to the needs of particular applications from various domains. In this paper we contribute to the idea of P4 by presenting the design, analysis and experimental results of our packet parser generator. The generator converts a P4 parse graph description to synthesizable VHDL code suitable for FPGA implementation. Our results show that the generated circuit is able to parse 100 Gbps traffic with a fairly complex protocol structure at line rate on a Xilinx Virtex-7 FPGA. The approach can be used not only in switches, but also in other appliances, such as application accelerators and smart NICs. We compare the generated output to a hand-written parser to show that the price for configurability is only a slightly larger and slower circuit.
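A P4 parse graph boils down to a walk over packet headers in which each node extracts fixed-width fields and a selector field chooses the next node. The sketch below models a minimal Ethernet-to-IPv4 walk in software; the generator described in the paper emits the equivalent logic as pipelined VHDL:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Minimal model of a parse-graph walk: Ethernet -> IPv4 -> done.
// Each state extracts a selector field that picks the next state,
// which is exactly what a P4 parser specifies.
struct Result { bool has_ipv4 = false; uint8_t ipv4_proto = 0; };

Result parse(const std::vector<uint8_t>& pkt) {
  Result r;
  if (pkt.size() < 14) return r;        // Ethernet header: 14 bytes
  uint16_t ethertype = (pkt[12] << 8) | pkt[13];
  if (ethertype != 0x0800) return r;    // selector: 0x0800 -> IPv4
  if (pkt.size() < 14 + 20) return r;   // minimal IPv4 header: 20 bytes
  r.has_ipv4 = true;
  r.ipv4_proto = pkt[14 + 9];           // protocol field at IP offset 9
  return r;
}

int main() {
  std::vector<uint8_t> pkt(34, 0);
  pkt[12] = 0x08; pkt[13] = 0x00;       // EtherType = IPv4
  pkt[14 + 9] = 6;                      // protocol = TCP
  Result r = parse(pkt);
  std::cout << "ipv4=" << r.has_ipv4 << " proto=" << int(r.ipv4_proto) << '\n';
}
```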
Citations: 46
Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform
Ren Chen, V. Prasanna
Accelerating database applications using FPGAs has recently been an area of growing interest in both academia and industry. Equi-join is one of the key database operations; its performance highly depends on sorting, which exhibits high memory usage on FPGA. A fully pipelined N-key merge sorter consists of log N sorting stages using O(N) memory in total. For large data sets, external memory has to be employed to perform data buffering between the sorting stages. This introduces pipeline stalls as well as several iterations between the FPGA and external memory, causing significant performance degradation. In this paper, we speed up equi-join using a hybrid CPU-FPGA heterogeneous platform. To alleviate the performance impact of limited memory, we propose a merge-sort-based hybrid design where the first few sorting stages in the merge sort tree are replaced with "folded" bitonic sorting networks. These "folded" bitonic sorting networks operate in parallel on the FPGA. The partial results are then merged on the CPU to produce the final sorted result. Based on this hybrid sorting design, we develop two streaming join algorithms by optimizing the classic CPU-based nested-loop join and sort-merge join algorithms. On a range of data set sizes, our design achieves throughput improvements of 3.1x and 1.9x compared with software-only and FPGA-only implementations, respectively. Our design sustains 21.6% of the peak bandwidth, 3.9x the utilization obtained by the state-of-the-art FPGA equi-join implementation.
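A rough software model of the hybrid flow, with chunk-wise sorting standing in for the FPGA's "folded" bitonic networks and a classic sort-merge equi-join on the CPU side; the k-way merge of the sorted runs is simplified here to a single sort:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

using Row = std::pair<uint32_t, uint32_t>;  // (join key, payload)

// Stand-in for the FPGA stage: in the paper, "folded" bitonic networks
// sort fixed-size chunks in parallel; here each chunk is just sorted.
std::vector<Row> fpga_sort_chunks(std::vector<Row> rows, std::size_t chunk) {
  for (std::size_t i = 0; i < rows.size(); i += chunk)
    std::sort(rows.begin() + i,
              rows.begin() + std::min(i + chunk, rows.size()));
  // CPU stage: merge the sorted runs into one fully sorted stream
  // (simplification of the k-way merge).
  std::sort(rows.begin(), rows.end());
  return rows;
}

// Classic sort-merge equi-join over the two sorted streams.
std::vector<std::pair<Row, Row>> equi_join(const std::vector<Row>& a,
                                           const std::vector<Row>& b) {
  std::vector<std::pair<Row, Row>> out;
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i].first < b[j].first) ++i;
    else if (a[i].first > b[j].first) ++j;
    else {  // equal keys: emit the cross product of the matching runs
      std::size_t j0 = j;
      for (; j < b.size() && b[j].first == a[i].first; ++j)
        out.push_back({a[i], b[j]});
      ++i; j = j0;
    }
  }
  return out;
}

int main() {
  auto r = fpga_sort_chunks({{3, 0}, {1, 1}, {2, 2}}, 2);
  auto s = fpga_sort_chunks({{2, 7}, {3, 8}}, 2);
  std::cout << equi_join(r, s).size() << " matches\n";  // prints 2
}
```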
Citations: 35
Acceleration of the Pair-HMM Algorithm for DNA Variant Calling
Sitao Huang, G. Manikandan, Anand Ramachandran, K. Rupnow, Wen-mei W. Hwu, Deming Chen
In this project, we propose an SoC solution to accelerate the Pair-HMM forward algorithm, which is the key performance bottleneck in GATK's HaplotypeCaller tool for DNA variant calling. We develop two versions of the Pair-HMM accelerator: one using High Level Synthesis (HLS), the other a ring-based manual RTL implementation. We investigate the performance of the manual RTL design and the HLS design in terms of design flexibility and overall run-time. We achieve a significant speed-up of up to 19x through the HLS implementation and of up to 95x through the RTL implementation of the algorithm.
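The kernel being accelerated is the Pair-HMM forward recurrence over a read and a haplotype, with match, insertion and deletion state matrices. The sketch below uses illustrative transition and emission probabilities, not GATK's calibrated values:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Simplified Pair-HMM forward pass. Three state matrices (match,
// insert, delete) are filled with the standard recurrences; the
// constants are placeholders for the tool's calibrated probabilities.
double forward(const std::string& read, const std::string& hap) {
  const double tMM = 0.9, tMI = 0.05, tMD = 0.05;  // transitions
  const double tII = 0.1, tIM = 0.9, tDD = 0.1, tDM = 0.9;
  const std::size_t R = read.size(), H = hap.size();
  std::vector<std::vector<double>> M(R + 1, std::vector<double>(H + 1, 0)),
      I = M, D = M;
  for (std::size_t j = 0; j <= H; ++j) D[0][j] = 1.0 / H;  // free start
  for (std::size_t i = 1; i <= R; ++i) {
    for (std::size_t j = 1; j <= H; ++j) {
      double emit = (read[i - 1] == hap[j - 1]) ? 0.99 : 0.01;
      M[i][j] = emit * (tMM * M[i-1][j-1] + tIM * I[i-1][j-1] +
                        tDM * D[i-1][j-1]);
      I[i][j] = 0.25 * (tMI * M[i-1][j] + tII * I[i-1][j]);
      D[i][j] = tMD * M[i][j-1] + tDD * D[i][j-1];
    }
  }
  double p = 0;                        // sum over the last read row
  for (std::size_t j = 1; j <= H; ++j) p += M[R][j] + I[R][j];
  return p;
}

int main() { std::cout << forward("ACGT", "ACGT") << '\n'; }
```

The anti-diagonal dependency pattern of these recurrences (each cell needs its upper, left, and upper-left neighbours) is what the pipelined and ring-based hardware designs exploit.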
Citations: 46
Runtime Parameterizable Regular Expression Operators for Databases
Z. István, David Sidler, G. Alonso
Relational databases execute user queries through operator trees, where each operator has a well-defined interface and a specific task (e.g., arithmetic function, pattern matching, aggregation, etc.). Hardware acceleration of compute-intensive operators is a promising prospect, but it comes with challenges. Databases execute tens of thousands of different queries per second. Thus, if only one specific instantiation of an operator is supported by the accelerator, it will have little effect on the overall workload. In this paper we explore the tradeoff between resource efficiency and expression complexity for an FPGA accelerator targeting string-matching operators (LIKE and REGEXP_LIKE in SQL). This tradeoff is complex. For instance, the FPGA does not always win: simple queries that can be answered from indexes run faster on the CPU. On complex regular expressions, the FPGA is faster but needs to be parameterized at runtime to be able to support different queries. For very long patterns, the entire expression might not fit into the FPGA circuit and a combined CPU-FPGA mode must be chosen. We evaluate our design on a heterogeneous multi-core machine in which the FPGA has cache-coherent access to the CPU memory. In addition to the string matching circuit, we also show how to implement database page parsing logic so as to be able to work directly on the same in-memory data structures as the database engine.
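The runtime parameterization can be illustrated with a matcher whose structure is fixed while its pattern bytes and wildcard flags are loaded per query, so no resynthesis is needed. A simplified software model of such a LIKE-style operator (the comparator chain and the SQL '%...%' semantics are assumptions for illustration):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Model of a runtime-parameterizable matcher: the circuit structure
// (a chain of comparators with optional wildcards) is fixed, and only
// the pattern bytes and wildcard flags are written at runtime,
// mirroring how the FPGA operator avoids per-query resynthesis.
struct LikeMatcher {
  std::vector<char> pat;   // pattern bytes, loaded at runtime
  std::vector<bool> any;   // true where the SQL '_' wildcard sits

  // SQL LIKE '%<pat>%' semantics: does the pattern occur anywhere?
  bool matches(const std::string& s) const {
    if (pat.size() > s.size()) return false;
    for (std::size_t off = 0; off + pat.size() <= s.size(); ++off) {
      bool ok = true;
      for (std::size_t k = 0; k < pat.size(); ++k)
        if (!any[k] && s[off + k] != pat[k]) { ok = false; break; }
      if (ok) return true;
    }
    return false;
  }
};

int main() {
  LikeMatcher m{{'d', 'b', '?'}, {false, false, true}};  // LIKE '%db_%'
  std::cout << m.matches("rdbms") << '\n';               // prints 1
}
```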
Citations: 30
Knowledge Transfer in Automatic Optimisation of Reconfigurable Designs
Maciej Kurek, M. Deisenroth, W. Luk, T. Todman
This paper presents a novel approach for automatic optimisation of reconfigurable design parameters based on knowledge transfer. The key idea is to make use of insights derived from optimising related designs to benefit future optimisations. We show how to use designs targeting one device to speed up optimisation of another device. The proposed approach is evaluated based on various applications including computational finance and seismic imaging. It is capable of achieving up to 35% reduction in optimisation time in producing designs with similar performance, compared to alternative optimisation methods.
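A toy illustration of the transfer idea: a local search over one design parameter for a new device is warm-started from the optimum found for a related device instead of from a default guess. The quadratic cost functions below stand in for real build-and-measure runs:

```cpp
#include <functional>
#include <iostream>

// Hill-climb over one integer design parameter (e.g. pipeline depth).
// cost() is a placeholder for building the design and measuring it.
int local_search(std::function<double(int)> cost, int start, int lo, int hi) {
  int best = start;
  for (bool improved = true; improved;) {
    improved = false;
    for (int cand : {best - 1, best + 1})
      if (cand >= lo && cand <= hi && cost(cand) < cost(best)) {
        best = cand;
        improved = true;
      }
  }
  return best;
}

int main() {
  auto devA = [](int d) { return (d - 12) * (d - 12); };  // optimum at 12
  auto devB = [](int d) { return (d - 14) * (d - 14); };  // related optimum
  int startB = local_search(devA, 1, 1, 32);  // knowledge from device A
  int bestB = local_search(devB, startB, 1, 32);  // warm-started search
  std::cout << "device B optimum: " << bestB << '\n';
}
```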
Citations: 9
FPGA-Based Reduction Techniques for Efficient Deep Neural Network Deployment
A. Page, T. Mohsenin
Deep neural networks have been shown to outperform prior state-of-the-art solutions that often relied heavily on hand-engineered feature extraction techniques coupled with simple classification algorithms. In particular, deep max-pooling convolutional neural networks (MPCNN) have been shown to dominate several popular public benchmarks. Unfortunately, the benefits of deep networks have yet to be exploited in embedded, resource-bound settings that have strict power and area budgets. GPUs have been shown to improve throughput and energy-efficiency over CPUs due to their parallel architecture. In a similar fashion, FPGAs can improve performance while allowing finer control over the implementation. In order to meet power, area, and latency constraints, it is necessary to develop network reduction strategies in addition to optimal mapping. This work looks at two specific reduction techniques: limited precision for both fixed-point and floating-point formats, and weight matrix truncation using singular value decomposition. An FPGA-based framework is also proposed and used to deploy the trained networks. To demonstrate, a handful of public computer vision datasets including MNIST, CIFAR-10, and SVHN are fully implemented on a low-power Xilinx Artix-7 FPGA. Experimental results show that all networks are able to achieve a classification throughput of 16 img/sec and consume less than 700 mW when running at 200 MHz. In addition, the reduced networks are able, on average, to reduce power and area utilization by 37% and 44%, respectively, while incurring less than a 0.20% decrease in accuracy.
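The SVD-based truncation replaces an m x n weight matrix W with two low-rank factors computed offline, so inference applies two small matrix-vector products instead of one large one, cutting multiplies from m*n to k*(m+n). A sketch with placeholder factors (the actual decomposition would be done offline, e.g. in MATLAB):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Applying an SVD-truncated fully connected layer: W (m x n) is
// approximated as U (m x k) times V (k x n), with U and V assumed to
// come from an offline decomposition. Values below are placeholders.
using Mat = std::vector<std::vector<float>>;

std::vector<float> matvec(const Mat& a, const std::vector<float>& x) {
  std::vector<float> y(a.size(), 0.f);
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t j = 0; j < x.size(); ++j)
      y[i] += a[i][j] * x[j];
  return y;
}

// Truncated layer: y = U * (V * x) -- two small matvecs in place of
// one large one.
std::vector<float> truncated_layer(const Mat& u, const Mat& v,
                                   const std::vector<float>& x) {
  return matvec(u, matvec(v, x));
}

int main() {
  Mat u{{1.f, 0.f}, {0.f, 1.f}, {1.f, 1.f}};               // 3 x 2, k = 2
  Mat v{{0.5f, 0.f, 0.f, 0.5f}, {0.f, 0.5f, 0.5f, 0.f}};   // 2 x 4
  std::vector<float> x{1.f, 2.f, 3.f, 4.f};
  for (float y : truncated_layer(u, v, x)) std::cout << y << ' ';
  std::cout << '\n';
}
```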
Citations: 8
Parallelizing FPGA Technology Mapping through Partitioning
Chuyu Shen, Zili Lin, Ping Fan, X. Meng, Weikang Qian
The traditional FPGA technology mapping flow is very time-consuming as modern FPGA designs become larger. To speed up this procedure, this paper proposes a new approach based on circuit partitioning to parallelize it. The idea is to split the original circuit into several sub-circuits and assign each one to a core of a multi-core processor for simultaneous technology mapping. Compared to other existing parallelization methods, our method has the benefit of being independent of the detailed mapping algorithm. Our proposed partitioning method is able to minimize the quality loss caused by the partitioning. We have successfully integrated the proposed approach into an industrial FPGA mapping platform. The proposed flow gains a speed-up of 1.6X on average on a quad-core processor with negligible influence on the LUT count and the critical path length.
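The flow can be skeletonized as: partition the netlist, map each partition on its own core, then combine the results. In the sketch below, map_partition() is a placeholder for any real LUT-mapping pass, reflecting the approach's independence from the detailed mapping algorithm:

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Placeholder "mapping" pass: pretend every 4 nodes fold into one LUT.
// A real flow would run its LUT mapper on the sub-circuit here.
int map_partition(const std::vector<int>& nodes) {
  return (static_cast<int>(nodes.size()) + 3) / 4;
}

int main() {
  std::vector<int> netlist(1000);             // stand-in for circuit nodes
  std::iota(netlist.begin(), netlist.end(), 0);

  const unsigned cores = 4;
  std::size_t chunk = (netlist.size() + cores - 1) / cores;
  std::vector<int> luts(cores, 0);
  std::vector<std::thread> workers;
  for (unsigned c = 0; c < cores; ++c)
    workers.emplace_back([&, c] {
      std::size_t lo = c * chunk;
      std::size_t hi = std::min(netlist.size(), lo + chunk);
      // Each core maps its own sub-circuit independently.
      luts[c] = map_partition(
          std::vector<int>(netlist.begin() + lo, netlist.begin() + hi));
    });
  for (auto& w : workers) w.join();

  std::cout << "total LUTs: "
            << std::accumulate(luts.begin(), luts.end(), 0) << '\n';
}
```

A real partitioner would cut along low-connectivity boundaries rather than by index, which is where the paper's quality-loss minimization comes in.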
Citations: 4