G. Grewal, S. Areibi, Matthew Westrik, Ziad Abuowaimer, Betty Zhao
Many of the key stages in the traditional FPGA CAD flow require substantial amounts of computational effort. Moreover, due to limited overlap among individual stages, poor decisions made in earlier stages often adversely affect the quality of results in later stages. To help address these issues, we propose a machine-learning framework that uses training data to learn the underlying relationship between circuits and the CAD algorithms used to map them onto a particular FPGA device. The framework does not solve the problem at an arbitrary stage in the flow; rather, it seeks to assist the designer or the tool in solving the problem. The potential capabilities of the framework are demonstrated by applying it to the placement stage, where it is used to recommend the best placement flow for circuits with different features and to predict placement and routing results without actually performing placement and routing. Results show that when trained using 372 challenging benchmarks for a Xilinx UltraScale device, the classification models employed in the framework achieve average accuracies in the range of 92% to 95%, while the regression models have average error rates in the range of 0.5% to 3.6%.
{"title":"A Machine Learning Framework for FPGA Placement (Abstract Only)","authors":"G. Grewal, S. Areibi, Matthew Westrik, Ziad Abuowaimer, Betty Zhao","doi":"10.1145/3020078.3021765","DOIUrl":"https://doi.org/10.1145/3020078.3021765","url":null,"abstract":"Many of the key stages in the traditional FPGA CAD flow require substantial amounts of computational effort. Moreover, due to limited overlap among individual stages, poor decisions made in earlier stages will often adversely affect the quality of result in later stages. To help address these issues, we propose a machine-learning framework that uses training data to learn the underlying relationship between circuits and the CAD algorithms used to map them onto a particular FPGA device. The framework does not solve the problem at an arbitrary stage in the flow. Rather, it seeks to assist the designer or the tool to solve the problem. The potential capabilities of the framework are demonstrated by applying it to the placement stage, where it is used to recommend the best placement flow for circuits with different features, and to predict placement and routing results without actually performing placement and routing. Results show that when trained using 372 challenging benchmarks for a Xilinx UltraScale device, the classification models employed in the framework achieve average accuracies in the range 92% to 95%, while the regression models have an average error rate in the range of 0.5% to 3.6%.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128961674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Cong, Zhenman Fang, Muhuan Huang, Libo Wang, Di Wu
To efficiently process tremendous amounts of data, today's big data applications tend to distribute their datasets into multiple partitions, such that each partition fits into memory and can be processed by a separate core/server in parallel. Meanwhile, given the limited scaling of general-purpose CPUs, FPGAs have emerged as an attractive alternative for accelerating big data applications thanks to their low power, high performance, and energy efficiency. In this paper we aim to answer one key question: how should the multicore CPU and the FPGA coordinate to optimize the performance of big data applications? To address this question, we conduct a step-by-step case study of CPU-FPGA co-optimization for in-memory Samtool sorting in genomic data processing, one of the most important big data applications for personalized healthcare. First, to accelerate the time-consuming compression algorithm and its associated cyclic redundancy check (CRC) in Samtool sorting, we implement a portable and maintainable FPGA accelerator using high-level synthesis (HLS). Although FPGAs are traditionally well suited to compression and CRC, we find that a straightforward integration of this FPGA accelerator into the multi-threaded Samtool sorting achieves only a marginal system throughput improvement over the software baseline running on a 12-core CPU. To improve system performance, we propose a dataflow execution model that effectively orchestrates the computation between the multi-threaded CPU and the FPGA. Experimental results show that our co-optimized CPU-FPGA system achieves a 2.6x speedup for in-memory Samtool sorting.
{"title":"CPU-FPGA Co-Optimization for Big Data Applications: A Case Study of In-Memory Samtool Sorting (Abstract Only)","authors":"J. Cong, Zhenman Fang, Muhuan Huang, Libo Wang, Di Wu","doi":"10.1145/3020078.3021787","DOIUrl":"https://doi.org/10.1145/3020078.3021787","url":null,"abstract":"To efficiently process a tremendous amount of data, today's big data applications tend to distribute the datasets into multiple partitions, such that each partition can be fit into memory and be processed by a separate core/server in parallel. Meanwhile, due to the limited scaling of general-purpose CPUs, FPGAs have emerged as an attractive alternative to accelerate big data applications due to their low power, high performance and energy efficiency. In this paper we aim to answer one key question: How should the multicore CPU and FPGA coordinate together to optimize the performance of big data applications? To address the above question, we conduct a step-by-step case study to perform CPU and FPGA co-optimization for in-memory Samtool sorting in genomic data processing, which is one of the most important big data applications for personalized healthcare. First, to accelerate the time-consuming compression algorithm and its associated cyclic redundancy check (CRC) in Samtool sorting, we implement a portable and maintainable FPGA accelerator using high-level synthesis (HLS). Although FPGAs are traditionally well-known to be suitable for compression and CRC, we find that a straightforward integration of this FPGA accelerator into the multi-threaded Samtool sorting only achieves marginal system throughput improvement over the software baseline running on a 12-core CPU. To improve system performance, we propose a dataflow execution model to effectively orchestrate the computation between the multi-threaded CPU and FPGA. Experimental results show that our co-optimized CPU-FPGA system achieves a 2.6x speedup for in-memory Samtool sorting.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"439 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114002022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subho Sankar Banerjee, Mohamed El-Hadedy, Jong Bin Lim, Daniel Chen, Z. Kalbarczyk, Deming Chen, Ravishankar K. Iyer
The proliferation of high-throughput sequencing machines allows billions of short nucleotide fragments to be generated in a short period. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This poster explores the use of hardware acceleration to significantly improve the runtime of short-read alignment (SRA), a crucial step in pre-processing sequenced genomes. It presents the design and implementation of ASAP, an accelerator for computing Levenshtein distance (LD) in the context of the SRA problem. LD computation is a prominent underlying mathematical kernel common to a large number of SRA tools (e.g., BLAST, BWA, SNAP) and is responsible for 50-70% of their runtime. These tools calculate the exact value of LD between nucleotide strings but use it only to build a total ordering (an ordered list) of the most likely points of origin in the genome. ASAP instead computes an approximation of LD by encoding the computation in the propagation delay of circuit elements. This approximation is calculated in an accelerated fashion in hardware and preserves the total ordering of LDs produced by the traditional algorithms. The computation is performed by constructing circuits that embody the recursive definition of LD and measuring the propagation delay of a signal entering and leaving the circuit. Additionally, ASAP can explore large portions of the search space (substrings of the strings being compared) within one clock cycle and ignore parts of the search space that do not contribute to an answer. Our design is implemented on an Altera Stratix V FPGA in an IBM POWER8 system, using the CAPI interface for cache coherence between the CPU and FPGA. Our design is 200x faster (median measurement) than an equivalent C implementation of the kernel running on the host processor, and 2.2x faster as part of an end-to-end alignment tool for 120-150bp short-read sequences.
{"title":"ASAP: Accelerated Short Read Alignment on Programmable Hardware (Abstract Only)","authors":"Subho Sankar Banerjee, Mohamed El-Hadedy, Jong Bin Lim, Daniel Chen, Z. Kalbarczyk, Deming Chen, Ravishankar K. Iyer","doi":"10.1145/3020078.3021796","DOIUrl":"https://doi.org/10.1145/3020078.3021796","url":null,"abstract":"The proliferation of high-throughput sequencing machines allows for the rapid generation of billions of short nucleotide fragments in a short period. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This poster explores the use of hardware acceleration to significantly improve the runtime of short-read alignment (SRA), a crucial step in pre-processing sequenced genomes. It presents the design and implementation of ASAP, an accelerator for computing Levenshtein distance (LD) in the context of the SRA problem. LD computation is a prominent underlying mathematical kernel that is common to a large number of SRA tools (e.g., BLAST, BWA, SNAP) and is responsible for 50-70% of their runtime. These algorithms mentioned above calculate the exact value of LD between nucleotide strings but only use them to build a total ordering (an ordered list) of the most likely point of origin in the genome. ASAP computes an approximation of LD by encoding computation in propagation delay of circuit elements. This approximation is calculated in an accelerated fashion in hardware and preserves the original total ordering of LDs produced by the traditional algorithms. This computation is performed by constructing circuits that comprise the recursive definition of the LD computation and measuring propagation delay of a signal entering and leaving the circuit. Additionally, ASAP can explore large portions of the search space (substrings of the strings being compared) within one clock cycle, and ignore parts of the search space that does not contribute to an answer. Our design is implemented on an Altera Stratix V FPGA in an IBM POWER8 system using the CAPI interface for cache coherence across the CPU and FPGA. Our design is 200x faster (median measurement) than the equivalent C implementation of the kernel running on the host processor and 2.2x faster for an end-to-end alignment tool for 120-150bp short-read sequences.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"97 3-4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114025403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Yin, Dajiang Liu, Lifeng Sun, Xinhan Lin, Leibo Liu, Shaojun Wei
Data flow graph (DFG) mapping is critical to compilation for spatial programmable architectures, where compilation time is a key factor in both time-to-market requirements and mapping success rate. Inspired by the great progress made on tree-search games using deep neural networks, we propose a framework that learns convolutional neural networks for mapping DFGs onto spatial programmable architectures. Because mapping is a process from a source to a target, we present a dual-input neural network that captures features from both the DFGs of applications and the Process Element Array (PEA) of spatial programmable architectures. To train the neural network, we design algorithms that automatically generate a data set from the PEA intermediate states of preprocessed DFGs. Finally, we demonstrate that the trained neural network identifies mapping quality with high accuracy, and that our proposed mapping approach is competitive in performance with state-of-the-art DFG mapping algorithms while greatly reducing compilation time.
{"title":"Learning Convolutional Neural Networks for Data-Flow Graph Mapping on Spatial Programmable Architectures (Abstract Only)","authors":"S. Yin, Dajiang Liu, Lifeng Sun, Xinhan Lin, Leibo Liu, Shaojun Wei","doi":"10.1145/3020078.3021801","DOIUrl":"https://doi.org/10.1145/3020078.3021801","url":null,"abstract":"Data flow graph (DFG) mapping is critical for the compiling of spatial programmable architecture, where compilation time is a key factor for both time-to-market requirement and mapping successful rate. Inspired from the great progress made in tree search game using deep neural network, we proposed a framework for learning convolutional neural networks for mapping DFGs onto spatial programmable architectures. Considering that mapping is a process from source to target, we present a dual-input neural network capturing features from both DFGs in applications and Process Element Array (PEA) in spatial programmable architectures. In order to train the neural network, algorithms are designed to automatically generate a data set from PEA intermediate states of preprocessed DFG. Finally, we demonstrate that the trained neural network can get high identifying accuracy of mapping quality and our proposed mapping approach are competitive with state-of-the-art DFG mapping algorithms in performance while the compilation time is greatly reduced.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"51 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114130588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nitish Kumar Srivastava, Steve Dai, R. Manohar, Zhiru Zhang
High-level synthesis (HLS) enables designing at a higher level of abstraction to effectively cope with the design complexity of emerging applications on modern programmable systems-on-chip (SoCs). While HLS continues to evolve with a growing set of algorithms, methodologies, and tools to efficiently map software designs onto optimized hardware architectures, realistic benchmark applications with sufficient complexity and enforceable constraints remain lacking. In this paper we present a case study of accelerating face detection based on the Viola-Jones algorithm on a programmable SoC using a C-based HLS flow. We also share our insights in porting a software-based design into a synthesizable implementation with HLS-specific data structures and optimizations. Our design achieves a frame rate of 30 frames per second, which is suitable for real-time applications, and its performance and quality of results are comparable to those of many traditional RTL implementations.
{"title":"Accelerating Face Detection on Programmable SoC Using C-Based Synthesis","authors":"Nitish Kumar Srivastava, Steve Dai, R. Manohar, Zhiru Zhang","doi":"10.1145/3020078.3021753","DOIUrl":"https://doi.org/10.1145/3020078.3021753","url":null,"abstract":"High-level synthesis (HLS) enables designing at a higher level of abstraction to effectively cope with design complexity of emerging applications on modern programmable system-on-chip (SoC). While HLS continues to evolve with a growing set of algorithms, methodologies, and tools to efficiently map software designs onto optimized hardware architectures, there continues to lack realistic benchmark applications with sufficient complexity and enforceable constraints. In this paper we present a case study of accelerating face detection based on the Viola Jones algorithm on a programmable SoC using a C-based HLS flow. We also share our insights in porting a software-based design into a synthesizable implementation with HLS-specific data structures and optimizations. Our design is able to achieve a frame rate of 30 frames per second which is suitable for realtime applications. Our performance and quality of results are comparable to those of many traditional RTL implementations.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121819536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Ling, J. Anderson
Deep learning has garnered significant visibility recently as an Artificial Intelligence (AI) paradigm, with success in wide-ranging applications such as image and speech recognition, natural language understanding, self-driving cars, and game playing (e.g., AlphaGo). This special session is devoted to exploring the potential role of FPGAs in this important, fast-evolving domain.
{"title":"The Role of FPGAs in Deep Learning","authors":"A. Ling, J. Anderson","doi":"10.1145/3020078.3030013","DOIUrl":"https://doi.org/10.1145/3020078.3030013","url":null,"abstract":"Deep learning has garnered significant visibility recently as an Artificial Intelligence (AI) paradigm, with success in wide ranging applications such as image and speech recognition, natural language understanding, self-driving cars, and game playing (e.g., Alpha Go). This special session is devoted to exploring the potential role of FPGAs in this important fast-evolving domain.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122370635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Steve Dai, Ritchie Zhao, Gai Liu, S. Srinath, Udit Gupta, C. Batten, Zhiru Zhang
The current pipelining approach in high-level synthesis (HLS) achieves high performance for applications with regular and statically analyzable memory access patterns. However, it cannot effectively handle infrequent data-dependent structural and data hazards, because they are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution and examine its implications on scheduling and quality of results. We propose to generate an aggressive pipeline at compile time while resolving hazards with memory port arbitration and squash-and-replay at run time. Our experiments targeting a Xilinx FPGA demonstrate promising performance improvements across a suite of representative benchmarks.
{"title":"Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis","authors":"Steve Dai, Ritchie Zhao, Gai Liu, S. Srinath, Udit Gupta, C. Batten, Zhiru Zhang","doi":"10.1145/3020078.3021754","DOIUrl":"https://doi.org/10.1145/3020078.3021754","url":null,"abstract":"Current pipelining approach in high-level synthesis (HLS) achieves high performance for applications with regular and statically analyzable memory access patterns. However, it cannot effectively handle infrequent data-dependent structural and data hazards because they are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution, and examine its implications on scheduling and quality of results. We propose to generate an aggressive pipeline at compile-time while resolving hazards with memory port arbitration and squash-and-replay at run-time. Our experiments targeting a Xilinx FPGA demonstrate promising performance improvement across a suite of representative benchmarks.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"465 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127428141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stylianos I. Venieris, C. Bouganis
In recent years, Convolutional Neural Networks (ConvNets) have become the state of the art in several Artificial Intelligence tasks. Across the range of applications, performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs provide a platform that can be optimally configured for different performance needs. However, the complexity of ConvNet models keeps increasing, leading to a large design space. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of transformations on the SDF graph in order to efficiently explore the architectural design space. By treating high-throughput and latency-critical systems separately, the presented tool is able to generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall, our framework yields designs that improve performance density and performance efficiency by up to 6× and 4.49×, respectively, over existing highly-optimised FPGA, DSP and embedded GPU work.
{"title":"fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only)","authors":"Stylianos I. Venieris, C. Bouganis","doi":"10.1145/3020078.3021791","DOIUrl":"https://doi.org/10.1145/3020078.3021791","url":null,"abstract":"In recent years, Convolutional Neural Networks (ConvNets) have become the state-of-the-art in several Artificial Intelligence tasks. Across the range of applications, the performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on the different performance needs. However, the complexity of ConvNet models keeps increasing leading to a large design space. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of transformations on the SDF graph in order to efficiently explore the architectural design space. By treating high-throughput and latency-critical systems separately, the presented tool is able to efficiently explore the architectural design space and to generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall our framework yields designs that improve the performance density and the performance efficiency by up to 6× and 4.49× respectively over existing highly-optimised FPGA, DSP and embedded GPU work.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116441812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minghua Shen, Guojie Luo
FPGAs are increasingly popular as application-specific accelerators because they strike a good balance between flexibility and energy efficiency compared to CPUs and ASICs. However, long routing times impose a barrier on FPGA computing and significantly hinder design productivity. Existing attempts to parallelize FPGA routing either do not fully exploit the available parallelism or suffer excessive quality loss. Massive parallelism using GPUs has the potential to solve this issue but faces non-trivial challenges. To cope with these challenges, this work presents Corolla, a GPU-accelerated FPGA routing method. Corolla enables the GPU-friendly shortest-path algorithm in FPGA routing, leveraging the idea of problem-size reduction by limiting the search to routing subgraphs. We maintain convergence after problem-size reduction using dynamic expansion of the routing-resource subgraphs. In addition, Corolla explores fine-grained single-net parallelism and proposes a hybrid approach that combines static and dynamic parallelism on the GPU. To exploit coarse-grained multi-net parallelism, Corolla proposes an effective method to parallelize multi-net routing while preserving routing results equivalent to those of the original single-net routing. Experimental results show that Corolla achieves an average 18.72x speedup on the GPU with a tolerable loss in routing quality and sustains a scalable speedup on large-scale routing graphs. To our knowledge, this is the first work to demonstrate the effectiveness of GPU-accelerated FPGA routing.
{"title":"Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion","authors":"Minghua Shen, Guojie Luo","doi":"10.1145/3020078.3021732","DOIUrl":"https://doi.org/10.1145/3020078.3021732","url":null,"abstract":"FPGAs are increasingly popular as application-specific accelerators because they lead to a good balance between flexibility and energy efficiency, compared to CPUs and ASICs. However, the long routing time imposes a barrier on FPGA computing, which significantly hinders the design productivity. Existing attempts of parallelizing the FPGA routing either do not fully exploit the parallelism or suffer from an excessive quality loss. Massive parallelism using GPUs has the potential to solve this issue but faces non-trivial challenges. To cope with these challenges, this work presents Corolla, a GPU-accelerated FPGA routing method. Corolla enables applying the GPU-friendly shortest path algorithm in FPGA routing, leveraging the idea of problem size reduction by limiting the search in routing subgraphs. We maintain the convergence after problem size reduction using the dynamic expansion of the routing resource subgraphs. In addition, Corolla explores the fine-grained single-net parallelism and proposes a hybrid approach to combine the static and dynamic parallelism on GPU. To explore the coarse-grained multi-net parallelism, Corolla proposes an effective method to parallelize mutli-net routing while preserving the equivalent routing results as the original single-net routing. Experimental results show that Corolla achieves an average of 18.72x speedup on GPU with a tolerable loss in the routing quality and sustains a scalable speedup on large-scale routing graphs. To our knowledge, this is the first work to demonstrate the effectiveness of GPU-accelerated FPGA routing.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128972776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Machine Learning","authors":"J. Cong","doi":"10.1145/3257184","DOIUrl":"https://doi.org/10.1145/3257184","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"377 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115175089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}