G. Grewal, S. Areibi, Matthew Westrik, Ziad Abuowaimer, Betty Zhao
Many of the key stages in the traditional FPGA CAD flow require substantial amounts of computational effort. Moreover, due to limited overlap among individual stages, poor decisions made in earlier stages often adversely affect the quality of results in later stages. To help address these issues, we propose a machine-learning framework that uses training data to learn the underlying relationship between circuits and the CAD algorithms used to map them onto a particular FPGA device. The framework does not solve the problem at an arbitrary stage in the flow; rather, it seeks to assist the designer or the tool in solving the problem. The potential capabilities of the framework are demonstrated by applying it to the placement stage, where it is used to recommend the best placement flow for circuits with different features and to predict placement and routing results without actually performing placement and routing. Results show that when trained using 372 challenging benchmarks for a Xilinx UltraScale device, the classification models employed in the framework achieve average accuracies in the range of 92% to 95%, while the regression models have average error rates in the range of 0.5% to 3.6%.
{"title":"A Machine Learning Framework for FPGA Placement (Abstract Only)","authors":"G. Grewal, S. Areibi, Matthew Westrik, Ziad Abuowaimer, Betty Zhao","doi":"10.1145/3020078.3021765","DOIUrl":"https://doi.org/10.1145/3020078.3021765","url":null,"abstract":"Many of the key stages in the traditional FPGA CAD flow require substantial amounts of computational effort. Moreover, due to limited overlap among individual stages, poor decisions made in earlier stages will often adversely affect the quality of result in later stages. To help address these issues, we propose a machine-learning framework that uses training data to learn the underlying relationship between circuits and the CAD algorithms used to map them onto a particular FPGA device. The framework does not solve the problem at an arbitrary stage in the flow. Rather, it seeks to assist the designer or the tool to solve the problem. The potential capabilities of the framework are demonstrated by applying it to the placement stage, where it is used to recommend the best placement flow for circuits with different features, and to predict placement and routing results without actually performing placement and routing. Results show that when trained using 372 challenging benchmarks for a Xilinx UltraScale device, the classification models employed in the framework achieve average accuracies in the range 92% to 95%, while the regression models have an average error rate in the range of 0.5% to 3.6%.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128961674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Cong, Zhenman Fang, Muhuan Huang, Libo Wang, Di Wu
To efficiently process tremendous amounts of data, today's big data applications tend to distribute their datasets into multiple partitions, such that each partition fits into memory and can be processed by a separate core/server in parallel. Meanwhile, given the limited scaling of general-purpose CPUs, FPGAs have emerged as an attractive alternative for accelerating big data applications thanks to their low power, high performance, and energy efficiency. In this paper we aim to answer one key question: how should the multicore CPU and the FPGA coordinate to optimize the performance of big data applications? To address this question, we conduct a step-by-step case study of CPU-FPGA co-optimization for in-memory Samtool sorting in genomic data processing, one of the most important big data applications for personalized healthcare. First, to accelerate the time-consuming compression algorithm and its associated cyclic redundancy check (CRC) in Samtool sorting, we implement a portable and maintainable FPGA accelerator using high-level synthesis (HLS). Although FPGAs are traditionally well suited to compression and CRC, we find that a straightforward integration of this FPGA accelerator into the multi-threaded Samtool sorting achieves only a marginal system throughput improvement over the software baseline running on a 12-core CPU. To improve system performance, we propose a dataflow execution model that effectively orchestrates the computation between the multi-threaded CPU and the FPGA. Experimental results show that our co-optimized CPU-FPGA system achieves a 2.6x speedup for in-memory Samtool sorting.
{"title":"CPU-FPGA Co-Optimization for Big Data Applications: A Case Study of In-Memory Samtool Sorting (Abstract Only)","authors":"J. Cong, Zhenman Fang, Muhuan Huang, Libo Wang, Di Wu","doi":"10.1145/3020078.3021787","DOIUrl":"https://doi.org/10.1145/3020078.3021787","url":null,"abstract":"To efficiently process a tremendous amount of data, today's big data applications tend to distribute the datasets into multiple partitions, such that each partition can be fit into memory and be processed by a separate core/server in parallel. Meanwhile, due to the limited scaling of general-purpose CPUs, FPGAs have emerged as an attractive alternative to accelerate big data applications due to their low power, high performance and energy efficiency. In this paper we aim to answer one key question: How should the multicore CPU and FPGA coordinate together to optimize the performance of big data applications? To address the above question, we conduct a step-by-step case study to perform CPU and FPGA co-optimization for in-memory Samtool sorting in genomic data processing, which is one of the most important big data applications for personalized healthcare. First, to accelerate the time-consuming compression algorithm and its associated cyclic redundancy check (CRC) in Samtool sorting, we implement a portable and maintainable FPGA accelerator using high-level synthesis (HLS). Although FPGAs are traditionally well-known to be suitable for compression and CRC, we find that a straightforward integration of this FPGA accelerator into the multi-threaded Samtool sorting only achieves marginal system throughput improvement over the software baseline running on a 12-core CPU. To improve system performance, we propose a dataflow execution model to effectively orchestrate the computation between the multi-threaded CPU and FPGA. Experimental results show that our co-optimized CPU-FPGA system achieves a 2.6x speedup for in-memory Samtool sorting.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"439 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114002022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subho Sankar Banerjee, Mohamed El-Hadedy, Jong Bin Lim, Daniel Chen, Z. Kalbarczyk, Deming Chen, Ravishankar K. Iyer
The proliferation of high-throughput sequencing machines allows billions of short nucleotide fragments to be generated in a short period. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This poster explores the use of hardware acceleration to significantly improve the runtime of short-read alignment (SRA), a crucial step in pre-processing sequenced genomes. It presents the design and implementation of ASAP, an accelerator for computing Levenshtein distance (LD) in the context of the SRA problem. LD computation is a prominent underlying mathematical kernel common to a large number of SRA tools (e.g., BLAST, BWA, SNAP) and is responsible for 50-70% of their runtime. These tools calculate the exact value of LD between nucleotide strings but use it only to build a total ordering (an ordered list) of the most likely points of origin in the genome. ASAP instead computes an approximation of LD by encoding the computation in the propagation delay of circuit elements. This approximation is calculated in an accelerated fashion in hardware and preserves the total ordering of LDs produced by the traditional algorithms. The computation is performed by constructing circuits that embody the recursive definition of LD and measuring the propagation delay of a signal entering and leaving the circuit. Additionally, ASAP can explore large portions of the search space (substrings of the strings being compared) within one clock cycle and ignore parts of the search space that do not contribute to an answer. Our design is implemented on an Altera Stratix V FPGA in an IBM POWER8 system, using the CAPI interface for cache coherence between the CPU and FPGA. Our design is 200x faster (median measurement) than an equivalent C implementation of the kernel running on the host processor, and 2.2x faster as part of an end-to-end alignment tool for 120-150bp short-read sequences.
{"title":"ASAP: Accelerated Short Read Alignment on Programmable Hardware (Abstract Only)","authors":"Subho Sankar Banerjee, Mohamed El-Hadedy, Jong Bin Lim, Daniel Chen, Z. Kalbarczyk, Deming Chen, Ravishankar K. Iyer","doi":"10.1145/3020078.3021796","DOIUrl":"https://doi.org/10.1145/3020078.3021796","url":null,"abstract":"The proliferation of high-throughput sequencing machines allows for the rapid generation of billions of short nucleotide fragments in a short period. This massive amount of sequence data can quickly overwhelm today's storage and compute infrastructure. This poster explores the use of hardware acceleration to significantly improve the runtime of short-read alignment (SRA), a crucial step in pre-processing sequenced genomes. It presents the design and implementation of ASAP, an accelerator for computing Levenshtein distance (LD) in the context of the SRA problem. LD computation is a prominent underlying mathematical kernel that is common to a large number of SRA tools (e.g., BLAST, BWA, SNAP) and is responsible for 50-70% of their runtime. These algorithms mentioned above calculate the exact value of LD between nucleotide strings but only use them to build a total ordering (an ordered list) of the most likely point of origin in the genome. ASAP computes an approximation of LD by encoding computation in propagation delay of circuit elements. This approximation is calculated in an accelerated fashion in hardware and preserves the original total ordering of LDs produced by the traditional algorithms. This computation is performed by constructing circuits that comprise the recursive definition of the LD computation and measuring propagation delay of a signal entering and leaving the circuit. Additionally, ASAP can explore large portions of the search space (substrings of the strings being compared) within one clock cycle, and ignore parts of the search space that does not contribute to an answer. Our design is implemented on an Altera Stratix V FPGA in an IBM POWER8 system using the CAPI interface for cache coherence across the CPU and FPGA. Our design is 200x faster (median measurement) than the equivalent C implementation of the kernel running on the host processor and 2.2x faster for an end-to-end alignment tool for 120-150bp short-read sequences.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"97 3-4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114025403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Yin, Dajiang Liu, Lifeng Sun, Xinhan Lin, Leibo Liu, Shaojun Wei
Data flow graph (DFG) mapping is critical to compilation for spatial programmable architectures, where compilation time is a key factor in both time-to-market requirements and mapping success rate. Inspired by the great progress made on tree-search games using deep neural networks, we propose a framework that learns convolutional neural networks for mapping DFGs onto spatial programmable architectures. Because mapping is a process from a source to a target, we present a dual-input neural network that captures features from both the DFGs of applications and the Process Element Array (PEA) of spatial programmable architectures. To train the neural network, we design algorithms that automatically generate a data set from the PEA intermediate states of preprocessed DFGs. Finally, we demonstrate that the trained neural network identifies mapping quality with high accuracy, and that our proposed mapping approach is competitive in performance with state-of-the-art DFG mapping algorithms while greatly reducing compilation time.
{"title":"Learning Convolutional Neural Networks for Data-Flow Graph Mapping on Spatial Programmable Architectures (Abstract Only)","authors":"S. Yin, Dajiang Liu, Lifeng Sun, Xinhan Lin, Leibo Liu, Shaojun Wei","doi":"10.1145/3020078.3021801","DOIUrl":"https://doi.org/10.1145/3020078.3021801","url":null,"abstract":"Data flow graph (DFG) mapping is critical for the compiling of spatial programmable architecture, where compilation time is a key factor for both time-to-market requirement and mapping successful rate. Inspired from the great progress made in tree search game using deep neural network, we proposed a framework for learning convolutional neural networks for mapping DFGs onto spatial programmable architectures. Considering that mapping is a process from source to target, we present a dual-input neural network capturing features from both DFGs in applications and Process Element Array (PEA) in spatial programmable architectures. In order to train the neural network, algorithms are designed to automatically generate a data set from PEA intermediate states of preprocessed DFG. Finally, we demonstrate that the trained neural network can get high identifying accuracy of mapping quality and our proposed mapping approach are competitive with state-of-the-art DFG mapping algorithms in performance while the compilation time is greatly reduced.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"51 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114130588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nitish Kumar Srivastava, Steve Dai, R. Manohar, Zhiru Zhang
High-level synthesis (HLS) enables designing at a higher level of abstraction to effectively cope with the design complexity of emerging applications on modern programmable systems-on-chip (SoCs). While HLS continues to evolve with a growing set of algorithms, methodologies, and tools to efficiently map software designs onto optimized hardware architectures, realistic benchmark applications with sufficient complexity and enforceable constraints remain lacking. In this paper we present a case study of accelerating face detection based on the Viola-Jones algorithm on a programmable SoC using a C-based HLS flow. We also share our insights in porting a software-based design into a synthesizable implementation with HLS-specific data structures and optimizations. Our design achieves a frame rate of 30 frames per second, which is suitable for real-time applications, and its performance and quality of results are comparable to those of many traditional RTL implementations.
{"title":"Accelerating Face Detection on Programmable SoC Using C-Based Synthesis","authors":"Nitish Kumar Srivastava, Steve Dai, R. Manohar, Zhiru Zhang","doi":"10.1145/3020078.3021753","DOIUrl":"https://doi.org/10.1145/3020078.3021753","url":null,"abstract":"High-level synthesis (HLS) enables designing at a higher level of abstraction to effectively cope with design complexity of emerging applications on modern programmable system-on-chip (SoC). While HLS continues to evolve with a growing set of algorithms, methodologies, and tools to efficiently map software designs onto optimized hardware architectures, there continues to lack realistic benchmark applications with sufficient complexity and enforceable constraints. In this paper we present a case study of accelerating face detection based on the Viola Jones algorithm on a programmable SoC using a C-based HLS flow. We also share our insights in porting a software-based design into a synthesizable implementation with HLS-specific data structures and optimizations. Our design is able to achieve a frame rate of 30 frames per second which is suitable for realtime applications. Our performance and quality of results are comparable to those of many traditional RTL implementations.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121819536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Ling, J. Anderson
Deep learning has garnered significant visibility recently as an Artificial Intelligence (AI) paradigm, with success in wide-ranging applications such as image and speech recognition, natural language understanding, self-driving cars, and game playing (e.g., AlphaGo). This special session is devoted to exploring the potential role of FPGAs in this important, fast-evolving domain.
{"title":"The Role of FPGAs in Deep Learning","authors":"A. Ling, J. Anderson","doi":"10.1145/3020078.3030013","DOIUrl":"https://doi.org/10.1145/3020078.3030013","url":null,"abstract":"Deep learning has garnered significant visibility recently as an Artificial Intelligence (AI) paradigm, with success in wide ranging applications such as image and speech recognition, natural language understanding, self-driving cars, and game playing (e.g., Alpha Go). This special session is devoted to exploring the potential role of FPGAs in this important fast-evolving domain.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122370635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Steve Dai, Ritchie Zhao, Gai Liu, S. Srinath, Udit Gupta, C. Batten, Zhiru Zhang
The current pipelining approach in high-level synthesis (HLS) achieves high performance for applications with regular and statically analyzable memory access patterns. However, it cannot effectively handle infrequent data-dependent structural and data hazards, because they are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution and examine its implications on scheduling and quality of results. We propose to generate an aggressive pipeline at compile time while resolving hazards with memory port arbitration and squash-and-replay at run time. Our experiments targeting a Xilinx FPGA demonstrate promising performance improvements across a suite of representative benchmarks.
{"title":"Dynamic Hazard Resolution for Pipelining Irregular Loops in High-Level Synthesis","authors":"Steve Dai, Ritchie Zhao, Gai Liu, S. Srinath, Udit Gupta, C. Batten, Zhiru Zhang","doi":"10.1145/3020078.3021754","DOIUrl":"https://doi.org/10.1145/3020078.3021754","url":null,"abstract":"Current pipelining approach in high-level synthesis (HLS) achieves high performance for applications with regular and statically analyzable memory access patterns. However, it cannot effectively handle infrequent data-dependent structural and data hazards because they are conservatively assumed to always occur in the synthesized pipeline. To enable high-throughput pipelining of irregular loops, we study the problem of augmenting HLS with application-specific dynamic hazard resolution, and examine its implications on scheduling and quality of results. We propose to generate an aggressive pipeline at compile-time while resolving hazards with memory port arbitration and squash-and-replay at run-time. Our experiments targeting a Xilinx FPGA demonstrate promising performance improvement across a suite of representative benchmarks.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"465 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127428141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stylianos I. Venieris, C. Bouganis
In recent years, Convolutional Neural Networks (ConvNets) have become the state of the art in several Artificial Intelligence tasks. Across the range of applications, performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs provide a platform that can be optimally configured for different performance needs. However, the complexity of ConvNet models keeps increasing, leading to a large design space. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of transformations on the SDF graph in order to efficiently explore the architectural design space. By treating high-throughput and latency-critical systems separately, the presented tool is able to generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall, our framework yields designs that improve performance density and performance efficiency by up to 6× and 4.49×, respectively, over existing highly-optimised FPGA, DSP and embedded GPU work.
{"title":"fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs (Abstract Only)","authors":"Stylianos I. Venieris, C. Bouganis","doi":"10.1145/3020078.3021791","DOIUrl":"https://doi.org/10.1145/3020078.3021791","url":null,"abstract":"In recent years, Convolutional Neural Networks (ConvNets) have become the state-of-the-art in several Artificial Intelligence tasks. Across the range of applications, the performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on the different performance needs. However, the complexity of ConvNet models keeps increasing leading to a large design space. This work presents fpgaConvNet, an end-to-end framework for mapping ConvNets on FPGAs. The proposed framework employs an automated design methodology based on the Synchronous Dataflow (SDF) paradigm and defines a set of transformations on the SDF graph in order to efficiently explore the architectural design space. By treating high-throughput and latency-critical systems separately, the presented tool is able to efficiently explore the architectural design space and to generate hardware designs from high-level ConvNet specifications, explicitly optimised for the performance metric of interest. Overall our framework yields designs that improve the performance density and the performance efficiency by up to 6× and 4.49× respectively over existing highly-optimised FPGA, DSP and embedded GPU work.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116441812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minghua Shen, Guojie Luo
FPGAs are increasingly popular as application-specific accelerators because they strike a good balance between flexibility and energy efficiency compared to CPUs and ASICs. However, long routing times impose a barrier on FPGA computing and significantly hinder design productivity. Existing attempts to parallelize FPGA routing either do not fully exploit the available parallelism or suffer excessive quality loss. Massive parallelism using GPUs has the potential to solve this issue but faces non-trivial challenges. To cope with these challenges, this work presents Corolla, a GPU-accelerated FPGA routing method. Corolla enables the GPU-friendly shortest-path algorithm in FPGA routing, leveraging the idea of problem-size reduction by limiting the search to routing subgraphs. We maintain convergence after problem-size reduction using dynamic expansion of the routing-resource subgraphs. In addition, Corolla explores fine-grained single-net parallelism and proposes a hybrid approach that combines static and dynamic parallelism on the GPU. To exploit coarse-grained multi-net parallelism, Corolla proposes an effective method to parallelize multi-net routing while preserving routing results equivalent to those of the original single-net routing. Experimental results show that Corolla achieves an average 18.72x speedup on the GPU with a tolerable loss in routing quality and sustains a scalable speedup on large-scale routing graphs. To our knowledge, this is the first work to demonstrate the effectiveness of GPU-accelerated FPGA routing.
{"title":"Corolla: GPU-Accelerated FPGA Routing Based on Subgraph Dynamic Expansion","authors":"Minghua Shen, Guojie Luo","doi":"10.1145/3020078.3021732","DOIUrl":"https://doi.org/10.1145/3020078.3021732","url":null,"abstract":"FPGAs are increasingly popular as application-specific accelerators because they lead to a good balance between flexibility and energy efficiency, compared to CPUs and ASICs. However, the long routing time imposes a barrier on FPGA computing, which significantly hinders the design productivity. Existing attempts of parallelizing the FPGA routing either do not fully exploit the parallelism or suffer from an excessive quality loss. Massive parallelism using GPUs has the potential to solve this issue but faces non-trivial challenges. To cope with these challenges, this work presents Corolla, a GPU-accelerated FPGA routing method. Corolla enables applying the GPU-friendly shortest path algorithm in FPGA routing, leveraging the idea of problem size reduction by limiting the search in routing subgraphs. We maintain the convergence after problem size reduction using the dynamic expansion of the routing resource subgraphs. In addition, Corolla explores the fine-grained single-net parallelism and proposes a hybrid approach to combine the static and dynamic parallelism on GPU. To explore the coarse-grained multi-net parallelism, Corolla proposes an effective method to parallelize mutli-net routing while preserving the equivalent routing results as the original single-net routing. Experimental results show that Corolla achieves an average of 18.72x speedup on GPU with a tolerable loss in the routing quality and sustains a scalable speedup on large-scale routing graphs. To our knowledge, this is the first work to demonstrate the effectiveness of GPU-accelerated FPGA routing.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128972776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Machine Learning","authors":"J. Cong","doi":"10.1145/3257184","DOIUrl":"https://doi.org/10.1145/3257184","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"377 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115175089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}